Is boxes_data based on input image coordinates rather than feature map coordinates?

In theory, I believe the layers can be applied in either order. However, if you are using RoI pooling for Faster R-CNN or a similar detector, it is far more computationally efficient to run RoIAlign after the convolutions, because the convolutional layers are the most expensive part of the network. If each region of interest had to pass through those layers independently, the cost would grow with the number of proposals, and Faster R-CNN generally produces thousands of them. By contrast, if you apply RoIAlign afterwards, you only have to compute the convolutional features once (albeit on the full image, which is larger than any single region).
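If RoIAlign runs after the convolutions, boxes given in input image coordinates have to be expressed in feature map coordinates at some point, commonly by multiplying them by the reciprocal of the backbone stride. Here is a minimal sketch of that conversion in plain Python. The helper name `scale_boxes`, the `(x1, y1, x2, y2)` box layout, and the stride of 16 are assumptions for illustration; I have not checked whether this repository expects the caller to do this scaling or handles it internally.

```python
def scale_boxes(boxes, spatial_scale):
    """Map (x1, y1, x2, y2) boxes from input-image pixels to feature-map units.

    Hypothetical helper for illustration only; not part of any library API.
    """
    return [tuple(coord * spatial_scale for coord in box) for box in boxes]


# Assumed example values: two boxes in input-image pixel coordinates.
image_boxes = [(32.0, 64.0, 160.0, 256.0), (0.0, 0.0, 96.0, 48.0)]

# Assumed backbone stride: the feature map is 16x smaller than the input.
stride = 16

feature_boxes = scale_boxes(image_boxes, 1.0 / stride)
print(feature_boxes)
```

With this approach the backbone runs once per image, and only the cheap box scaling and the RoIAlign sampling are repeated per proposal.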

longcw / RoIAlign.pytorch
