naver / r2d2


Questions about the time cost of R2D2 feature extraction #9

Closed Jemmagu closed 4 years ago

Jemmagu commented 4 years ago

Hi @jerome-revaud,

  1. I tried extracting R2D2 features using the default multiscale settings, and for a 640×480 image it takes about 200 ms (about 4 scales). So I tried a single scale for the same image, and it takes about 70 ms. Is this time cost normal? And can I speed up the extraction process?

  2. I was surprised to find that extracting 2.5k, 5k, and 10k keypoints takes the same amount of time. Is this normal?

Thanks in advance! Really hope to get your reply!

MuyanXiao commented 4 years ago

Hi @Jemmagu, I'm also doing some measurements of feature extraction time. May I ask what machine you are using? I think I can answer your second question. From the code, I see that it always extracts all features and then sorts them by score. The number of features you specify only determines how many features from the sorted list are saved to the file, so the time is the same for any number. A minimal sketch of that behaviour is shown below.
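
For illustration, here is a minimal sketch of what happens conceptually (the variable and function names are hypothetical, not the actual ones in extract.py):

import torch

# Stand-ins for the network outputs: one score per detected keypoint.
scores = torch.rand(30000)        # confidence of every extracted keypoint
keypoints = torch.rand(30000, 3)  # (x, y, scale) of every keypoint

def keep_top_k(keypoints, scores, k):
    # The network always scores every location; the requested number of
    # keypoints only controls how many of the sorted detections are kept.
    idx = scores.argsort(descending=True)[:k]
    return keypoints[idx], scores[idx]

# Extraction cost is identical whether we keep 2.5k, 5k, or 10k points:
for k in (2500, 5000, 10000):
    kpts, s = keep_top_k(keypoints, scores, k)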

jerome-revaud commented 4 years ago

Hi guys @Jemmagu @MuyanXiao. For the 2nd question, yes, it's as @MuyanXiao said. For the first question, it essentially depends on the GPU you are using. In my experience, there can be a factor of 10 in speed depending on the GPU. It appears that some GPUs are much faster with dilated convolutions, though I'm not sure why.
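
If you want to check how your own GPU handles dilated convolutions, a rough micro-benchmark along these lines can help (an illustrative sketch, not part of the repo; the layer shape is arbitrary and not the R2D2 architecture):

import time
import torch

# Time a 3x3 convolution at several dilation rates on the GPU.
x = torch.randn(1, 64, 480, 640, device='cuda')
for dilation in (1, 2, 4, 8):
    conv = torch.nn.Conv2d(64, 64, 3, padding=dilation, dilation=dilation).cuda()
    with torch.no_grad():
        conv(x)  # warm-up
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(10):
            conv(x)
        torch.cuda.synchronize()
    print(f'dilation={dilation}: {(time.time() - t0) / 10 * 1000:.1f} ms')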

Jemmagu commented 4 years ago

Hi @jerome-revaud, @MuyanXiao. Thanks for your reply! I'm using a Tesla P100 PCIe 16GB, running with a single GPU.

@jerome-revaud

  1. Have you measured the time cost of R2D2 feature extraction?
  2. You mentioned that some GPUs are much faster with dilated convolutions, so your implementation uses dilated convs, right? If so, what kind of GPU would be faster with dilated convs?

Many thanks!

jerome-revaud commented 4 years ago

Hi @Jemmagu

  1. Yes, it's in the paper, but I only reported numbers for a single GPU. In practice, processing a 1M-pixel image on a Tesla P100-SXM2 GPU takes about 0.5 s to extract keypoints at a single scale (full image) and 1 s for all scales.

  2. There are dilated convolutions, yes. Now, I have no idea which GPUs would be faster (more recent ones, I suppose?).

Jemmagu commented 4 years ago

Thanks, @jerome-revaud!

So do you have any suggestions for speeding up the keypoint extraction process? As I said before, it takes about 70 ms to process a 640×480 image on a single Tesla P100 GPU at a single scale (and about 200 ms at all scales). I want to speed this up (even a slight drop in accuracy is acceptable if it gets much faster). Do you have any suggestions?

Thanks a lot!!

jerome-revaud commented 4 years ago

Yes, it's possible. Basically, all you have to do is reduce the dilations in the first layers of the backbone network (i.e. L2-Net).

For instance, if you divide the dilations by 2 in the first 2 convolution layers, the output will be 4 times smaller in each dimension (so 16x smaller in memory), and it will be much, much faster (~16x). This is because each of those layers then truly downsamples by 2 instead of preserving resolution via dilation, and the two reductions compound: 4x per dimension, hence 16x fewer activations in total. Of course, you will lose precision in terms of keypoint location.

So basically, in patchnet.py:

class Quad_L2Net (PatchNet):
    """ Same as L2_Net, but replace the final 8x8 conv by 3 successive 2x2 convs.
    """
    def __init__(self, dim=128, mchan=4, relu22=False, **kw ):
        PatchNet.__init__(self, **kw)
        self.dilated = False # temporarily disable the dilation trick
        self._add_conv(  8*mchan)
        self._add_conv(  8*mchan)
        self._add_conv( 16*mchan, stride=2) # true stride: resolution divided by 2
        self._add_conv( 16*mchan)
        self._add_conv( 32*mchan, stride=2) # true stride: resolution divided by 4 overall
        self.dilated = True # re-enable the dilation trick
        self._add_conv( 32*mchan)
        # replace last 8x8 convolution with 3 2x2 convolutions
        self._add_conv( 32*mchan, k=2, stride=2, relu=relu22)
        self._add_conv( 32*mchan, k=2, stride=2, relu=relu22)
        self._add_conv(dim, k=2, stride=2, bn=False, relu=False)
        self.out_dim = dim

Also, don't forget to multiply the (x, y) keypoint locations by 4 in extract.py, since the output is now 4 times smaller than the input. Disclaimer: I didn't test this, but it should be close enough to a solution.
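
For example (a hedged sketch; the actual variable names in extract.py may differ), right after the keypoints are gathered:

# Assumes `xys` holds one (x, y, scale) row per keypoint, as in extract.py.
# With the modified network the feature map is 4x smaller than the input,
# so map the coordinates back to image space before saving.
xys[:, 0] *= 4  # x
xys[:, 1] *= 4  # y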