kakaxi314 / BP-Net

Implementation of our paper 'Bilateral Propagation Network for Depth Completion'
MIT License

General Question about Generalization #3

Open eriksandstroem opened 4 months ago

eriksandstroem commented 4 months ago

Hi, thanks for making this awesome work open source!

I would like to ask for your advice on something. I have input RGB and sparse depth, which typically look like this (from ScanNet). Here the depth is not from a depth sensor but computed with another technique. Do you think your model, presumably the one trained on the indoor NYU dataset, could complete these depth maps?

I will test it nevertheless, but I would be very grateful for any advice you may have! I suspect one issue may be that the sparse depth maps you trained on were sampled fairly uniformly, while mine are not necessarily uniform. I do have access to depth at the other pixels, but it is not reliable, so I would like to replace it with depth from your method if possible.

image

Best, Erik

kakaxi314 commented 4 months ago

Thanks for your interest in our method. I remember other researchers have verified that current depth completion methods do not generalize very well across different sampling patterns (sorry, I forget which paper). So our method may not perform very well on your examples, because the sparse depth pattern differs from our training data. Still, I believe it should produce a reasonable dense depth map, since the proposed BP module works at the pre-processing stage precisely to alleviate this kind of issue. Finally, adding appropriate data augmentation during training might improve performance on your examples.

eriksandstroem commented 4 months ago

Thanks for your reply!

I checked the paper for the runtime but could not fully follow the evaluation. How long does prediction take on average for a frame of size, e.g., 320x640?

As I understand it, the input side lengths need to be multiples of 32. I am using images of size 320x640 as well as 240x320 and 384x512. All of these dimensions except 240 are divisible by 32, so I suppose I need to pad the 240x320 images to 256x320, as you describe in the paper. As I understand it, you use edge padding for the RGB and constant padding for the input depth. I see that you only evaluate on a center crop in the experiments, but I suppose the method works without this center cropping, i.e., if the input resolution is already divisible by 32, I should not need any extra padding, right?
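
Concretely, this is roughly what I have in mind for the padding (a minimal sketch on my side using torch.nn.functional.pad, not the repo's actual preprocessing code):

```python
import torch.nn.functional as F

def pad_to_multiple(rgb, sparse_depth, multiple=32):
    """Pad (B, C, H, W) tensors so H and W become multiples of `multiple`."""
    _, _, h, w = rgb.shape
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    pad = (0, pad_w, 0, pad_h)  # (left, right, top, bottom)
    rgb_p = F.pad(rgb, pad, mode='replicate')                     # edge padding for RGB
    depth_p = F.pad(sparse_depth, pad, mode='constant', value=0)  # zeros = no sample
    return rgb_p, depth_p, (h, w)  # keep the original size to crop the prediction back

# e.g. a 240x320 input becomes 256x320; I would crop the output back to 240x320 afterwards.
```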

One more thing: I did some tests, and the system seems sensitive to the scale of the depth. For example, when I divided the input depth by 10 during testing, I saw a degradation in performance. I suppose it is a good idea to rescale the input depth so that the mean is around at least 3 meters for indoor room-sized scenes? If you have any insights on this, that would be great! I see that the NYU dataset depth is multiplied by 10, for example.
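
For reference, this is the kind of rescaling I have in mind (just a sketch on my side; the 3.0 m target is my own guess for indoor scenes, not a value from the paper):

```python
def rescale_sparse_depth(sparse_depth, target_mean=3.0):
    """Rescale sparse depth so the valid pixels average ~target_mean meters."""
    valid = sparse_depth > 0
    scale = target_mean / sparse_depth[valid].mean().clamp(min=1e-6)
    # the dense prediction can be mapped back to the original scale with `prediction / scale`
    return sparse_depth * scale, scale
```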

eriksandstroem commented 4 months ago

Update: I tested simply cropping away some of the GT input depth to see how the model handles less uniform samples. It turns out that it does not work so well, as we were expecting. Do you think it is worthwhile to retrain the network with a different input depth distribution, or is more work needed to make this scenario work, e.g., modifications to the design of the system? That would likely be too much of a commitment for me at the moment, so I would be very grateful for any intuition you have on this.
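
For clarity, my test was roughly the following (a sketch; the region coordinates are arbitrary):

```python
def drop_region(sparse_depth, top, left, height, width):
    """Zero out a rectangular region of the sparse depth to make the sampling non-uniform."""
    out = sparse_depth.clone()
    out[..., top:top + height, left:left + width] = 0.0
    return out
```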

Cheers, Erik

Screenshot 2024-04-15 7 18 41 PM

kakaxi314 commented 4 months ago

Sorry for the late reply. Here are my thoughts on your questions:

  1. We report the runtime on the processed NYUv2 data in Table 3 of the main paper. Before benchmarking the runtime, we use torch.compile to pre-compile the network. You can benchmark on your own device for images of whatever resolution you want (a minimal timing sketch is given after this list).
  2. Yes, if the input resolution is already divisible by 32, you don't need to pad it.
  3. Your idea of rescaling input depth may be useful for the scale sensitivity issue.
  4. I don't know whether retraining will yield satisfactory results, but it should be better than using the current weights as-is.
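
As mentioned in point 1, something along these lines should work for timing (a sketch; the call signature `model(rgb, sparse_depth)` is a placeholder for however you run BP-Net, while torch.compile and CUDA-event timing are standard PyTorch):

```python
import torch

def benchmark(model, rgb, sparse_depth, warmup=10, iters=100):
    """Return the average forward-pass time in milliseconds on the GPU.

    Assumes the model and the input tensors are already on the GPU.
    """
    model = torch.compile(model)  # pre-compile before timing
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):   # warm-up also triggers the actual compilation
            model(rgb, sparse_depth)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            model(rgb, sparse_depth)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per frame
```
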
JinhwiPark commented 3 months ago

Hello,

I'm currently focusing on improving the generalization capabilities of depth completion methods to handle various sparse sampling patterns like the ones you've encountered. My latest research, which I will be presenting at CVPR24, specifically addresses these challenges. The paper introduces a method designed to robustly adapt to different sparse depth configurations, which might be particularly beneficial for your scenario. You can access the paper at this URL: https://arxiv.org/abs/2405.11867. Additionally, I'll upload the code to my GitHub soon, which may provide a practical solution for your experimentation.

Thanks.