shariqfarooq123 / AdaBins

Official implementation of AdaBins: Depth Estimation using Adaptive Bins
GNU General Public License v3.0
725 stars 156 forks

RuntimeError: The size of tensor a (4096) must match the size of tensor b (500) at non-singleton dimension 2 #67

Open henanjun opened 2 years ago

henanjun commented 2 years ago

I tried to run inference on a new image of size (2048, 2048), and it raises this error.

HugeBob commented 2 years ago

I am getting the same error with images of size 1920x1080, but with the message "The size of tensor a (1980) must match the size of tensor b (500) at non-singleton dimension 2".

shariqfarooq123 commented 1 year ago

This happens because there are only 500 learned positional encodings. If you run inference on an image much larger than the default model resolution, the number of tokens in the transformer exceeds 500 and you get the error above.
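
The token counts in the two error messages above (4096 and 1980) are consistent with the transformer receiving a half-resolution feature map cut into 16x16 patches. A rough sketch of that arithmetic follows. The helper `num_tokens` and the `feature_stride=2` default are assumptions inferred from the reported numbers, not read from the repository code.

```python
MAX_POSITIONAL_ENCODINGS = 500  # number of learned encodings, per this thread


def num_tokens(height, width, feature_stride=2, patch_size=16):
    """Estimate how many patch tokens the transformer receives.

    feature_stride=2 is an assumption: it reproduces the token counts
    reported in this thread, but is not taken from the repository code.
    """
    rows = (height // feature_stride) // patch_size
    cols = (width // feature_stride) // patch_size
    return rows * cols


# (height, width) of the inputs mentioned in this thread
cases = {
    "NYU 640x480": (480, 640),
    "1920x1080": (1080, 1920),
    "2048x2048": (2048, 2048),
}

for name, (h, w) in cases.items():
    n = num_tokens(h, w)
    status = "ok" if n <= MAX_POSITIONAL_ENCODINGS else "too many tokens"
    print(f"{name}: {n} tokens ({status})")
```

Under these assumptions the NYU resolution stays below the limit of 500, while both inputs reported above exceed it.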

Proposed resolutions:

  1. (Recommended) Resize your image down to the model resolution (NYU: 640x480, KITTI: 1241x376), then upsample the result (e.g. with bilinear interpolation) back to your resolution of choice.
  2. Interpolate the positional encodings to the required size.
  3. Manually remove the positional encodings from the architecture and check the result. I have observed that positional encodings don't add much to the performance.
  4. If you have a custom high-resolution depth dataset, fine-tune a new, larger set of positional encodings (>500; total = HW/256, where 256 = 16x16 = patch_size x patch_size).
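
Options 1 and 2 could look roughly like the sketch below, written with PyTorch. Everything named here is hypothetical: `model` is assumed to be a callable that maps an (N, 3, h, w) image tensor to an (N, 1, h', w') depth tensor (the real model may return additional outputs that need unpacking), and `resize_positional_encodings` assumes the encodings are stored as a 2-D (count, embedding_dim) table.

```python
import torch
import torch.nn.functional as F


def infer_at_model_resolution(model, image, model_hw=(480, 640)):
    """Option 1: run `model` at its own resolution, then upsample the depth.

    `model` is a hypothetical callable returning an (N, 1, h', w') tensor;
    adapt this wrapper to the actual inference code.
    """
    full_hw = image.shape[-2:]
    small = F.interpolate(image, size=model_hw, mode="bilinear", align_corners=False)
    with torch.no_grad():
        depth = model(small)
    # Bring the prediction back to the original image size.
    return F.interpolate(depth, size=full_hw, mode="bilinear", align_corners=False)


def resize_positional_encodings(pos_enc, num_tokens):
    """Option 2: linearly interpolate a (P, E) table to (num_tokens, E)."""
    # F.interpolate expects (N, C, L), so the embedding dim acts as channels.
    resized = F.interpolate(
        pos_enc.T.unsqueeze(0), size=num_tokens, mode="linear", align_corners=True
    )
    return resized.squeeze(0).T


if __name__ == "__main__":
    # Stand-in for a depth network, used only to exercise the wrapper.
    dummy_model = lambda x: x.mean(dim=1, keepdim=True)
    image = torch.rand(1, 3, 1080, 1920)
    print(infer_at_model_resolution(dummy_model, image).shape)

    table = torch.rand(500, 128)  # 128 is an arbitrary embedding size
    print(resize_positional_encodings(table, 1980).shape)
```

Note that the interpolation in `resize_positional_encodings` treats the tokens as a 1-D sequence and ignores their 2-D patch layout, so the interpolated encodings may not match what the model learned. Any result obtained this way should be checked against option 1.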

zydmtaichi commented 1 month ago

Hi @shariqfarooq123, could you please share more details about the third resolution? I want to keep the image resolution, but I am still confused about how to remove the positional encodings from the repo's inference program.