heyoeyo / muggled_dpt

Muggled DPT: Depth estimation without the magic
Apache License 2.0

Any consideration to support the Metric Depth models of DepthAnything? #3

Closed by mosaabadil 4 months ago

mosaabadil commented 4 months ago

Hi, I'd like to first thank you for the work you did. I was struggling to get proper live-camera inference running at a sufficient fps, but your work made it much faster. I successfully tried out the ViT-Small model, but when I attempted to use the Indoor_Metric_Depth model (from here) I got this error: "NotImplementedError: Bad model type: unknown, no support for this yet!"

Is there work being done to add support to the metric depth models?

heyoeyo commented 4 months ago

Hi @mosaabadil, thanks for checking out the repo!

Unfortunately for the foreseeable future, I don't have any plans to add support for the metric depth models. At the very least, I'd like to wait for a possible update from the depth-anything implementation (mentioned here) in case there are some major code changes that would need to be accounted for/revised.

In the meantime, I think it may be possible for you to get faster inference speeds with some (very minor) modifications to the depth-anything implementation. All of the speed increases in this repo come from 3 sources:

  1. Using float16 or bfloat16 instead of the default float32
  2. Using xformers (which works best with float16)
  3. Caching the positional encodings of the image encoder

The first point in particular makes a really big difference and is easy to add. If you're already using the depth-anything metric-depth code base, you'll find a number of places where the model/image data is placed on the GPU using just .cuda() (for example within the evaluate.py script, the main function and evaluate function both do this).

If you replace every .cuda() call with .to("cuda", dtype=torch.float16), you should find that the model runs ~2x faster without a significant change to the output predictions. If you do have issues with the output (e.g. if it turns into a fully black screen), you can try using torch.bfloat16 instead.
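
As a rough sketch of what that change looks like (the model and image below are just stand-ins, not the actual depth-anything objects):

```python
import torch

# Hypothetical stand-ins for the depth-anything model and a preprocessed image
# (in the real scripts these come from the model builder and data loader)
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
image = torch.rand(1, 3, 518, 518)

# Original style: .cuda() leaves everything in the default float32
# model, image = model.cuda(), image.cuda()

# Modified: cast to float16 while moving to the GPU (~2x faster inference)
model = model.to("cuda", dtype=torch.float16)
image = image.to("cuda", dtype=torch.float16)

with torch.inference_mode():
    prediction = model(image)

# If the outputs break (e.g. an all-black depth map), try bfloat16 instead:
# model = model.to("cuda", dtype=torch.bfloat16)
# image = image.to("cuda", dtype=torch.bfloat16)
```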

If you have xformers installed (using pip install xformers==0.0.25 for example), you may get an additional 10-20% speed up (more so at higher input image resolutions) when switching to float16.
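
For context, the xformers gain comes from its memory-efficient attention op, which (as far as I know) the DINOv2 encoder picks up automatically when the library is installed. Purely as an illustration of the underlying call (the shapes here are made up to roughly match a ViT-Small encoder at a 518x518 input, they aren't pulled from the depth-anything code):

```python
import torch
from xformers.ops import memory_efficient_attention

# 1 image, 1370 tokens (37x37 patches + 1 cls token), 6 heads, 64 channels per head
q = torch.rand(1, 1370, 6, 64, device="cuda", dtype=torch.float16)
k = torch.rand(1, 1370, 6, 64, device="cuda", dtype=torch.float16)
v = torch.rand(1, 1370, 6, 64, device="cuda", dtype=torch.float16)

# Drop-in replacement for standard scaled-dot-product attention,
# which is where most of the xformers speed-up comes from
out = memory_efficient_attention(q, k, v)
print(out.shape)  # (1, 1370, 6, 64)
```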

The last point (caching) requires more involved changes to the code base. However, it doesn't provide much of a benefit to the depth-anything models (compared to the older MiDaS models, which benefit greatly from it), so it's probably not worth bothering with.
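
That said, if you're curious, the basic idea is just to avoid re-interpolating the positional encodings on every frame when the input size doesn't change (which is always the case for a live camera feed). A hypothetical sketch of that, not taken from either code base:

```python
import torch


class PosEncCache:
    """Hypothetical cache for interpolated positional encodings.
    Re-uses the resized result whenever the patch grid size repeats."""

    def __init__(self, base_pos_embed: torch.Tensor, base_grid_hw: tuple):
        self._base = base_pos_embed    # shape: (1, N+1, C), as stored in the pretrained model
        self._base_hw = base_grid_hw   # e.g. (37, 37) patch grid
        self._cache = {}

    def get(self, grid_hw: tuple) -> torch.Tensor:
        if grid_hw not in self._cache:
            cls_tok, patch_toks = self._base[:, :1], self._base[:, 1:]
            c = patch_toks.shape[-1]
            # Reshape to a 2D grid, resize to the new grid, then flatten back
            patch_2d = patch_toks.reshape(1, *self._base_hw, c).permute(0, 3, 1, 2)
            resized = torch.nn.functional.interpolate(
                patch_2d, size=grid_hw, mode="bicubic", align_corners=False
            )
            resized = resized.permute(0, 2, 3, 1).reshape(1, -1, c)
            self._cache[grid_hw] = torch.cat((cls_tok, resized), dim=1)
        return self._cache[grid_hw]


# Example usage with random stand-in weights (ViT-Small sized: 384 channels):
cache = PosEncCache(torch.rand(1, 37 * 37 + 1, 384), (37, 37))
pos = cache.get((28, 42))  # interpolated once, then re-used on repeated calls
```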

mosaabadil commented 4 months ago

I see, makes sense. Thanks for the input and tips!