isl-org / DPT

Dense Prediction Transformers
MIT License

Running on large images #71

Open carsonswope opened 2 years ago

carsonswope commented 2 years ago

Hi,

I want to run inference with the MiDaS model (DPT-large) on large images (2k, 4k, etc.). I run out of GPU memory just before reaching the 2k image size.

For a CNN, my solution would be to run the model on smaller patches and then assemble the larger image from those patches. To avoid artifacts from stitching the patches back together, I would run the model on the full receptive area of each output patch, i.e. the patch plus the surrounding input pixels that influence it.
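
For the CNN case, the patch-based idea can be sketched roughly as below. This is a minimal illustration, not DPT or MiDaS code: `tiled_predict`, `model_fn`, `tile`, and `halo` are hypothetical names, `model_fn` stands in for any per-patch predictor, and the sketch assumes `halo` is at least as large as the receptive-field radius.

```python
import numpy as np

def tiled_predict(image, model_fn, tile=512, halo=64):
    """Run model_fn on overlapping patches and stitch the central regions.

    image:    H x W x C array
    model_fn: hypothetical callable mapping an h x w x C patch to an h x w map
    tile:     side length of each stitched output patch
    halo:     extra input context on every side of a patch (assumption:
              at least the receptive-field radius of the model)
    """
    h, w = image.shape[:2]
    # Extra padding so the tile grid covers the whole image.
    pad_h = (-h) % tile
    pad_w = (-w) % tile
    # Reflect-pad so that border patches also get a full halo.
    padded = np.pad(
        image,
        ((halo, halo + pad_h), (halo, halo + pad_w), (0, 0)),
        mode="reflect",
    )
    out = np.zeros((h + pad_h, w + pad_w), dtype=np.float32)
    for y in range(0, h + pad_h, tile):
        for x in range(0, w + pad_w, tile):
            # Input patch = output tile plus halo on all four sides.
            patch = padded[y:y + tile + 2 * halo, x:x + tile + 2 * halo]
            pred = model_fn(patch)
            # Keep only the centre, whose context was fully available.
            out[y:y + tile, x:x + tile] = pred[halo:halo + tile, halo:halo + tile]
    # Drop the grid padding again.
    return out[:h, :w]
```

Peak memory then depends on `tile + 2 * halo` rather than on the full image size. Whether the stitched result is seam-free depends on the model really having a bounded receptive field, which is the open point for the transformer.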

It's not clear to me whether it's possible to do that with the transformer architecture. Does each output pixel have a cleanly defined 'receptive area' of input pixels?

Or if not, would you have any recommended approach for running the model on large images?

Thank you!