Running on large images

Hi,

I want to run inference with the MiDaS model (DPT-Large) on large images (2k, 4k, etc.), but I run out of GPU memory just below 2k resolution.

For a CNN, my solution would be to run the model on smaller patches and then assemble the full-size output from them. To avoid stitching artifacts, I would feed the model the full receptive field of each output patch, i.e. the patch plus a margin of surrounding context, and keep only the central region of each prediction.
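
To make the CNN idea concrete, here is a rough sketch of the kind of tiling I have in mind. It is not taken from the DPT code: `tiled_apply`, `tile`, and `halo` are hypothetical names, and `model` is assumed to be any callable that maps a 2D (or H x W x C) array to a 2D map of the same height and width. The result only matches full-image inference if `halo` covers the model's receptive field.

```python
import numpy as np

def tiled_apply(model, image, tile=256, halo=16):
    """Run `model` tile by tile, giving each tile `halo` pixels of extra context.

    Assumption: `model(patch)` returns a 2D array with the same height and
    width as `patch`. `halo` must be smaller than both image dimensions.
    """
    h, w = image.shape[:2]
    # Reflect-pad the image so that border tiles also get a full halo.
    pad = [(halo, halo), (halo, halo)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad, mode="reflect")
    out = np.empty((h, w), dtype=np.float32)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            th, tw = min(tile, h - y), min(tile, w - x)
            # Input window: the output tile plus the halo on every side.
            patch = padded[y:y + th + 2 * halo, x:x + tw + 2 * halo]
            pred = model(patch)
            # Keep only the central region, which saw its full context.
            out[y:y + th, x:x + tw] = pred[halo:halo + th, halo:halo + tw]
    return out
```

Peak memory then scales with `tile + 2 * halo` rather than with the full image size, which is the property I am hoping to get for DPT as well.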

It's not clear to me whether this is possible with the transformer architecture. Does each output pixel have a cleanly defined receptive field of input pixels?
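
The reason I am unsure can be shown with a toy example. The sketch below is a single-head self-attention layer written from scratch in NumPy, not DPT's actual code; the function name, sizes, and random weights are all made up for illustration. It perturbs one input token and checks which output tokens change.

```python
import numpy as np

def self_attention(tokens, wq, wk, wv):
    """Toy single-head self-attention over a (n_tokens, dim) array."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8  # hypothetical sizes, e.g. a 4 x 4 grid of patch tokens
tokens = rng.normal(size=(n_tokens, dim))
wq, wk, wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

base = self_attention(tokens, wq, wk, wv)

# Perturb only the first token and see which outputs are affected.
perturbed = tokens.copy()
perturbed[0] += 1.0
diff = np.abs(self_attention(perturbed, wq, wk, wv) - base).max(axis=-1)
changed = diff > 1e-9
print(changed.sum(), "of", n_tokens, "output tokens changed")
```

In this toy layer every output token attends to every input token, so there is no local window to pad around. If the DPT backbone behaves the same way, cropping a tile would change the result everywhere inside it, not just near its border.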

If not, is there an approach you would recommend for running the model on large images?

Thank you!

isl-org / DPT #71