nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0
9.55k stars 1.3k forks

Nerfacto dataloader hangs infinitely on OOM instead of crashing during data loading #3424

Open SharkWipf opened 2 months ago

SharkWipf commented 2 months ago

Describe the bug

When loading more image data than there is memory available (system RAM, not VRAM), Nerfstudio hangs indefinitely without an error, rather than throwing an OOM error. Load drops from high to zero, all the threads die off, and the process just sits there. (As a side note: it would be nice if the Nerfacto dataloader could be merged with the Splatfacto one, which is far more memory efficient.)

To Reproduce

Steps to reproduce the behavior:

  1. Load a dataset of 2000 or so 4K images with no downscales into nerfacto(-huge).
  2. Watch memory usage go up until full and suddenly drop.
  3. Observe Nerfstudio hanging forever on "Loading data batch".
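For a rough sense of scale, the footprint of a dataset like the one in step 1 can be estimated with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes "4k" means 3840x2160 RGB images and that every decoded image is held in RAM at once. The helper name `dataset_bytes` is made up for illustration, and whether the loader actually stores pixels as 8-bit integers or 32-bit floats is not something this issue confirms, so both cases are shown.

```python
GIB = 1024 ** 3


def dataset_bytes(num_images, width, height, channels=3, bytes_per_channel=1):
    """Raw size in bytes of all decoded images if they are held in memory together."""
    return num_images * width * height * channels * bytes_per_channel


# Assumed values: 2000 images, 3840x2160, RGB (not taken from the issue verbatim).
uint8_total = dataset_bytes(2000, 3840, 2160, bytes_per_channel=1)
float32_total = dataset_bytes(2000, 3840, 2160, bytes_per_channel=4)

print(f"uint8:   {uint8_total / GIB:.1f} GiB")    # uint8:   46.3 GiB
print(f"float32: {float32_total / GIB:.1f} GiB")  # float32: 185.4 GiB
```

Under these assumptions, even the 8-bit case takes a large share of a 100GB machine, and any temporary copy made while loading would push it over.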

Expected behavior

Nerfstudio should display an error and exit with a non-zero exit code. Aside from being more logical and informative, this would also allow external sequential batch-processing pipelines to carry on with the next job.
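Until the loader fails loudly, a batch pipeline can only protect itself from the hang with a timeout. The following is a minimal sketch of that workaround, not part of Nerfstudio: `run_job` is a hypothetical helper, and the three commands are stand-ins for real training invocations (they just start a Python interpreter that succeeds, exits with a code, or sleeps).

```python
import subprocess
import sys


def run_job(cmd, timeout_s):
    """Run one job and return its exit code, or None if it exceeded timeout_s.

    subprocess.run kills the child process when the timeout expires,
    so a hung job cannot block the jobs that follow it.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Hypothetical stand-ins for training commands.
jobs = {
    "succeeds": [sys.executable, "-c", "pass"],
    "crashes": [sys.executable, "-c", "import sys; sys.exit(3)"],
    "hangs": [sys.executable, "-c", "import time; time.sleep(60)"],
}

results = {name: run_job(cmd, timeout_s=3.0) for name, cmd in jobs.items()}
print(results)
```

The drawback is that a timeout has to be picked generously enough for legitimate long runs, which is why a proper error and exit code from the loader itself would be preferable.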

Screenshots

Post-hang: image

Additional context

It sounds like a child thread is terminating due to allocation errors, but the failure is not being caught correctly. I haven't tested this with the Splatfacto dataloader, as I have yet to find a way to run out of my 100GB RAM with Splatfacto. Nerfacto's dataloader uses far more memory (and briefly peaks at twice the memory it needs at some point during the loading process).
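To illustrate the suspected failure mode in general terms: if a worker thread dies on an allocation error and the consumer is blocked waiting on a queue, the consumer waits forever. A common remedy is to have the worker forward its exception through the queue and to give the consumer a timeout as a backstop. The sketch below is generic Python, not Nerfstudio's actual loader; `load_batch`, `worker`, and `next_batch` are hypothetical names, and the OOM is simulated by raising `MemoryError`.

```python
import queue
import threading


def load_batch(simulate_oom):
    """Stand-in for real image loading; raises MemoryError to mimic an OOM."""
    if simulate_oom:
        raise MemoryError("simulated allocation failure")
    return [0.0] * 4


def worker(out_queue, simulate_oom):
    # Forward any failure to the consumer instead of dying silently.
    try:
        out_queue.put(("ok", load_batch(simulate_oom)))
    except BaseException as exc:
        out_queue.put(("error", exc))


def next_batch(simulate_oom, timeout_s=5.0):
    """Fetch one batch, raising instead of blocking forever if the worker fails."""
    out_queue = queue.Queue()
    thread = threading.Thread(
        target=worker, args=(out_queue, simulate_oom), daemon=True
    )
    thread.start()
    try:
        status, payload = out_queue.get(timeout=timeout_s)
    except queue.Empty:
        # Backstop for workers that die without being able to report anything.
        raise RuntimeError("data worker produced nothing before the timeout") from None
    if status == "error":
        raise RuntimeError("data worker failed") from payload
    return payload


print(next_batch(simulate_oom=False))
```

Calling `next_batch(simulate_oom=True)` raises a `RuntimeError` whose cause is the original `MemoryError`. A worker that is killed outright by the operating system cannot report anything, which is the case the timeout is there to cover.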

brentyi commented 2 months ago

Relevant: @AntonioMacaronio has been working very hard on this in #3216. There are a ton of tradeoffs to consider and it's super involved, but it does feel like the PR is close!