Describe the bug
When loading more image data than there is memory (system RAM, not VRAM) available, Nerfstudio hangs indefinitely rather than throwing an OOM error.
Load drops from high to zero, all the threads die off, and the process just sits there.
(As a side note: it would be nice if the Nerfacto dataloader could be merged with the Splatfacto one, which is far more memory-efficient.)
To Reproduce
Steps to reproduce the behavior:
Load a dataset of around 2,000 4K images, with no downscaled copies, into nerfacto(-huge).
Watch memory usage climb until RAM is full, then suddenly drop.
Observe Nerfstudio hanging forever on "Loading data batch".
Expected behavior
Nerfstudio should display an error and exit with a non-zero error code. Aside from being more logical and informative, this would also allow external sequential batch-processing pipelines to keep running.
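One way to get that fail-fast behavior would be a rough pre-flight check of how much RAM the full-resolution image cache will need. The sketch below is purely illustrative and not Nerfstudio's actual loader code; the function name, the use of psutil, and the float32-per-channel estimate are my assumptions.

```python
import sys
import psutil
from PIL import Image

def check_cache_fits(image_paths):
    """Fail fast if caching every image as float32 would exceed available RAM."""
    total_bytes = 0
    for path in image_paths:
        with Image.open(path) as img:  # reads only the header, not the pixel data
            width, height = img.size
            channels = len(img.getbands())
        total_bytes += width * height * channels * 4  # one float32 per channel
    available = psutil.virtual_memory().available
    if total_bytes > available:
        sys.exit(
            f"Image cache needs ~{total_bytes / 1e9:.1f} GB but only "
            f"{available / 1e9:.1f} GB of system RAM is free; "
            "downscale the images or cache fewer of them."
        )
```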
Screenshots
Post-hang:
Additional context
It sounds like a child thread is terminating due to allocation failures, but the error is not being caught or propagated correctly.
I haven't tested this with the Splatfacto dataloader, as I have yet to find a way to exhaust my 100 GB of RAM with Splatfacto.
Nerfacto's dataloader uses far more memory (and briefly peaks at roughly twice the memory it actually needs at some point during the loading process).
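If the allocation failure happens in a worker process (or the OS OOM killer takes the worker out before a MemoryError can be raised), the main process also needs to notice that the worker is gone instead of blocking on its queue forever. Below is a minimal parent-side sketch of that check; the names are hypothetical and this is not taken from the actual dataloader.

```python
import multiprocessing as mp
import queue
import sys

def wait_for_batch(result_queue, worker: mp.Process, timeout: float = 5.0):
    """Poll the result queue, but abort cleanly if the worker has died (e.g. OOM-killed)."""
    while True:
        try:
            return result_queue.get(timeout=timeout)
        except queue.Empty:
            if not worker.is_alive():
                # A negative exitcode means the worker was killed by a signal;
                # -9 (SIGKILL) is what the Linux OOM killer sends.
                sys.exit(
                    f"Data-loading worker died with exit code {worker.exitcode}; "
                    "the machine likely ran out of system RAM."
                )
```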
Relevant: @AntonioMacaronio has been working very hard on this in #3216. There are a ton of tradeoffs to consider and it's super involved, but it does feel like the PR is close!