pablovela5620 closed this issue 1 year ago.
Looks like setting --num_workers 0 fixes the issue for me, though I still don't fully understand why.
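For context on why that flag helps: with num_workers set to 0, the PyTorch DataLoader loads batches in the main process instead of in forked worker processes, so it never touches the shared-memory segments (/dev/shm) those workers use to hand tensors back. A minimal sketch of the workaround, where the script name test.py is an assumption:

```sh
# Hypothetical invocation; the script name (and any other flags the
# repo expects) are assumptions. --num_workers 0 makes the DataLoader
# load data in the main process, bypassing /dev/shm entirely, at the
# cost of slower data loading.
python test.py --num_workers 0
```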
Sounds like this is a shared-memory limitation on your system.
Fortunately, this should be an easy fix if you increase your system's shared memory limit: https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396
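As a quick sketch for a bare-metal Linux host (the 8G size is an arbitrary example, not a recommendation):

```sh
# Check the current /dev/shm size; DataLoader workers pass batches
# through it, so it needs room for several batches of tensors.
df -h /dev/shm

# Temporarily remount it larger (resets on reboot); 8G is an
# example value, size it to your RAM and batch size.
sudo mount -o remount,size=8G /dev/shm
```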
If you're using Docker, then you should check https://github.com/pytorch/pytorch#docker-image, and in particular the --shm-size flag.
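Since you're in a VS Code devcontainer (which runs through Docker), the container rather than the host is probably what's capped: Docker defaults a container's /dev/shm to 64 MB regardless of how much RAM or VRAM the machine has. A sketch, with the image tag and size as placeholders:

```sh
# --shm-size raises the container's /dev/shm above Docker's 64 MB
# default; the 8g value and image tag here are placeholders.
docker run --gpus all --shm-size=8g -it pytorch/pytorch:latest
```

In a devcontainer specifically, the equivalent is adding "--shm-size=8g" to the runArgs array in devcontainer.json.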
Yep, looks like this was the issue! Thanks for the help.
Great! You're welcome!
I'm trying to run the test script on the 7scenes dataset; I've tried both the standard and fast_cost_volume versions. This is the command I'm running (after following the preprocessing steps for 7scenes as well as tuple generation). I'm using a machine with 3 A6000s (in a VS Code devcontainer), so the shared-memory aspect seems strange, considering I have >40 GB of VRAM.
This is the exact error I get: