google-research / multinerf

A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF
Apache License 2.0

Out of memory while trying to allocate 58796148776 bytes during training #110

Open kenchen3000 opened 1 year ago

kenchen3000 commented 1 year ago

Hi, I get an out-of-GPU-memory error when trying to train on 360 data such as Bonsai or Stump. The command is:

```
python -m train --gin_configs=configs/360.gin \
  --gin_bindings="Config.data_dir = '${DATA_DIR}'" \
  --gin_bindings="Config.checkpoint_dir = '${DATA_DIR}/checkpoints'" \
  --logtostderr
```

Running on WSL (Ubuntu 20.04).

The output error is below.

Any suggestions? Thanks!

The above exception was the direct cause of the following exception:

```
Traceback (most recent call last):
  File "/home/ck/miniconda3/envs/multinerf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ck/miniconda3/envs/multinerf/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/c/gitcode/multinerf/train.py", line 288, in <module>
    app.run(main)
  File "/home/ck/miniconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/ck/miniconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/mnt/c/gitcode/multinerf/train.py", line 119, in main
    state, stats, rngs = train_pstep(
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 58796148776 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
    parameter allocation:          104.27MiB
    constant allocation:           128.5KiB
    maybe_live_out allocation:     103.09MiB
    preallocated temp allocation:  54.76GiB
    preallocated temp fragmentation: 232B (0.00%)
    total allocation:              54.86GiB
    total fragmentation:           30.49MiB (0.05%)
```

deeepwin commented 1 year ago

I had the same issue. Reducing batch_size to match the available GPU memory helped in my case. See the similar ticket here, and also check the OOM Errors section of the README.
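For reference, the batch size can be overridden from the command line with an extra gin binding, without editing `configs/360.gin`. This is a sketch based on the command in the original report; the value `8192` is an assumption (half of the repo's 16384 default), and the right number depends on how much memory your GPU has:

```shell
# Same training command as above, with one extra binding that halves the
# batch size to cut the per-step memory footprint. Lower it further
# (e.g. 4096) if the OOM persists.
python -m train --gin_configs=configs/360.gin \
  --gin_bindings="Config.data_dir = '${DATA_DIR}'" \
  --gin_bindings="Config.checkpoint_dir = '${DATA_DIR}/checkpoints'" \
  --gin_bindings="Config.batch_size = 8192" \
  --logtostderr
```

Note that batch_size must stay divisible by the total number of devices times the per-device patch count, so powers of two are the safest choices.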