Hi,
I get an Out of GPU memory error when trying to train on 360 data such as Bonsai and Stump.
The command is "python -m train --gin_configs=configs/360.gin --gin_bindings="Config.data_dir = '${DATA_DIR}'" --gin_bindings="Config.checkpoint_dir = '${DATA_DIR}/checkpoints'" --logtostderr"
Running on WSL (Ubuntu 20.04).
The output error is below.
Any suggestions? Thanks!
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ck/miniconda3/envs/multinerf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ck/miniconda3/envs/multinerf/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/c/gitcode/multinerf/train.py", line 288, in
app.run(main)
File "/home/ck/miniconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/ck/miniconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/mnt/c/gitcode/multinerf/train.py", line 119, in main
state, stats, rngs = train_pstep(
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 58796148776 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 104.27MiB
constant allocation: 128.5KiB
maybe_live_out allocation: 103.09MiB
preallocated temp allocation: 54.76GiB
preallocated temp fragmentation: 232B (0.00%)
total allocation: 54.86GiB
total fragmentation: 30.49MiB (0.05%)
I had the same issue. Reducing batch_size to match the available GPU memory helped in my case. See the similar ticket here. Also check the OOM Errors section of the README.
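For example, you can override the batch size from the command line with an extra gin binding. This is only a sketch: Config.batch_size is assumed to be the relevant gin parameter, and 4096 is an illustrative value to tune to your GPU:

python -m train --gin_configs=configs/360.gin \
  --gin_bindings="Config.data_dir = '${DATA_DIR}'" \
  --gin_bindings="Config.checkpoint_dir = '${DATA_DIR}/checkpoints'" \
  --gin_bindings="Config.batch_size = 4096" \
  --logtostderr

If you shrink the batch size, the README's OOM section also recommends increasing the number of training iterations and scaling down the learning rate by the same factor to preserve quality.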