Should I change anything in the blender_refnerf.gin config file?
With the original config I get an OOM error:
SCENE=toaster
EXPERIMENT=shinyblender
DATA_DIR=/HPS/ColorNeRF/work/ref_nerf_dataset/
CHECKPOINT_DIR=/HPS/ColorNeRF/work/multinerf/results/"$EXPERIMENT"/"$SCENE"
rm "$CHECKPOINT_DIR"/*
python -m train \
--gin_configs=configs/blender_refnerf.gin \
--gin_bindings="Config.data_dir = '${DATA_DIR}/${SCENE}'" \
--gin_bindings="Config.checkpoint_dir = '${CHECKPOINT_DIR}'" \
--logtostderr
zsh: sure you want to delete all 2 files in /HPS/ColorNeRF/work/multinerf/results/shinyblender/toaster [yn]? y
2023-08-22 18:22:37.196800: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I0822 18:22:41.638901 140534443848896 xla_bridge.py:622] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA
I0822 18:22:41.639630 140534443848896 xla_bridge.py:622] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/xla_bridge.py:835: UserWarning: jax.host_id has been renamed to jax.process_index. This alias will eventually be removed; please update your code.
warnings.warn(
Number of parameters being optimized: 713230
I0822 18:23:27.852782 140534443848896 checkpoints.py:1054] Found no checkpoint files in /HPS/ColorNeRF/work/multinerf/results/shinyblender/toaster with prefix checkpoint_
2023-08-22 18:24:11.556972: W external/xla/xla/service/hlo_rematerialization.cc:2202] Can't reduce memory use below 35.59GiB (38215385088 bytes) by rematerialization; only reduced to 52.52GiB (56396688164 bytes), down from 59.37GiB (63752162188 bytes) originally
warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'
2023-08-22 18:24:28.119826: W external/tsl/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 53.85GiB (rounded to 57819280896)requested by op
2023-08-22 18:24:28.120719: W external/tsl/tsl/framework/bfc_allocator.cc:497] *___________________________________________________________________________________________________
2023-08-22 18:24:28.123621: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2593] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 57819280688 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 8.89MiB
constant allocation: 130.6KiB
maybe_live_out allocation: 8.17MiB
preallocated temp allocation: 53.85GiB
preallocated temp fragmentation: 992.38MiB (1.80%)
total allocation: 53.86GiB
total fragmentation: 992.58MiB (1.80%)
Peak buffers:
Buffer 1:
Size: 1.38GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/concatenate[dimension=1]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=459 deduplicated_name="fusion.870"
XLA Label: fusion
Shape: f32[1048576,352]
==========================
Buffer 2:
Size: 1.38GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/concatenate[dimension=1]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=459 deduplicated_name="fusion.870"
XLA Label: fusion
Shape: f32[1048576,352]
==========================
Buffer 3:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Buffer 4:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.626"
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Buffer 5:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.626"
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Buffer 6:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.626"
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Buffer 7:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.626"
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Buffer 8:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(Dense_7))/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=456
XLA Label: custom-call
Shape: f32[1048576,256]
==========================
Buffer 9:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Buffer 10:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Buffer 11:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Buffer 12:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(Dense_2))/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=456
XLA Label: custom-call
Shape: f32[1048576,256]
==========================
Buffer 13:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(Dense_7))/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=456
XLA Label: custom-call
Shape: f32[1048576,256]
==========================
Buffer 14:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Buffer 15:
Size: 1.00GiB
Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
XLA Label: fusion
Shape: f32[1048576,256]
==========================
Traceback (most recent call last):
File "/HPS/ColorNeRF/work/multinerf/train.py", line 291, in <module>
app.run(main)
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/HPS/ColorNeRF/work/multinerf/train.py", line 120, in main
state, stats, rngs = train_pstep(
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/traceback_util.py", line 166, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/api.py", line 1803, in cache_miss
out = map_bind_continuation(execute(*tracers))
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/profiler.py", line 314, in wrapper
return func(*args, **kwargs)
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/interpreters/pxla.py", line 1229, in __call__
results = self.xla_executable.execute_sharded(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 57819280688 bytes.
BufferAssignment OOM Debugging.
[same BufferAssignment stats and peak-buffer list as above, printed again inside the traceback]
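As a sanity check on the numbers in the dump: each f32[1048576,256] activation is exactly 1 GiB, and the f32[1048576,352] concatenation buffer is 1.375 GiB, matching the "1.00GiB" and "1.38GiB" entries. The leading dimension 1048576 = 16384 × 64 plausibly comes from the default batch size times samples per ray, though that factorization is my assumption:

```python
# Verify the peak-buffer sizes reported in the XLA OOM dump above.
BYTES_PER_F32 = 4
GIB = 2**30

def buffer_gib(shape):
    """Size in GiB of an f32 buffer with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * BYTES_PER_F32 / GIB

print(buffer_gib((1048576, 256)))  # 1.0   -> the "1.00GiB" buffers
print(buffer_gib((1048576, 352)))  # 1.375 -> the "1.38GiB" buffers
# 1048576 = 16384 * 64: plausibly batch_size x samples-per-ray (assumption)
print(16384 * 64 == 1048576)       # True
```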
And when I instead set batch_size: int = 4096 and render_chunk_size: int = 4096 in internal/configs.py, I get a KeyError:
SCENE=toaster
EXPERIMENT=shinyblender
DATA_DIR=/HPS/ColorNeRF/work/ref_nerf_dataset/
CHECKPOINT_DIR=/HPS/ColorNeRF/work/multinerf/results/"$EXPERIMENT"/"$SCENE"
rm "$CHECKPOINT_DIR"/*
python -m train \
--gin_configs=configs/blender_refnerf.gin \
--gin_bindings="Config.data_dir = '${DATA_DIR}/${SCENE}'" \
--gin_bindings="Config.checkpoint_dir = '${CHECKPOINT_DIR}'" \
--logtostderr
zsh: sure you want to delete all 2 files in /HPS/ColorNeRF/work/multinerf/results/shinyblender/toaster [yn]? y
2023-08-22 18:30:58.707151: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I0822 18:31:02.937265 139697632249024 xla_bridge.py:622] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA
I0822 18:31:02.938034 139697632249024 xla_bridge.py:622] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/xla_bridge.py:835: UserWarning: jax.host_id has been renamed to jax.process_index. This alias will eventually be removed; please update your code.
warnings.warn(
Number of parameters being optimized: 713230
I0822 18:31:48.341920 139697632249024 checkpoints.py:1054] Found no checkpoint files in /HPS/ColorNeRF/work/multinerf/results/shinyblender/toaster with prefix checkpoint_
warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'
Traceback (most recent call last):
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/HPS/ColorNeRF/work/multinerf/train.py", line 291, in <module>
app.run(main)
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/HPS/ColorNeRF/work/multinerf/train.py", line 128, in main
loss_threshold = jnp.mean(stats['loss_threshold'])
KeyError: 'loss_threshold'
All the tests pass fine.
So: should I change anything in the blender_refnerf.gin config file to avoid the OOM, and why does lowering batch_size / render_chunk_size in internal/configs.py lead to the KeyError: 'loss_threshold'? What should I do?
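For what it's worth, since batch_size and render_chunk_size are ordinary fields of Config in internal/configs.py, I believe they can also be overridden on the gin side instead of editing the dataclass, e.g. by appending to blender_refnerf.gin (the 4096 values here are just what I tried, not a recommendation):

```gin
Config.batch_size = 4096
Config.render_chunk_size = 4096
```

The same overrides should also work from the command line via additional --gin_bindings flags, so the source tree stays untouched.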