google-research / multinerf

A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF

Training toaster scene with config blender_refnerf.gin #131

Closed prakashknaikade closed 9 months ago

prakashknaikade commented 10 months ago

All the tests are working fine.

Should I change anything in the blender_refnerf.gin config file?

With the original config I am getting an OOM error:

SCENE=toaster
EXPERIMENT=shinyblender
DATA_DIR=/HPS/ColorNeRF/work/ref_nerf_dataset/
CHECKPOINT_DIR=/HPS/ColorNeRF/work/multinerf/results/"$EXPERIMENT"/"$SCENE"

rm "$CHECKPOINT_DIR"/*
python -m train \
  --gin_configs=configs/blender_refnerf.gin \
  --gin_bindings="Config.data_dir = '${DATA_DIR}/${SCENE}'" \
  --gin_bindings="Config.checkpoint_dir = '${CHECKPOINT_DIR}'" \
  --logtostderr
zsh: sure you want to delete all 2 files in /HPS/ColorNeRF/work/multinerf/results/shinyblender/toaster [yn]? y

2023-08-22 18:22:37.196800: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I0822 18:22:41.638901 140534443848896 xla_bridge.py:622] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA
I0822 18:22:41.639630 140534443848896 xla_bridge.py:622] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/xla_bridge.py:835: UserWarning: jax.host_id has been renamed to jax.process_index. This alias will eventually be removed; please update your code.
  warnings.warn(
Number of parameters being optimized: 713230
I0822 18:23:27.852782 140534443848896 checkpoints.py:1054] Found no checkpoint files in /HPS/ColorNeRF/work/multinerf/results/shinyblender/toaster with prefix checkpoint_
2023-08-22 18:24:11.556972: W external/xla/xla/service/hlo_rematerialization.cc:2202] Can't reduce memory use below 35.59GiB (38215385088 bytes) by rematerialization; only reduced to 52.52GiB (56396688164 bytes), down from 59.37GiB (63752162188 bytes) originally
warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'

2023-08-22 18:24:28.119826: W external/tsl/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 53.85GiB (rounded to 57819280896)requested by op 
2023-08-22 18:24:28.120719: W external/tsl/tsl/framework/bfc_allocator.cc:497] *___________________________________________________________________________________________________
2023-08-22 18:24:28.123621: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2593] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 57819280688 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    8.89MiB
              constant allocation:   130.6KiB
        maybe_live_out allocation:    8.17MiB
     preallocated temp allocation:   53.85GiB
  preallocated temp fragmentation:  992.38MiB (1.80%)
                 total allocation:   53.86GiB
              total fragmentation:  992.58MiB (1.80%)
Peak buffers:
        Buffer 1:
                Size: 1.38GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/concatenate[dimension=1]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=459 deduplicated_name="fusion.870"
                XLA Label: fusion
                Shape: f32[1048576,352]
                ==========================

        Buffer 2:
                Size: 1.38GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/concatenate[dimension=1]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=459 deduplicated_name="fusion.870"
                XLA Label: fusion
                Shape: f32[1048576,352]
                ==========================

        Buffer 3:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

        Buffer 4:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.626"
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

        Buffer 5:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.626"
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

        Buffer 6:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.626"
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

        Buffer 7:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/select_n" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.626"
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

        Buffer 8:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(Dense_7))/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=456
                XLA Label: custom-call
                Shape: f32[1048576,256]
                ==========================

        Buffer 9:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

        Buffer 10:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

        Buffer 11:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

        Buffer 12:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(Dense_2))/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=456
                XLA Label: custom-call
                Shape: f32[1048576,256]
                ==========================

        Buffer 13:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(Dense_7))/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=456
                XLA Label: custom-call
                Shape: f32[1048576,256]
                ==========================

        Buffer 14:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

        Buffer 15:
                Size: 1.00GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/vmap(jvp(jit(relu)))/max" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=457 deduplicated_name="fusion.864"
                XLA Label: fusion
                Shape: f32[1048576,256]
                ==========================

Traceback (most recent call last):
  File "/HPS/ColorNeRF/work/multinerf/train.py", line 291, in <module>
    app.run(main)
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/HPS/ColorNeRF/work/multinerf/train.py", line 120, in main
    state, stats, rngs = train_pstep(
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/traceback_util.py", line 166, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/api.py", line 1803, in cache_miss
    out = map_bind_continuation(execute(*tracers))
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/profiler.py", line 314, in wrapper
    return func(*args, **kwargs)
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/interpreters/pxla.py", line 1229, in __call__
    results = self.xla_executable.execute_sharded(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 57819280688 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    8.89MiB
              constant allocation:   130.6KiB
        maybe_live_out allocation:    8.17MiB
     preallocated temp allocation:   53.85GiB
  preallocated temp fragmentation:  992.38MiB (1.80%)
                 total allocation:   53.86GiB
              total fragmentation:  992.58MiB (1.80%)
Peak buffers:
        Buffer 1:
                Size: 1.38GiB
                Operator: op_name="pmap(train_step)/jit(main)/jvp(Model)/NerfMLP_0/concatenate[dimension=1]" source_file="/HPS/ColorNeRF/work/multinerf/internal/models.py" source_line=459 deduplicated_name="fusion.870"
                XLA Label: fusion
                Shape: f32[1048576,352]
                ==========================

And when I change batch_size to 4096 and render_chunk_size to 4096 in internal/configs.py, I get:

SCENE=toaster
EXPERIMENT=shinyblender
DATA_DIR=/HPS/ColorNeRF/work/ref_nerf_dataset/
CHECKPOINT_DIR=/HPS/ColorNeRF/work/multinerf/results/"$EXPERIMENT"/"$SCENE"

rm "$CHECKPOINT_DIR"/*
python -m train \
  --gin_configs=configs/blender_refnerf.gin \
  --gin_bindings="Config.data_dir = '${DATA_DIR}/${SCENE}'" \
  --gin_bindings="Config.checkpoint_dir = '${CHECKPOINT_DIR}'" \
  --logtostderr
zsh: sure you want to delete all 2 files in /HPS/ColorNeRF/work/multinerf/results/shinyblender/toaster [yn]? y

2023-08-22 18:30:58.707151: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I0822 18:31:02.937265 139697632249024 xla_bridge.py:622] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA
I0822 18:31:02.938034 139697632249024 xla_bridge.py:622] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/xla_bridge.py:835: UserWarning: jax.host_id has been renamed to jax.process_index. This alias will eventually be removed; please update your code.
  warnings.warn(
Number of parameters being optimized: 713230
I0822 18:31:48.341920 139697632249024 checkpoints.py:1054] Found no checkpoint files in /HPS/ColorNeRF/work/multinerf/results/shinyblender/toaster with prefix checkpoint_
warning: Linking two modules of different target triples: 'LLVMDialectModule' is 'nvptx64-nvidia-gpulibs' whereas '' is 'nvptx64-nvidia-cuda'

Traceback (most recent call last):
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/HPS/ColorNeRF/work/multinerf/train.py", line 291, in <module>
    app.run(main)
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/HPS/ColorNeRF/work/opt/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/HPS/ColorNeRF/work/multinerf/train.py", line 128, in main
    loss_threshold = jnp.mean(stats['loss_threshold'])
KeyError: 'loss_threshold'
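
For reference, the same reduction can presumably also be passed as gin bindings on the command line instead of editing internal/configs.py, since Config.batch_size and Config.render_chunk_size are fields of the same Config class overridden above. A minimal sketch (untested), reusing the variables from the command above:

# Same training command as above, with the batch sizes lowered via gin bindings
# rather than by editing internal/configs.py:
python -m train \
  --gin_configs=configs/blender_refnerf.gin \
  --gin_bindings="Config.data_dir = '${DATA_DIR}/${SCENE}'" \
  --gin_bindings="Config.checkpoint_dir = '${CHECKPOINT_DIR}'" \
  --gin_bindings="Config.batch_size = 4096" \
  --gin_bindings="Config.render_chunk_size = 4096" \
  --logtostderr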

What should I do?

Quieter2018 commented 10 months ago

Same problem here.

sevashasla commented 10 months ago

I checked out 5d4c82831a9b94a87efada2eee6a993d530c4226 and it helped. One of the recent commits probably broke the code.
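
For anyone who wants to try the same workaround, a minimal sketch (run inside the multinerf checkout, then rerun the training command as before):

# Pin the working tree to the commit mentioned above:
git checkout 5d4c82831a9b94a87efada2eee6a993d530c4226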