google / nerfactor

Neural Factorization of Shape and Reflectance Under an Unknown Illumination
https://xiuming.info/projects/nerfactor/
Apache License 2.0

NaN or Inf in 'Albedo' at step II. Joint Optimization #19

CorneliusHsiao closed 2 years ago

CorneliusHsiao commented 2 years ago

Hi,

Great work. I am training your model on my own dataset, prepared in the real-data format. However, training always fails with the error below during step "II. Joint Optimization" of "Training, Validation, and Testing". Could you give me some insight into which configuration or data format might be wrong?

Error message

Exception has occurred: InvalidArgumentError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
2 root error(s) found.
  (0) Invalid argument:  Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
     [[{{node cond/else/_1/StatefulPartitionedCall/gradient_tape/model/CheckNumerics_2}}]]
     [[cond/else/_1/StatefulPartitionedCall/replica_1/model/assert_greater_3/Assert/AssertGuard/branch_executed/_57539/_6203]]
  (1) Invalid argument:  Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
     [[{{node cond/else/_1/StatefulPartitionedCall/gradient_tape/model/CheckNumerics_2}}]]
0 successful operations.
3 derived errors ignored. [Op:__inference_fn_with_cond_190304]

Function call stack:
fn_with_cond -> fn_with_cond
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
    self.captured_inputs)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 708, in _call
    return function_lib.defun(fn_with_cond)(*canon_args, **canon_kwds)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/admin/FaceReal/nerfactor/nerfactor/trainvali.py", line 181, in main
    strategy, model, batch, optimizer, global_bs_train)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/admin/FaceReal/nerfactor/nerfactor/trainvali.py", line 341, in <module>
    app.run(main)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 193, in _run_module_as_main (Current frame)
    "__main__", mod_spec)

My dataset directory looks like this:

root
│   transforms_test.json
│   transforms_train.json
│   transforms_val.json
│   
├───test_000
│       metadata.json
│       nn.png
│       rgba.png
│       
├───train_000
│       albedo.png
│       metadata.json
│       rgba.png
│       
├───train_001
│       metadata.json
│       rgba.png
│       
├───train_002
│       metadata.json
│       rgba.png
│       
├───train_003
│       metadata.json
│       rgba.png
│       
├───train_004
│       metadata.json
│       rgba.png
│       
├───train_005
│       metadata.json
│       rgba.png
│       
├───train_006
│       metadata.json
│       rgba.png
│       
├───train_007
│       metadata.json
│       rgba.png
│       
├───train_008
│       metadata.json
│       rgba.png
│       
├───train_009
│       metadata.json
│       rgba.png
│       
└───val_000
        metadata.json
        rgba.png
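
A layout like the one above can be sanity-checked by listing which view folders contain a different set of files from the rest. The following is a minimal sketch, not part of NeRFactor: the helper names `summarize_views` and `odd_views` are hypothetical, and whether extra files such as `albedo.png` or `nn.png` matter to the pipeline is not established in this thread.

```python
from collections import Counter
from pathlib import Path


def summarize_views(root):
    """Map each view directory under `root` to the sorted file names it contains."""
    root = Path(root)
    return {
        d.name: sorted(p.name for p in d.iterdir() if p.is_file())
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


def odd_views(views):
    """Return the views whose file set differs from the most common file set."""
    if not views:
        return {}
    # The most frequent file set is treated as the "expected" layout (an assumption).
    counts = Counter(tuple(files) for files in views.values())
    common = set(counts.most_common(1)[0][0])
    report = {}
    for name, files in views.items():
        extra = sorted(set(files) - common)
        missing = sorted(common - set(files))
        if extra or missing:
            report[name] = {"extra": extra, "missing": missing}
    return report
```

Calling `odd_views(summarize_views("root"))` on the tree listed above would flag `test_000` (extra `nn.png`) and `train_000` (extra `albedo.png`), since every other folder holds only `metadata.json` and `rgba.png`.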

xiumingzhang commented 2 years ago

Sorry for the delayed response. Have you solved the problem? This issue seems to be related to specific values in your data. Please feel free to reopen this if you need further help.
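
One way to start looking for "specific values" is to scan the dataset's JSON files (`transforms_*.json` and each `metadata.json`) for numbers that are already NaN or infinite before training begins. The sketch below uses only the Python standard library and assumes nothing about the keys in those files. The function names are hypothetical, and it rules out only one class of cause: NaN gradients can also arise from perfectly finite inputs, and the images themselves are not inspected here.

```python
import json
import math


def find_nonfinite(obj, path="$"):
    """Yield (path, value) for every NaN or +/-Inf float nested in `obj`."""
    if isinstance(obj, float):
        if not math.isfinite(obj):
            yield path, obj
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_nonfinite(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from find_nonfinite(value, f"{path}[{i}]")


def check_json_file(json_path):
    """Load one JSON file and list the paths of its non-finite numbers."""
    with open(json_path) as f:
        # json.load accepts the NaN / Infinity / -Infinity literals by default.
        data = json.load(f)
    return [p for p, _ in find_nonfinite(data)]
```

For example, `check_json_file("root/transforms_train.json")` returns an empty list when every number in the file is finite, and otherwise a list of paths such as `$.frames[3].transform_matrix[0][1]` (the key names in that path are purely illustrative).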

hongsiyu commented 2 years ago

@CorneliusHsiao I have the same problem. Have you solved it?