google / nerfactor

Neural Factorization of Shape and Reflectance Under an Unknown Illumination
https://xiuming.info/projects/nerfactor/
Apache License 2.0

gradient error in Joint Optimization #25

Open hongsiyu opened 2 years ago

hongsiyu commented 2 years ago

Shape pre-training ran successfully, but joint optimization fails with the following error:

2022-09-27 02:30:25.358618: E tensorflow/core/kernels/check_numerics_op.cc:289] abnormal_detected_host @0x7f43f6808a00 = {1, 0} Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo'
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
[[Identity_6/_372]]
(1) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_train_step_45946]

hongsiyu commented 2 years ago

I am using my own data, whose camera poses were computed with COLMAP.

Jiangyu1181 commented 2 years ago

Turn down the learning rate. I had the same problem with my own data, which I created in Blender: with the default learning rate (5e-3) I got the same error as you, and after lowering the learning rate to 5e-4 everything worked.
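To illustrate why a tenfold smaller learning rate can make NaN gradients disappear, here is a minimal toy sketch. It is not NeRFactor code: the loss f(x) = x**4, the starting point, and the helper name `descend` are assumptions chosen only so that the two learning rates mentioned above behave differently.

```python
import math

def descend(lr, x0=20.0, steps=200):
    """Run plain gradient descent on the toy loss f(x) = x**4 (hypothetical example)."""
    x = x0
    for _ in range(steps):
        # df/dx = 4 * x^3; repeated multiplication overflows to inf instead of raising
        grad = 4.0 * x * x * x
        x -= lr * grad
        if not math.isfinite(x):
            # the iterate blew up to inf or NaN, so stop early
            break
    return x

for lr in (5e-3, 5e-4):
    x = descend(lr)
    print(f"lr={lr:g}: final x={x}, finite={math.isfinite(x)}")
```

In this toy setting the larger step overshoots further on every update until the value is no longer finite, while the smaller step shrinks x steadily. A real model can still produce NaNs for other reasons, such as bad input data.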

hongsiyu commented 2 years ago

I have tried setting lr to 5e-4 and to 5e-5, and I still get the same error.

Jiangyu1181 commented 2 years ago

Did you override lr through config_override in the Joint Optimization (training and validation) step? For example: --config_override="lr=$lr".
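The reason this matters can be sketched as follows, assuming that command-line overrides are applied after the .ini file is read and therefore take precedence. The helper `load_config`, the `[DEFAULT]` section, and the sample INI text are hypothetical and only illustrate the idea; they are not NeRFactor's actual implementation.

```python
import configparser

# Hypothetical stand-in for a config file such as an .ini with a default lr
INI_TEXT = """
[DEFAULT]
lr = 5e-3
"""

def load_config(ini_text, override=""):
    """Parse INI text, then apply comma-separated key=value overrides (hypothetical helper)."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    # Skip empty entries so an empty override string changes nothing
    for pair in filter(None, override.split(",")):
        key, value = pair.split("=", 1)
        config["DEFAULT"][key.strip()] = value.strip()
    return config

print(load_config(INI_TEXT)["DEFAULT"]["lr"])             # value from the file
print(load_config(INI_TEXT, "lr=5e-4")["DEFAULT"]["lr"])  # value after the override
```

Under that assumption, editing the .ini file has no effect on a key that is also passed through the override string, so it is worth checking which value the training run actually uses.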

hongsiyu commented 2 years ago

Yes, I changed lr directly in the config file shape_mvs.ini.

hongsiyu commented 2 years ago

I also changed it in nerfactor_mvs.ini.