nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

unstable issue in the training process #2221

Open LuLing06 opened 1 year ago

LuLing06 commented 1 year ago

Describe the bug
I have trained three models, nerfacto, nerfacto-big, and instant-ngp, on my dataset and found that the training process was unstable. It looks like this: image. Is there an implementation issue? Does nerfstudio implement early stopping?

tancik commented 1 year ago

The models are typically trained for fewer iterations, i.e., nerfacto is only set to train for 30k iters. Some instabilities emerge when training for a long time. We have tried to look into them but have so far been unsuccessful.

LuLing06 commented 1 year ago

Thanks for your explanation. I have found a possible cause: it may be an issue with the resume process. When I resumed training, the learning rate went back to the default (0.01) instead of continuing from the final learning rate of the previous run. Here is the picture: image

I used this command to resume training: ns-train nerfacto --experiment-name $exp_name --timestamp $timestamp --data $data --load-dir $resume_dir --output-dir $output_dir --max-num-iterations $iterations --vis $vis
Note: resume_dir=$output_dir/$exp_name/$exp_name/$timestamp/nerfstudio_models

How can I resume the learning rate from the latest checkpoint?
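For reference, here is a minimal plain-PyTorch sketch (not nerfstudio's actual trainer code) of what resuming the learning rate requires: the LR scheduler's state, which tracks the current step, has to be saved into the checkpoint and restored alongside the optimizer state. If only the model and optimizer are restored, the schedule restarts from its initial value (0.01 here).

```python
import torch

# Generic PyTorch sketch, not nerfstudio's trainer: the scheduler's state dict
# carries the step counter that determines the current learning rate.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

# --- saving a checkpoint ---
torch.save(
    {
        "step": 1000,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),  # without this, the schedule restarts at the default LR
    },
    "checkpoint.ckpt",
)

# --- resuming ---
ckpt = torch.load("checkpoint.ckpt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])  # restores the step counter, so the LR continues
print(scheduler.get_last_lr())  # continues from the saved schedule, not from 0.01
```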


viridityzhu commented 6 months ago

Hi, I am encountering a similar issue: after reloading a checkpoint, the model performance drops (p1). I checked that the learning rates were loaded correctly (p2), but there seem to be other issues with the loading, as the training losses camera_opt_regularizer and rgb_loss dropped sharply (p3). p1:

image

p2:

image

p3:

image

The loading command is ns-train nerfacto --load-dir outputs/processed/nerfacto/2024-02-29_175948/nerfstudio_models --data test/multiview_train_data/32/processed --vis wandb --max-num-iterations 60000
Is there any solution for this issue?
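In case it helps narrow this down, below is a generic (non-nerfstudio) sanity-check sketch: evaluate the loss on a fixed batch immediately before saving and immediately after reloading. If the two values differ, some piece of training state (for example the camera optimizer parameters behind camera_opt_regularizer, or optimizer buffers) is not surviving the checkpoint round trip. The model and batch below are stand-ins, not the nerfacto pipeline.

```python
import torch

# Generic diagnostic sketch: verify that a save/load round trip reproduces
# the same loss on a fixed batch.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 16), torch.randn(32, 3)

# Take a few optimization steps so the state is non-trivial.
for _ in range(10):
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

with torch.no_grad():
    loss_before = loss_fn(model(x), y).item()

torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, "ckpt.pt")

# Reload into a freshly constructed model, as a resumed run would.
model2 = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
ckpt = torch.load("ckpt.pt")
model2.load_state_dict(ckpt["model"])

with torch.no_grad():
    loss_after = loss_fn(model2(x), y).item()

# If these two values differ, some state was lost in the round trip.
print(f"loss before save: {loss_before:.6f}, after reload: {loss_after:.6f}")
```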