nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

unstable issue in the training process #2221

Open LuLing06 opened 1 year ago

LuLing06 commented 1 year ago

Describe the bug
I have trained three models, nerfacto, nerfacto-big, and instant-ngp, on my dataset and found that the training process was unstable. It looks like this: image. Is there an implementation issue? Does nerfstudio implement early stopping?

tancik commented 1 year ago

The models are typically trained for fewer iterations, i.e., nerfacto is only set to train for 30k iters. Some instabilities emerge when training for a long time. We have tried to look into them but have so far been unsuccessful.

LuLing06 commented 1 year ago

Thanks for your explanation. I have found a possible cause: it may be an issue with the resume process. When I resumed training, the learning rate went back to the default (0.01) instead of continuing from the final learning rate of the previous run. Here is the picture: image

I used this command to resume training: ns-train nerfacto --experiment-name $exp_name --timestamp $timestamp --data $data --load-dir $resume_dir --output-dir $output_dir --max-num-iterations $iterations --vis $vis
Note: resume_dir=$output_dir/$exp_name/$exp_name/$timestamp/nerfstudio_models

How can I resume the learning rate from the latest checkpoint?
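For reference, here is a minimal plain-PyTorch sketch (not nerfstudio's actual trainer code) of what resuming the learning rate requires: the LR scheduler's state, which tracks the current step, has to be saved into the checkpoint and restored alongside the optimizer state. If only the model and optimizer are restored, the schedule restarts from its initial value (0.01 here).

```python
import torch

# Generic PyTorch sketch, not nerfstudio's trainer: the scheduler's state dict
# carries the step counter that determines the current learning rate.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

# --- saving a checkpoint ---
torch.save(
    {
        "step": 1000,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),  # without this, the schedule restarts at the default LR
    },
    "checkpoint.ckpt",
)

# --- resuming ---
ckpt = torch.load("checkpoint.ckpt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])  # restores the step counter, so the LR continues
print(scheduler.get_last_lr())  # continues from the saved schedule, not from 0.01
```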


viridityzhu commented 6 months ago

Hi, I am encountering a similar issue: after reloading a checkpoint, the model performance drops (p1). I checked that the learning rates were loaded correctly (p2), but there seem to be other issues with the loading, as the training losses camera_opt_regularizer and rgb_loss dropped sharply (p3). p1:

image

p2:

image

p3:

image

The loading command is ns-train nerfacto --load-dir outputs/processed/nerfacto/2024-02-29_175948/nerfstudio_models --data test/multiview_train_data/32/processed --vis wandb --max-num-iterations 60000
Is there any solution for this issue?
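In case it helps narrow this down, below is a generic (non-nerfstudio) sanity-check sketch: evaluate the loss on a fixed batch immediately before saving and immediately after reloading. If the two values differ, some piece of training state (for example the camera optimizer parameters behind camera_opt_regularizer, or optimizer buffers) is not surviving the checkpoint round trip. The model and batch below are stand-ins, not the nerfacto pipeline.

```python
import torch

# Generic diagnostic sketch: verify that a save/load round trip reproduces
# the same loss on a fixed batch.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 16), torch.randn(32, 3)

# Take a few optimization steps so the state is non-trivial.
for _ in range(10):
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

with torch.no_grad():
    loss_before = loss_fn(model(x), y).item()

torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, "ckpt.pt")

# Reload into a freshly constructed model, as a resumed run would.
model2 = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
ckpt = torch.load("ckpt.pt")
model2.load_state_dict(ckpt["model"])

with torch.no_grad():
    loss_after = loss_fn(model2(x), y).item()

# If these two values differ, some state was lost in the round trip.
print(f"loss before save: {loss_before:.6f}, after reload: {loss_after:.6f}")
```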