nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Loading a checkpoint trains past 100% #3096

Open sam598 opened 7 months ago

sam598 commented 7 months ago

Describe the bug Loading a previously trained checkpoint causes training to continue past --max-num-iterations.

To Reproduce Steps to reproduce the behavior:

  1. Train a model (I used Splatfacto)
  2. Train it again with --load-dir
  3. Set --max-num-iterations higher than the step count saved in the checkpoint
  4. The model will not stop training for 24+ hours.

Expected behavior The training should stop when it reaches --max-num-iterations

Additional context I trained a Splatfacto model to 10,000 steps. Afterwards I loaded the saved checkpoint with --load-dir, and set --max-num-iterations to 20,000.

It started training with output that looked like this:

10090 (50.10%) 3m 45s

When it approached 20,000 steps it looked like this:

19810 (98.80%) 10s

Then it keeps training, apparently for another 24 hours if not stopped.

20090 (101.00%) 23h 59m 40s

If --max-num-iterations was not set, or was set lower than the checkpoint's step count, I could see why it would train indefinitely. But the logical (and more useful) behavior would be for it to train up to the defined value.

What is perplexing is that, looking at the code in trainer.py, this does not seem like it should be possible. The code looks like it should behave the way I am expecting it to.

https://github.com/nerfstudio-project/nerfstudio/blob/eddf2d21b5f568eb3370426b3d95e2501788752c/nerfstudio/engine/trainer.py#L241

Where is this indefinite training coming from?
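
Something along these lines would produce exactly that output. This is only an illustrative, self-contained sketch (the names start_step and max_num_iterations are mine, not a quote of trainer.py): if the loop's end bound is offset by the step loaded from the checkpoint while the percentage is still computed against --max-num-iterations, you get both the >100% readout and the extra training.

    # Illustrative sketch only, not the actual nerfstudio code.
    def train(start_step: int, max_num_iterations: int) -> None:
        # End bound offset by the loaded step: runs max_num_iterations EXTRA steps.
        for step in range(start_step, start_step + max_num_iterations):
            if step % 10_000 == 0:
                percent = 100.0 * step / max_num_iterations
                print(f"{step} ({percent:.2f}%)")  # passes 100% once step exceeds max_num_iterations
        print(f"stopped at step {step}")  # 29,999 here, not 20,000

    train(start_step=10_000, max_num_iterations=20_000)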

preacherwhite commented 1 week ago

I added a new pull request that directly addresses this issue. I think it is caused by a bug in how the iteration count is set after loading a checkpoint.
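
For illustration, the kind of change that implies is sketched below (names are mine, and this is not necessarily the exact diff in the PR): when resuming, keep --max-num-iterations as the hard stop instead of offsetting the loop's end by the loaded step.

    # Illustrative sketch only, not necessarily what the PR does.
    def train(start_step: int, max_num_iterations: int) -> None:
        if start_step >= max_num_iterations:
            print("checkpoint is already at or past --max-num-iterations; nothing to train")
            return
        # Resume from the loaded step but stop at the configured maximum.
        for step in range(start_step, max_num_iterations):
            pass  # one training iteration per step would run here
        print(f"trained steps {start_step}..{max_num_iterations - 1}; total step count is now {max_num_iterations}")

    train(start_step=10_000, max_num_iterations=20_000)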