Describe the bug
Loading a previously trained checkpoint trains past --max-num-iterations.
To Reproduce
Steps to reproduce the behavior:
1. Train a model (I used Splatfacto).
2. Train it again with --load-dir.
3. Set --max-num-iterations higher than the saved checkpoint's step count (see the command sketch below).
4. The model will not stop training for 24+ hours.
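Concretely, the two runs looked roughly like this (the data and checkpoint paths below are placeholders, not the exact paths I used):

```bash
# First run: train Splatfacto to 10,000 steps
ns-train splatfacto --data <path/to/data> --max-num-iterations 10000

# Second run: resume from the saved checkpoint, expecting 20,000 total steps
ns-train splatfacto --data <path/to/data> --max-num-iterations 20000 \
    --load-dir <path/to/first/run>/nerfstudio_models
```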
Expected behavior
The training should stop when it reaches --max-num-iterations.
Additional context
I trained a Splatfacto model to 10,000 steps. Afterwards, I loaded the saved checkpoint with --load-dir and set --max-num-iterations to 20,000.
It started training with output that looked like this:

```
10090 (50.10%) 3m 45s
```

As it approached 20,000 steps, it looked like this:

```
19810 (98.80%) 10s
```

Then it kept training, apparently for 24 hours if left unstopped:

```
20090 (101.00%) 23h 59m 40s
```
If --max-num-iterations were not set, or were set lower than the checkpoint's step count, I could understand it training indefinitely. But the logical (and more useful) behavior would be for it to train to the defined value.
What is perplexing is that, looking at the code in trainer.py, this does not seem like it should be possible. This code looks like it should run the way I am expecting it to:
https://github.com/nerfstudio-project/nerfstudio/blob/eddf2d21b5f568eb3370426b3d95e2501788752c/nerfstudio/engine/trainer.py#L241
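For what it's worth, the only mechanism I can think of that would produce this is the loop's end point being offset by the resume step. Below is a minimal, self-contained sketch of that hypothesis; the variable names are my own illustration, not necessarily what trainer.py actually does:

```python
# Hypothetical sketch, NOT the actual trainer.py internals: if the
# training loop's upper bound is offset by the step restored from the
# checkpoint, training overshoots --max-num-iterations.

start_step = 10_000          # step restored from the checkpoint via --load-dir
max_num_iterations = 20_000  # value passed as --max-num-iterations

for step in range(start_step, start_step + max_num_iterations):
    pass  # stand-in for one training iteration

print(step)  # 29999: 20,000 *additional* steps, ending near 30,000
             # total instead of stopping at 20,000
```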
Where is this indefinite training coming from?