Describe the bug
After 2000 steps of the ns-train command (nerfstudio==1.1.3), training crashes with the following output.
Printing profiling stats, from longest to shortest duration in seconds
VanillaPipeline.get_average_eval_image_metrics: 0.2351
VanillaPipeline.get_average_image_metrics: 0.2226
VanillaPipeline.get_eval_image_metrics_and_images: 0.0689
Trainer.train_iteration: 0.0501
VanillaPipeline.get_train_loss_dict: 0.0414
Trainer.eval_iteration: 0.0010
Traceback (most recent call last):
  File "/usr/local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 262, in entrypoint
    main(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 247, in main
    launch(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 298, in train
    self.eval_iteration(step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/decorators.py", line 70, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 545, in eval_iteration
    metrics_dict, images_dict = self.pipeline.get_eval_image_metrics_and_images(step=step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/pipelines/base_pipeline.py", line 341, in get_eval_image_metrics_and_images
    metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/models/splatfacto.py", line 926, in get_image_metrics_and_images
    combined_rgb = torch.cat([gt_rgb, predicted_rgb], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 239 but got size 959 for tensor number 1 in the list.
To Reproduce
It happens locally while using the viewer, but I don't have a simple way to reproduce it.
Expected behavior
No RuntimeError is raised here.
Screenshots
If applicable, add screenshots to help explain your problem.
try:
    metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
except Exception:
    # The call above fails because model.training == True for some reason,
    # so switch to eval mode and retry.
    self.model.eval()
    metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
It is unclear why the model is still in train mode at this line, but this workaround resolved the error for me.
Additional context
When I set a PDB breakpoint there for quick debugging, I saw that the model was somehow in train mode instead of eval mode. Replacing https://github.com/nerfstudio-project/nerfstudio/blob/9b3cbc79bf239eb3c69e7c288632aab02c4f0bb1/nerfstudio/pipelines/base_pipeline.py#L341 with the try/except block shown above fixed the error for me, but it probably needs a more permanent fix.
I had the same error at the following line and the same fix worked. https://github.com/nerfstudio-project/nerfstudio/blob/9b3cbc79bf239eb3c69e7c288632aab02c4f0bb1/nerfstudio/pipelines/base_pipeline.py#L388
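A slightly less ad hoc variant of the workaround might force eval mode for the duration of the metrics call and restore the previous mode afterwards, instead of retrying after a failure. The sketch below is only an illustration: `eval_mode` and `DummyModel` are hypothetical names, not part of nerfstudio, and it assumes only that the model exposes `training`, `eval()` and `train(mode)` the way `torch.nn.Module` does.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model):
    """Temporarily put `model` in eval mode, then restore its previous mode."""
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        # Restore whatever mode the model was in before entering the block.
        model.train(was_training)


class DummyModel:
    """Hypothetical stand-in for torch.nn.Module so the sketch runs without torch."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


model = DummyModel()
with eval_mode(model):
    mode_inside = model.training  # eval mode is active inside the block
mode_after = model.training  # previous (train) mode is restored afterwards
```

In the pipeline this would wrap the failing call, e.g. `with eval_mode(self.model): metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)`. It still does not explain why the mode flips in the first place, so it is a guard rather than a root-cause fix.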