nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Splatfacto fails due to the model being in train mode instead of eval mode #3253

Open kstoneriv3 opened 5 days ago

kstoneriv3 commented 5 days ago

Describe the bug: After 2000 steps of the ns-train command (nerfstudio==1.1.3), training crashes with the following output.

Printing profiling stats, from longest to shortest duration in seconds
VanillaPipeline.get_average_eval_image_metrics: 0.2351
VanillaPipeline.get_average_image_metrics: 0.2226
VanillaPipeline.get_eval_image_metrics_and_images: 0.0689
Trainer.train_iteration: 0.0501
VanillaPipeline.get_train_loss_dict: 0.0414
Trainer.eval_iteration: 0.0010
Traceback (most recent call last):
  File "/usr/local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 262, in entrypoint
    main(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 247, in main
    launch(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 298, in train
    self.eval_iteration(step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/decorators.py", line 70, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 545, in eval_iteration
    metrics_dict, images_dict = self.pipeline.get_eval_image_metrics_and_images(step=step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/pipelines/base_pipeline.py", line 341, in get_eval_image_metrics_and_images
    metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/models/splatfacto.py", line 926, in get_image_metrics_and_images
    combined_rgb = torch.cat([gt_rgb, predicted_rgb], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 239 but got size 959 for tensor number 1 in the list.

To Reproduce: It happens locally while using the viewer, but I don't have a simple way to reproduce it.

Expected behavior: No RuntimeError is raised here.

Additional context: When I dropped into the PDB debugger at that line for quick debugging, I saw that the model was somehow in train mode instead of eval mode. Replacing https://github.com/nerfstudio-project/nerfstudio/blob/9b3cbc79bf239eb3c69e7c288632aab02c4f0bb1/nerfstudio/pipelines/base_pipeline.py#L341 with the following fixed the error for me, but it probably needs a more permanent fix.


        try:
            metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)
        except Exception:
            self.model.eval()  # The code fails here due to model.training == True for some reason.
            metrics_dict, images_dict = self.model.get_image_metrics_and_images(outputs, batch)

It is unclear why the model is still in train mode at this line, but this workaround resolved the error for me.

I had the same error at the following line and the same fix worked. https://github.com/nerfstudio-project/nerfstudio/blob/9b3cbc79bf239eb3c69e7c288632aab02c4f0bb1/nerfstudio/pipelines/base_pipeline.py#L388

jb-ye commented 3 days ago

I have seen a similar issue before and don't yet have a clue how the viewer can mutate the training/eval mode of models.
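
One way to narrow down what changes the mode might be to record a stack trace whenever it changes. The sketch below is a hypothetical debugging aid, not part of nerfstudio: `trace_mode_changes` and `_FakeModel` are made-up names, and it assumes that `eval()` is implemented as `train(False)`, as it is in `torch.nn.Module`, so wrapping `train` on the instance sees switches in both directions.

```python
import threading
import traceback


def trace_mode_changes(model, log):
    """Wrap `model.train` so every mode change records who requested it.

    Hypothetical debugging aid, not part of nerfstudio. Assumes `eval()` is
    implemented via `train(False)`, as in torch.nn.Module.
    """
    original_train = model.train

    def traced_train(mode=True):
        if mode != model.training:
            # Record old mode, new mode, calling thread, and a short call stack.
            stack = "".join(traceback.format_stack(limit=8))
            thread_name = threading.current_thread().name
            log.append((model.training, mode, thread_name, stack))
        return original_train(mode)

    # Shadow the bound method on this instance only.
    model.train = traced_train
    return model


class _FakeModel:
    """Stand-in with the same mode-switching interface as torch.nn.Module."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


log = []
model = trace_mode_changes(_FakeModel(), log)
model.eval()   # train -> eval, recorded
model.eval()   # already in eval mode, not recorded
model.train()  # eval -> train, recorded
```

Each entry in `log` holds the old mode, the new mode, the name of the calling thread, and the formatted stack, which should show which code path requested the switch back to train mode.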