Splatfacto training crashes after 18k steps when training on VR-NeRF dataset

bernard0047 commented 6 months ago

Description

When training Splatfacto on a VR-NeRF dataset, it crashes after 17910 steps with this error:

  File "/root/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

After 17910 steps, the loss being calculated here no longer carries a gradient (it has no grad_fn). I verified this by printing the loss to the console, but I am not sure why it happens.

To Reproduce

Steps to reproduce the behavior:

1. Install a scene from the VR-NeRF dataset:
   ns-download-data eyefultower --capture-name apartment --resolution-name jpeg_1k
2. Run:
   ns-train splatfacto --data eyefultower/apartment/images-jpeg-1k/transforms.json

Additional context

From the viewer, it seems like the model isn't learning anything either. I'm also unable to run the viewer after this update.

jb-ye commented 6 months ago

Let me take a look into this issue.

jb-ye commented 6 months ago

Could you try the following?

ns-train splatfacto --data eyefultower/apartment/images-jpeg-1k/transforms_300.json --logging.local-writer.max-log-size=0

Splatfacto currently doesn't work well when there are more than 1k training cameras. See https://github.com/nerfstudio-project/nerfstudio/issues/2927
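
For a capture that does not come with a reduced camera file like transforms_300.json, one way to stay under the camera limit might be to subsample the frames yourself. Below is a minimal sketch, assuming the transforms file is JSON with a top-level "frames" list (as in the nerfstudio data format). The names subsample_frames and write_subset are hypothetical helpers, not part of nerfstudio, and this is not necessarily how transforms_300.json was produced. If the file also carries per-split filename lists, those would need the same filtering.

```python
import json
from pathlib import Path


def subsample_frames(meta: dict, max_frames: int) -> dict:
    """Return a shallow copy of `meta` with at most `max_frames` evenly spaced frames."""
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    frames = meta["frames"]  # assumption: top-level "frames" list
    n = len(frames)
    if n <= max_frames:
        kept = list(frames)
    else:
        # Pick evenly spaced indices over the original frame order.
        kept = [frames[(i * n) // max_frames] for i in range(max_frames)]
    out = dict(meta)  # keep camera parameters and other top-level keys unchanged
    out["frames"] = kept
    return out


def write_subset(src: str, dst: str, max_frames: int) -> int:
    """Read `src`, write the subsampled transforms to `dst`, return the frame count."""
    meta = json.loads(Path(src).read_text())
    subset = subsample_frames(meta, max_frames)
    Path(dst).write_text(json.dumps(subset, indent=2))
    return len(subset["frames"])
```

Write the subset next to the original transforms.json so that relative image paths still resolve, then pass the new file to ns-train with --data as in the command above.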

nerfstudio-project / nerfstudio

Splatfacto training crashes after 18k steps when training on VR-NeRF dataset #3152