nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Debugging training in Google Colab #1113

Open vvdb101 opened 1 year ago

vvdb101 commented 1 year ago

How exactly does debugging in nerfstudio work, particularly in Google Colab?

Specifically, I currently want to test my dataparser for a new dataset, which I implemented following the nerfstudio Developer Guides, including the necessary changes to configs and such. When I start training via the ns-train command, training runs for only a few seconds before stopping without any meaningful error messages (see screenshot below). A similar outcome is given when using the test_train.py script. Also, I can't print anything to the console with print() as a last resort, at least not downstream from the training_loop call.

I believe this has something to do with the tyro CLI setup, which I am not awfully familiar with. I am sure I am missing something obvious, but can't quite figure out how to get meaningful debugging output in Google Colab. Unfortunately, since I don't have any other access to GPUs, I am relying on the colab setup. Also, I am not sure if running the training on e.g. Linux via the terminal would yield a different output.

Any pointers would be hugely appreciated!

Screenshot A (running ns-train) image

Screenshot B (running scripts/test_train.py): image

P.S. Not sure if this issue section is the right place to ask this. I also tried joining your Discord, but the link seems to be invalid.

akristoffersen commented 1 year ago

This is an ongoing issue unfortunately with Colab. I think it might have something to do with our rich console output system. Perhaps there's a limit to the rate of console output for Colab or something. From my experience, if you leave it training for a while the cell output eventually shows up, though this is also a bit spotty.

Running on a Linux command line should work fine; I think this is just an issue within Colab.

vvdb101 commented 1 year ago

Ok, thanks for the context!

I am aware (e.g. through training nerfacto on the provided "poster" example dataset) that it takes a while before the training epochs are printed to the console, but when I try to debug my faulty code, the cell actually terminates after e.g. 8 seconds without any further output. From that I would think that it's not just a bandwidth issue, but not sure.

Anyways, thanks for the feedback.

vvdb101 commented 1 year ago

Helpful observation: If you have Colab Pro, you can execute the commands in Colab's console and get more detailed error messages.
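
For anyone without terminal access, one possible workaround is to launch the command from a Python cell and write its combined output to a log file, so that a traceback is not lost when the cell output stays empty. This is only a sketch: `run_and_log` is a hypothetical helper, not part of nerfstudio, and it collects the output only after the process exits, which suits a run that crashes within seconds.

```python
import subprocess
from pathlib import Path
from typing import List


def run_and_log(cmd: List[str], log_path: str) -> int:
    """Run cmd, write its stdout and stderr to log_path, return the exit code.

    Hypothetical helper for illustration; not part of nerfstudio.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so tracebacks land in the log
        text=True,
    )
    Path(log_path).write_text(result.stdout)
    return result.returncode
```

A call such as `run_and_log(["ns-train", "nerfacto", "--data", "data/my_dataset"], "train.log")` (the data path is a placeholder) would then leave the full output in `train.log`, which can be opened from the Colab file browser.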

m0o0scar commented 1 year ago

Hi guys. I'm having the same issue on Colab. The training stopped after the code block had run for a few seconds. I ran the same ns-train command in the Colab terminal and found this error:

...
Sending ping to the viewer Bridge Server...
Successfully connected.
Sending ping to the viewer Bridge Server...
Successfully connected.
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
Disabled tensorboard/wandb event writers
Printing profiling stats, from longest to shortest duration in seconds
Traceback (most recent call last):
  File "/usr/local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/site-packages/scripts/train.py", line 247, in entrypoint
    main(
  File "/usr/local/lib/python3.8/site-packages/scripts/train.py", line 233, in main
    launch(
  File "/usr/local/lib/python3.8/site-packages/scripts/train.py", line 172, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/usr/local/lib/python3.8/site-packages/scripts/train.py", line 87, in train_loop
    trainer.train()
  File "/usr/local/lib/python3.8/site-packages/nerfstudio/engine/trainer.py", line 203, in train
    self._init_viewer_state()
  File "/usr/local/lib/python3.8/site-packages/nerfstudio/utils/decorators.py", line 58, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/nerfstudio/engine/trainer.py", line 287, in _init_viewer_state
    self.viewer_state.init_scene(
  File "/usr/local/lib/python3.8/site-packages/nerfstudio/utils/decorators.py", line 82, in wrapper
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/nerfstudio/viewer/server/viewer_utils.py", line 357, in init_scene
    self.vis["renderingState/export_path"].write(timestamp_match[-1])
IndexError: list index out of range

I believe this line of code is causing the issue: https://github.com/nerfstudio-project/nerfstudio/blob/9dc7fc2e44e8bebbe09984d815dd9e6501a6ee63/nerfstudio/viewer/server/viewer_utils.py#LL357C39-L357C39

And it can be fixed simply by checking that timestamp_match is non-empty first:

+ if timestamp_match:
    self.vis["renderingState/export_path"].write(timestamp_match[-1])
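
The guard above can be tried in isolation. The sketch below assumes that `timestamp_match` is the list returned by a regex `findall` over the log directory path. The pattern and the helper name `last_timestamp` are hypothetical and are not taken from the nerfstudio source.

```python
import re
from typing import Optional

# Hypothetical pattern for a run directory timestamp such as 2023-01-05_123456.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}_\d{6}")


def last_timestamp(path: str) -> Optional[str]:
    """Return the last timestamp-like substring in path, or None if absent."""
    matches = TIMESTAMP_RE.findall(path)
    # findall returns an empty list when nothing matches, so matches[-1]
    # would raise IndexError; check for emptiness before indexing.
    return matches[-1] if matches else None
```

With this shape, a path without a timestamp yields `None` instead of raising, and the caller can skip the `write` call in that case.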

AntonioMacaronio commented 9 months ago

> This is an ongoing issue unfortunately with Colab. I think it might have something to do with our rich console output system. Perhaps there's a limit to the rate of console output for Colab or something. From my experience, if you leave it training for a while the cell output eventually shows up, though this is also a bit spotty.
>
> Running on a Linux command line should work fine; I think this is just an issue within Colab.

This is exactly right. Python 3 currently has a bug in how it reports the terminal window size, which affects the output of the rich console. The same bug shows up in other libraries as well; here is an example: https://github.com/rsalmei/alive-progress/issues/157

The issue will be fixed when Colab upgrades to Python 3.11 in April 2024. The change was not backported to Python 3.10 (which Colab currently runs) because it was deemed too minor, but it affects ns-train, which will not be able to train our NeRFs. More on this is described here: https://github.com/python/cpython/issues/86340
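
Until then, one thing that may be worth trying is to pin the terminal size through the `COLUMNS` and `LINES` environment variables, which `shutil.get_terminal_size()` checks before it queries the terminal. Whether this is enough to make ns-train behave inside a Colab cell is an assumption and has not been verified here. `force_terminal_size` is a hypothetical helper name.

```python
import os
import shutil


def force_terminal_size(columns: int = 120, lines: int = 40) -> os.terminal_size:
    """Pin the reported terminal size via COLUMNS and LINES.

    Hypothetical helper for illustration. shutil.get_terminal_size() reads
    these variables first, so later calls in this process, and in child
    processes that inherit the environment, see the pinned values.
    """
    os.environ["COLUMNS"] = str(columns)
    os.environ["LINES"] = str(lines)
    return shutil.get_terminal_size()
```

Calling `force_terminal_size()` in a cell before starting training would set the variables for the notebook process; commands started from that process normally inherit them.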