lsongx / nerfplayer-nerfstudio


Not training and Nerfstudio viewer doesn't show scene #3

Closed orrblue closed 1 year ago

orrblue commented 1 year ago

Hi, I ran the code with the linked data: mochi-high-five. However, 1) training does not seem to be occurring, as the terminal doesn't show progress and the computer doesn't sound like it's working hard, and 2) in Nerfstudio's viewer, the input images show up but the scene does not appear in the viewport.

In the image below, I show the terminal, the folder of outputs/mochi-high-five/nerfplayer-ngp, and the NerfStudio webpage. In the terminal, please disregard the last line with 7007; those were erroneous keyboard entries made while I took this photo.

[Image: IMG_9097]

I am able to successfully train Nerfstudio's nerfacto and view the scene with their provided data (the "poster" dataset) so I don't think it's an issue with my installation of NerfStudio. Thank you!

orrblue commented 1 year ago

I'm not sure if I need to do the below step in the instruction, but if I do, I'm not sure how to do port forwarding. "Connect to the viewer by forwarding the viewer port, and click the link to viewer.nerf.studio provided in the output of the train script"

By the way, my system is Ubuntu 20.04 with an RTX 2070 GPU.

lsongx commented 1 year ago

> I'm not sure if I need to do the below step in the instruction, but if I do, I'm not sure how to do port forwarding. "Connect to the viewer by forwarding the viewer port, and click the link to viewer.nerf.studio provided in the output of the train script"

This part should be the same as nerfstudio.

I think we need to first find out where it gets stuck. Two simple ideas for doing this:

  1. When the program is stuck, if you hit ctrl+c to exit, is there any info about the exiting point?
  2. This is stupid, but I always do this: add some print statements in the pipeline, and then see where the program stops.
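A variation on idea 1 that does not require killing the run: Python's built-in `faulthandler` module can print every thread's stack at a fixed interval, so the last dump shows where the program is sitting. This is only a minimal sketch, assuming you can add a couple of lines near the top of the entry script. `enable_hang_dump` and `disable_hang_dump` are hypothetical helper names, not part of nerfstudio or this repo.

```python
import faulthandler
import sys


def enable_hang_dump(interval_s=60.0, file=None):
    """Dump all thread stacks to `file` every `interval_s` seconds."""
    if file is None:
        file = sys.stderr
    # Also dump tracebacks on fatal signals such as SIGSEGV.
    faulthandler.enable(file=file)
    # repeat=True keeps dumping until cancelled, so a long stall shows the same frame again and again.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=file)


def disable_hang_dump():
    """Stop the periodic dumps once the slow section has finished."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the part that seems to hang:
#   enable_hang_dump(interval_s=30.0)
#   ... start training ...
#   disable_hang_dump()
```

If the same stack frame shows up in several consecutive dumps, that is a good place to start adding prints.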

orrblue commented 1 year ago

Thanks for your suggestions! I looked through the info about the exiting point, and at my system's RAM usage. It appears I run out of my 16GB RAM during this process, at which point the process is either killed for lack of memory or hangs indefinitely. The hang seems to be caused by a PyTorch lock file that isn't deleted automatically when an earlier run is killed. I need to fix the memory usage issue first, but for anyone with a lock file issue, see https://github.com/zhou13/neurvps/issues/1#issuecomment-820898095 for more information.
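For the lock file part, here is a rough sketch of how one might look for leftovers. It assumes the lock is the file named `lock` that PyTorch's C++ extension builder leaves in its build cache, which defaults to `~/.cache/torch_extensions` unless `TORCH_EXTENSIONS_DIR` is set. I have not confirmed that this is the exact file involved here, and `find_stale_locks` is a hypothetical helper name.

```python
import os
from pathlib import Path


def find_stale_locks(root=None):
    """Return files named 'lock' under the (assumed) PyTorch extension build cache."""
    if root is None:
        # Assumption: the default cache location used by torch.utils.cpp_extension.
        root = os.environ.get(
            "TORCH_EXTENSIONS_DIR", Path.home() / ".cache" / "torch_extensions"
        )
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("lock") if p.is_file())


# Hypothetical usage: list candidates and inspect them before deleting anything.
#   for path in find_stale_locks():
#       print(path)
```

Only remove a lock when no other process is still compiling the extension; otherwise the build it protects can be corrupted.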

As memory usage is the foremost issue (16GB RAM, RTX 2070 8GB GPU) I'll look into how to reduce memory consumption. I have some ideas below, but if you have any additional tips, I would appreciate it!

From nerfstudio's Discord, I understand I can reduce a number of parameters using the below command --pipeline.datamanager.X where X can be any of these:

  - eval-num-rays-per-batch
  - train-num-rays-per-batch
  - train-num-images-to-sample-from -- How many images to train across at a given time
  - train-num-times-to-repeat-images -- How many iterations before swapping out those images with a new set

I could also simply use fewer input images.
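To make the flag pattern concrete, here is a minimal sketch that assembles such a command line. It assumes the method is launched as `ns-train nerfplayer-ngp` (inferred from the outputs folder name); the numeric values are arbitrary illustrations rather than recommended settings, and `build_train_command` is a hypothetical helper. The data path argument is left out on purpose.

```python
import shlex


def build_train_command(overrides, method="nerfplayer-ngp"):
    """Return an argv list with --pipeline.datamanager.X overrides appended."""
    cmd = ["ns-train", method]
    for name, value in overrides.items():
        cmd += [f"--pipeline.datamanager.{name}", str(value)]
    return cmd


# Illustrative values only (assumptions, not tuned settings).
example = build_train_command(
    {
        "train-num-rays-per-batch": 1024,
        "eval-num-rays-per-batch": 1024,
        "train-num-images-to-sample-from": 50,
        "train-num-times-to-repeat-images": 100,
    }
)
print(shlex.join(example))
```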

orrblue commented 1 year ago

As an update, I tried the ideas I wrote above, but they seemed to have no effect on memory consumption for nerfplayer-ngp. The high memory usage happens before training (I think) so perhaps the model or image data is large? Not sure what to do about it, and would appreciate any suggestions. Thank you

orrblue commented 1 year ago

I ended up using a computer with more RAM and GPU Memory. Turns out that the initial run of the system required up to about 19GB of RAM for a few minutes, which crashed the program with an out-of-memory error when I ran it on the previous computer.
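For anyone who wants to check this before launching, here is a small sketch that reads `MemAvailable` from `/proc/meminfo` on Linux and compares it with the roughly 19GB peak I saw. The threshold is just my observation, not a documented requirement, and both function names are hypothetical.

```python
def mem_available_gib(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo contents, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / 1024**2
    raise ValueError("MemAvailable not found")


def has_headroom(meminfo_text, needed_gib=19.0):
    """True if the available RAM covers the assumed initial peak."""
    return mem_available_gib(meminfo_text) >= needed_gib


# Hypothetical usage on Linux:
#   with open("/proc/meminfo") as f:
#       print(has_headroom(f.read()))
```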

For anyone wondering: thereafter, total RAM usage of my computer was under 10GB and GPU memory usage was under 5GB during training on the mochi-high-five dataset. For the block dataset, usage was 14+ GB of RAM and 8+ GB of GPU memory.
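If you want comparable numbers for your own run, one option is to log the peak resident memory of the Python process with the standard `resource` module. This is a sketch of that approach, not how the figures above were collected; it covers host RAM only, not GPU memory, and `peak_rss_gib` is a hypothetical helper.

```python
import resource
import sys


def peak_rss_gib(who=resource.RUSAGE_SELF):
    """Peak resident set size of this process, in GiB."""
    peak = resource.getrusage(who).ru_maxrss
    # Linux reports ru_maxrss in KiB, while macOS reports it in bytes.
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 1024**3


# Hypothetical usage: call at the end of a run, or periodically during training.
print(f"peak RAM so far: {peak_rss_gib():.2f} GiB")
```

Passing `resource.RUSAGE_CHILDREN` instead reports the peak over child processes that have already exited, which is useful when the training is launched as a subprocess.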

lsongx commented 1 year ago

Happy to see that you solved it, and thank you for reporting your experience here -- valuable info! 👍 I feel that the model itself won't need extra RAM. (I could be totally wrong.) Maybe it is related to the dynamic dataset, which has many more images than static scenes. Though not all images are cached, maintaining a buffer could get more expensive.

p-Cyan commented 1 year ago

I think the initial high RAM usage was because of NeRFAcc setup. It usually uses around 16-19 GB of RAM until it finishes setting it up.

orrblue commented 1 year ago

Thanks for the info!