nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

COLMAP crashed causing the machine to restart #3332

Open YuhsiHu opened 3 months ago

YuhsiHu commented 3 months ago

Describe the bug

When using nerfstudio to process image data, the computer restarts abruptly during the bundle adjustment (BA) stage. This does not happen with a small number of images, but it does happen with a large set (around 400 images at 4000×6000).

Hardware

GPU: 4090, 24G; Mem: 126G; Swap: 2G

To Reproduce

Steps to reproduce the behavior:

  1. Run ns-process-data
  2. During the BA stage, the machine reboots.

Expected behavior

After BA completes, the sparse reconstruction result is produced.

Additional context

SharkWipf commented 3 months ago

If your machine restarts, this suggests a hardware issue or other external issue on your end. There is nothing in COLMAP or Nerfstudio (or any other non-kernelspace program for that matter) that would be able to cause a machine restart under normal conditions (short of NVIDIA driver bugs or completely running out of RAM or disk space, but that seems unlikely here).

Given the triggering conditions, are you sure your power supply is up to the task, and your thermals are fine? Running these kinds of workloads puts your system (particularly your GPU, and the 4090 is an extremely power-hungry beast) under a lot of strain, and while most power supplies are designed to handle short spikes above their specification, sustained power draw above what they're rated for will trigger failsafes causing a power cut or restart. Similarly, other hardware/thermal errors will also trigger failsafes causing restarts when things become too extreme.

No regular software should ever be able to cause a crash so badly that your system restarts unless there are hardware or kernel driver issues involved, and the latter seem unlikely here.
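To check the power and thermal theory, one option is to log GPU readings to disk while the job runs, so the last values before a restart can be read afterwards. The following is a minimal sketch, not part of nerfstudio or COLMAP. It assumes `nvidia-smi` is on the PATH and that the driver reports `power.draw`, `temperature.gpu`, and `memory.used`. The function names, the log file name, and the sampling interval are hypothetical.

```python
import os
import subprocess
import time

# Properties requested through nvidia-smi's --query-gpu option.
FIELDS = ["power.draw", "temperature.gpu", "memory.used"]


def parse_gpu_line(line):
    """Parse one CSV line of nvidia-smi output into {field: float or None}."""
    values = [v.strip() for v in line.split(",")]
    parsed = {}
    for name, raw in zip(FIELDS, values):
        try:
            parsed[name] = float(raw)
        except ValueError:
            # Readings the driver cannot provide are printed as text such as "[N/A]".
            parsed[name] = None
    return parsed


def query_gpu():
    """Return one parsed reading per GPU."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


def log_gpu(path="gpu_log.csv", interval_s=1.0, samples=60):
    """Append timestamped readings, forcing each batch to disk."""
    with open(path, "a") as log:
        for _ in range(samples):
            for idx, reading in enumerate(query_gpu()):
                row = [time.strftime("%H:%M:%S"), str(idx)]
                row += [str(reading[name]) for name in FIELDS]
                log.write(",".join(row) + "\n")
            # flush + fsync so the most recent samples survive a sudden power cut.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling `log_gpu(samples=3600)` in a second terminal before starting the processing run would leave a record of power draw (W), temperature (°C), and GPU memory use (MiB) up to roughly the moment of the restart. This only covers the GPU; CPU temperature and PSU behaviour would need other tools.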

YuhsiHu commented 3 months ago

Thank you for your reply. This new machine has been used in recent months to run MVS, NeRF, and Gaussian Splatting programs, and this is the first time I have encountered this problem.

The program can run normally in the following cases:

Based on the above and on what I observed while the program was running (only a small fraction of the CPU and GPU was in use), I suspect that an error occurs when files are written after BA and that some resource is suddenly used up, causing the restart.

I will try to process this dataset on other machines and report back with the results.
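The two resource-related explanations raised in this thread (running out of RAM or disk space, or a sudden spike when files are written after BA) could be tested by sampling free memory, swap, and disk space at short intervals and syncing each sample to disk. This is a rough sketch under stated assumptions: a Linux machine with `/proc/meminfo`, and `output_dir` pointing at the directory the processing step writes to. All function names, the log file name, and the interval are hypothetical.

```python
import os
import shutil
import time


def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into {key: integer value}.

    Memory fields are reported in kB; a few fields are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def sample(output_dir="."):
    """Take one reading of available RAM, free swap, and free disk space (GiB)."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    disk = shutil.disk_usage(output_dir)
    return {
        "mem_available_gib": mem.get("MemAvailable", 0) / 2**20,
        "swap_free_gib": mem.get("SwapFree", 0) / 2**20,
        "disk_free_gib": disk.free / 2**30,
    }


def log_resources(log_path="resource_log.txt", output_dir=".",
                  interval_s=0.5, samples=None):
    """Append samples until `samples` is reached (or forever if None)."""
    count = 0
    with open(log_path, "a") as log:
        while samples is None or count < samples:
            s = sample(output_dir)
            log.write(
                f"{time.strftime('%H:%M:%S')} "
                f"mem_avail={s['mem_available_gib']:.1f}GiB "
                f"swap_free={s['swap_free_gib']:.1f}GiB "
                f"disk_free={s['disk_free_gib']:.1f}GiB\n"
            )
            # flush + fsync so the last line is on disk if the machine reboots.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
            count += 1
```

If the final lines of the log show available memory or free disk space dropping toward zero just before the restart, that would support the resource-exhaustion explanation. If they stay flat, a power or thermal cause becomes more plausible. A spike shorter than the sampling interval could still be missed.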