hugoycj / Instant-angelo

Instant-angelo: Build high-fidelity Digital Twin within 20 Minutes!
MIT License
427 stars 28 forks

Missing ckpt and .obj upon training completion #46

Open ClayOfMan opened 7 months ago

ClayOfMan commented 7 months ago

OS: Ubuntu 22.04; GPU: RTX 3060 12 GB; CUDA 11.3; gcc/g++ are gcc-9/g++-9, selected via update-alternatives

When training completes, the expected .obj and ckpt files are not in the resulting /exp/neuralangelo-colmap_sparse-gerrard-hall/@xxx-xxx folder.

Following https://github.com/hugoycj/Instant-angelo/issues/36, I tried exporting manually, which resulted in this error:

(instant-angelo) levis@levis-DesktopLinux:~/Desktop/Instant-angelo$ python export.py --exp_dir exp/neuralangelo-colmap_sparse-gerrard-hall/@20240413-172649 --res 1024
INFO:root:Start exporting.
Traceback (most recent call last):
  File "export.py", line 83, in <module>
    main()
  File "export.py", line 30, in main
    latest_ckpt = sorted(os.listdir(ckpt_dir), key=lambda s: int(s.split('-')[0].split('=')[1]), reverse=True)[0]
FileNotFoundError: [Errno 2] No such file or directory: 'exp/neuralangelo-colmap_sparse-gerrard-hall/@20240413-172649/ckpt'

This shows that the ckpt files are not present either.
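As an aside, the traceback arises because export.py lists the ckpt directory without first checking that it exists. A defensive version of that lookup could look like the sketch below (`find_latest_ckpt` is a hypothetical helper name; the `epoch=N-step=M.ckpt` filename pattern is inferred from the parsing expression shown in the traceback):

```python
import os

def find_latest_ckpt(ckpt_dir):
    """Return the checkpoint filename with the highest epoch, or None.

    Assumes names like 'epoch=2-step=3000.ckpt', matching the pattern
    that export.py's sort key parses.
    """
    if not os.path.isdir(ckpt_dir):
        # Training was killed before any checkpoint was written.
        return None
    ckpts = [f for f in os.listdir(ckpt_dir) if f.endswith('.ckpt')]
    if not ckpts:
        return None
    return sorted(ckpts,
                  key=lambda s: int(s.split('-')[0].split('=')[1]),
                  reverse=True)[0]
```

Returning None would let the caller print a clear "no checkpoints found; training likely did not complete" message instead of a raw FileNotFoundError.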

Full Log:

(instant-angelo) levis@levis-DesktopLinux:~/Desktop/Instant-angelo$ bash run_neuralangelo-colmap_sparse.sh datasets/gerrard-hall
---sfm---
Sparse map datasets/gerrard-hall exist.  Aborting
---sparse_visualize---
---angelo_recon---
Global seed set to 42
Extracting surface at resolution 768 768 768
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1)` was configured so 1 batch will be used.
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA GeForce RTX 3060') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Loading sparse prior from datasets/gerrard-hall/sparse/0/points3D.bin
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type      | Params
------------------------------------
0 | model | NeuSModel | 28.0 M
------------------------------------
28.0 M    Trainable params
0         Non-trainable params
28.0 M    Total params
55.936    Total estimated model params size (MB)
Epoch 0: : 0it [00:00, ?it/s]Update finite_difference_eps to 0.040807057654505825
(  ●   ) NerfAcc: Setting up CUDA (This may take a few minutes the first time)run_neuralangelo-colmap_sparse.sh: line 15:  4103 Killed                  python launch.py --config configs/neuralangelo-colmap_sparse.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR
start time: 2024-04-13 17:39:16
sfm time: 2024-04-13 17:39:16
sparse_visualize finished:
angelo_recon finished: 2024-04-13 17:41:36
(instant-angelo) levis@levis-DesktopLinux:~/Desktop/Instant-angelo$ python export.py --exp_dir exp/neuralangelo-colmap_sparse-gerrard-hall/@20240413-172649 --res 1024
INFO:root:Start exporting.
Traceback (most recent call last):
  File "export.py", line 83, in <module>
    main()
  File "export.py", line 30, in main
    latest_ckpt = sorted(os.listdir(ckpt_dir), key=lambda s: int(s.split('-')[0].split('=')[1]), reverse=True)[0]
FileNotFoundError: [Errno 2] No such file or directory: 'exp/neuralangelo-colmap_sparse-gerrard-hall/@20240413-172649/ckpt'

How can I get a result from this training?

hugoycj commented 7 months ago

( ● ) NerfAcc: Setting up CUDA (This may take a few minutes the first time)run_neuralangelo-colmap_sparse.sh: line 15: 4103 Killed python launch.py --config configs/neuralangelo-colmap_sparse.yaml --gpu 0 --train dataset.root_dir=$INPUT_DIR

Sorry for the trouble. This seems to be the same issue as https://github.com/hugoycj/Instant-angelo/issues/41. I will try to fix it this week.
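For what it's worth, a bare `Killed` with no Python traceback usually means the Linux kernel's OOM killer terminated the process (likely host RAM rather than VRAM, since a CUDA out-of-memory error would raise a Python exception instead). One way to check, sketched below (reading the kernel log may require root, hence the fallback message):

```shell
# Look for OOM-killer records in the kernel log; dmesg may need root
# privileges, so fall back to a notice instead of failing silently.
dmesg 2>/dev/null | grep -iE 'out of memory|killed process' \
  || echo "no OOM record visible (empty log or insufficient permissions)"
```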

lexvandersluijs commented 6 months ago

I encountered the same issue with a 90-image dataset on an Ubuntu 20.04 laptop with an RTX 3080 Mobile 16 GB and 32 GB of system RAM. When I reduce the dataset to 23 images the code runs, but it already consumes 13 GB of VRAM. About 10 GB of CPU RAM is used, so that is less of a concern. If there is any way to reduce GPU memory consumption so that larger datasets can be supported, that would be fantastic. I will also try downscaling the images; right now they are 1920x1080, and I haven't checked whether Instant-angelo downscales them during execution. If it doesn't, that could be a solution as well.

Update: with the 23-image subset, the training stage completes almost entirely but then hits an OOM in the validation stage. The mesh extraction script also OOMs, no matter how I change the isosurface parameters, so I'm hypothesizing that the grid resolution is not the cause. For example, setting --res to 512 on the command line reduces the reported grid size from 1536 1536 1536 (the default, --res 1024) to 768 768 768, but the OOM remains.
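A rough back-of-envelope check is consistent with that hypothesis: even a dense grid at these resolutions is small next to 16 GB of VRAM once the resolution drops. A sketch, assuming 4 bytes per voxel (float32 SDF values) and ignoring any intermediate buffers:

```python
# Approximate memory footprint of a dense float32 grid at the
# resolutions reported in the logs above (4 bytes per voxel).
for res in (512, 768, 1024, 1536):
    gib = res ** 3 * 4 / 2 ** 30
    print(f"{res}^3 grid: {gib:.1f} GiB")
```

At 768^3 the grid itself is only about 1.7 GiB versus roughly 13.5 GiB at 1536^3, so if the OOM persists at 768^3, the grid is unlikely to be the sole cause.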