taichi-dev / taichi-nerfs

Implementations of NeRF variants based on Taichi + PyTorch
Apache License 2.0

./scripts/train_nsvf_lego.sh: line 11: 18824 Segmentation fault #76

Open guwenxiang1 opened 1 year ago

guwenxiang1 commented 1 year ago

Thanks for your great work. I encountered the following error when running the ./scripts/train_nsvf_lego.sh script. My GPU is an RTX 3090 and the system is Ubuntu 18.04.6 LTS. The error messages are shown in the details below:

```
[Taichi] version 1.7.0, llvm 15.0.4, commit a992f22e, linux, python 3.9.16
[Taichi] Starting on arch=cuda
Loading 100 train images ...
100it [00:02, 38.78it/s]
Loading 200 test images ...
200it [00:05, 39.27it/s]
Hash Encoder: base_res=16 max_res=1024 hash_level=16 feat_per_level=2 per_level_scale=0.2772588722239781 total_hash_size=5710032
Failed to import apex FusedAdam, use torch Adam instead.
[W 07/06/23 20:09:27.367 19027] [type_check.cpp:type_check_store@37] [$13858] Local store may lose precision: f16 <- f32

[W 07/06/23 20:09:27.367 19027] [type_check.cpp:type_check_store@37] [$13883] Local store may lose precision: f16 <- f32

[W 07/06/23 20:09:27.367 19027] [type_check.cpp:type_check_store@37] [$13908] Local store may lose precision: f16 <- f32

elapsed_time=2.19s | step=0 | psnr=10.84 | loss=0.082505 | rays=8192 | rm_s=246.9 | vr_s=246.9 |
elapsed_time=18.99s | step=1000 | psnr=28.49 | loss=0.001417 | rays=8192 | rm_s=26.7 | vr_s=14.3 |
elapsed_time=31.39s | step=2000 | psnr=31.23 | loss=0.000753 | rays=8192 | rm_s=24.6 | vr_s=9.8 |
elapsed_time=43.47s | step=3000 | psnr=31.70 | loss=0.000675 | rays=8192 | rm_s=24.9 | vr_s=9.0 |
elapsed_time=55.69s | step=4000 | psnr=32.43 | loss=0.000572 | rays=8192 | rm_s=24.3 | vr_s=8.3 |
elapsed_time=68.28s | step=5000 | psnr=33.79 | loss=0.000418 | rays=8192 | rm_s=22.9 | vr_s=8.1 |
elapsed_time=81.02s | step=6000 | psnr=33.98 | loss=0.000400 | rays=8192 | rm_s=23.7 | vr_s=7.1 |
elapsed_time=93.21s | step=7000 | psnr=34.45 | loss=0.000359 | rays=8192 | rm_s=23.5 | vr_s=7.0 |
elapsed_time=105.40s | step=8000 | psnr=35.15 | loss=0.000305 | rays=8192 | rm_s=23.7 | vr_s=7.0 |
elapsed_time=118.02s | step=9000 | psnr=35.66 | loss=0.000272 | rays=8192 | rm_s=23.4 | vr_s=6.7 |
elapsed_time=130.00s | step=10000 | psnr=35.29 | loss=0.000296 | rays=8192 | rm_s=23.0 | vr_s=6.6 |
elapsed_time=142.20s | step=11000 | psnr=34.47 | loss=0.000357 | rays=8192 | rm_s=23.4 | vr_s=6.6 |
elapsed_time=154.58s | step=12000 | psnr=35.71 | loss=0.000269 | rays=8192 | rm_s=24.0 | vr_s=6.5 |
elapsed_time=166.82s | step=13000 | psnr=36.06 | loss=0.000248 | rays=8192 | rm_s=23.7 | vr_s=6.6 |
elapsed_time=179.23s | step=14000 | psnr=36.39 | loss=0.000230 | rays=8192 | rm_s=22.9 | vr_s=6.3 |
elapsed_time=191.38s | step=15000 | psnr=36.37 | loss=0.000231 | rays=8192 | rm_s=23.4 | vr_s=6.4 |
elapsed_time=203.60s | step=16000 | psnr=36.54 | loss=0.000222 | rays=8192 | rm_s=23.4 | vr_s=6.6 |
elapsed_time=216.32s | step=17000 | psnr=37.19 | loss=0.000191 | rays=8192 | rm_s=23.1 | vr_s=6.5 |
elapsed_time=229.25s | step=18000 | psnr=37.12 | loss=0.000194 | rays=8192 | rm_s=22.7 | vr_s=6.1 |
elapsed_time=241.81s | step=19000 | psnr=37.59 | loss=0.000174 | rays=8192 | rm_s=22.7 | vr_s=6.4 |
elapsed_time=253.84s | step=20000 | psnr=37.10 | loss=0.000195 | rays=8192 | rm_s=22.5 | vr_s=6.1 |
evaluating: 0%| | 0/200 [00:00<?, ?it/s]
[W 07/06/23 20:13:40.575 18824] [type_check.cpp:type_check_store@37] [$28398] Global store may lose precision: u8 <- i32
File "/data2/gwx/taichi-nerfs/modules/ray_march.py", line 254, in raymarching_test_kernel:
    valid_mask[idx] = 1
    ^^^^^^^^^^^^^^^^^^^

evaluating: 100%|██████████| 200/200 [00:10<00:00, 19.48it/s]
evaluation: psnr_avg=34.72092056274414 | ssim_avg=0.9757876992225647
[Taichi] Starting on arch=cuda
Loading 100 train images ...
100it [00:02, 39.17it/s]
Hash Encoder: base_res=16 max_res=1024 hash_level=16 feat_per_level=2 per_level_scale=0.2772588722239781 total_hash_size=5710032
loading ckpt from: results/model.pth
./scripts/train_nsvf_lego.sh: line 11: 18824 Segmentation fault (core dumped) python3 train.py --root_dir $DATA_DIR/Lego --exp_name Lego --batch_size 8192 --lr 1e-2 --gui
```
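
As far as I can tell, the type-check messages are only precision warnings, not the crash itself. For reference, a minimal sketch of what triggers the "Global store may lose precision: u8 <- i32" warning at ray_march.py line 254 and how an explicit cast would silence it (the field name, dtype, and shape here are illustrative assumptions, not the repo's actual definitions):

```python
import taichi as ti

ti.init(arch=ti.cuda)

# Illustrative field; the real valid_mask in modules/ray_march.py may be defined differently.
valid_mask = ti.field(ti.u8, shape=1024)

@ti.kernel
def mark_valid(idx: ti.i32):
    # Storing the i32 literal 1 into a u8 element is what emits
    # "Global store may lose precision: u8 <- i32".
    # An explicit cast makes the narrowing intentional and silences the warning:
    valid_mask[idx] = ti.cast(1, ti.u8)

mark_valid(0)
```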

jexiaong commented 1 year ago

I'm also running into a similar issue, but I hit the segfault earlier than you did. I'm using an NVIDIA GeForce RTX 3080 Ti with Ubuntu 18.04.6 LTS under WSL:

```
[Taichi] version 1.7.0, llvm 15.0.4, commit a992f22e, linux, python 3.8.17
[W 07/06/23 14:41:38.472 29689] [cuda_driver.cpp:load_lib@36] libcuda.so lib not found.
[W 07/06/23 14:41:38.473 29689] [misc.py:adaptive_arch_select@747] Arch=[<Arch.cuda: 3>] is not supported, falling back to CPU
[Taichi] Starting on arch=x64
Loading 100 train images ...
100it [00:02, 48.16it/s]
Loading 200 test images ...
200it [00:04, 43.80it/s]
Hash Encoder: base_res=16 max_res=1024 hash_level=16 feat_per_level=2 per_level_scale=0.2772588722239781 total_hash_size=5710032
Failed to import apex FusedAdam, use torch Adam instead.
[W 07/06/23 14:41:49.776 29751] [type_check.cpp:type_check_store@37] [$14165] Local store may lose precision: f16 <- f32

[W 07/06/23 14:41:49.776 29751] [type_check.cpp:type_check_store@37] [$14190] Local store may lose precision: f16 <- f32

[W 07/06/23 14:41:49.776 29751] [type_check.cpp:type_check_store@37] [$14215] Local store may lose precision: f16 <- f32

./scripts/train_nsvf_lego.sh: line 11: 29689 Segmentation fault python3 train.py --root_dir $DATA_DIR/Lego --exp_name Lego --batch_size 8192 --lr 1e-2 --gui
```
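
Note that my log shows `libcuda.so lib not found` followed by a fallback to `arch=x64`, so the Taichi kernels ran on the CPU even though the GPU is present. A rough sanity check along these lines (an illustrative snippet, not part of taichi-nerfs) shows whether each framework actually sees CUDA:

```python
import taichi as ti
import torch

# Check whether PyTorch can see the GPU at all.
print("PyTorch CUDA available:", torch.cuda.is_available())

# Ask Taichi for CUDA explicitly; if libcuda.so cannot be loaded it falls back
# to the CPU and prints "[Taichi] Starting on arch=x64", matching the warning
# in the log above.
ti.init(arch=ti.cuda)
```

On WSL2 the driver library is usually exposed under /usr/lib/wsl/lib, so making sure that directory is visible to the loader (e.g. on LD_LIBRARY_PATH) is one thing worth checking before digging into the segfault itself.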