hustvl / 4DGaussians

[CVPR 2024] 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
https://guanjunwu.github.io/4dgs/
Apache License 2.0

train.py gives me "loss is nan" as soon as it starts training #142

Closed (ChaerinMin closed this issue 5 months ago)

ChaerinMin commented 5 months ago

Thank you for your great work! The NaN happens with both the dnerf and hypernerf datasets. It gives NaN at the first loss.backward(), so I cannot get any further 🥹


Optimizing
Output folder: ./output/hypernerf/3dprinter [12/06 16:21:54]
feature_dim: 48 [12/06 16:21:54]
load finished [12/06 16:21:54]
207it [00:00, 172231.88it/s]
format finished [12/06 16:21:54]
100%|██████████| 207/207 [00:02<00:00, 99.95it/s]
findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.
findfont: Generic family 'sans-serif' not found because none of the following families were found: Times New Roman
Loading Training Cameras [12/06 16:21:56]
Loading Test Cameras [12/06 16:21:56]
Loading Video Cameras [12/06 16:21:56]
Deformation Net Set aabb [14.06674004 19.73840332 36.85457611] [-12.12202358 -15.1347599 4.79065514] [12/06 16:21:56]
Voxel Plane: set aabb= Parameter containing:
tensor([[ 14.0667, 19.7384, 36.8546],
        [-12.1220, -15.1348, 4.7907]]) [12/06 16:21:56]
Number of points at initialisation : 96466 [12/06 16:21:57]
Training progress: 0%| | 0/3000 [00:00<?, ?it/s]data loading done [12/06 16:21:57]
loss is nan,end training, reexecv program now. [12/06 16:21:57]
Optimizing
Output folder: ./output/hypernerf/3dprinter [12/06 16:21:59]
feature_dim: 48 [12/06 16:21:59]
load finished [12/06 16:21:59]
207it [00:00, 176827.07it/s]
format finished [12/06 16:21:59]
100%|██████████| 207/207 [00:02<00:00, 98.22it/s]
findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.
findfont: Generic family 'sans-serif' not found because none of the following families were found: Times New Roman
Loading Training Cameras [12/06 16:22:01]
Loading Test Cameras [12/06 16:22:01]
Loading Video Cameras [12/06 16:22:01]
Deformation Net Set aabb [14.06674004 19.73840332 36.85457611] [-12.12202358 -15.1347599 4.79065514] [12/06 16:22:01]
Voxel Plane: set aabb= Parameter containing:
tensor([[ 14.0667, 19.7384, 36.8546],
        [-12.1220, -15.1348, 4.7907]]) [12/06 16:22:01]
Number of points at initialisation : 96466 [12/06 16:22:02]
Training progress: 0%| | 0/3000 [00:00<?, ?it/s]data loading done [12/06 16:22:02]
loss is nan,end training, reexecv program now. [12/06 16:22:02]
Optimizing
Output folder: ./output/hypernerf/3dprinter [12/06 16:22:04]
feature_dim: 48 [12/06 16:22:04]
load finished [12/06 16:22:04]
207it [00:00, 173713.67it/s]
format finished [12/06 16:22:04]
100%|██████████| 207/207 [00:02<00:00, 99.28it/s]
findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.
findfont: Generic family 'sans-serif' not found because none of the following families were found: Times New Roman
Loading Training Cameras [12/06 16:22:06]
Loading Test Cameras [12/06 16:22:06]
Loading Video Cameras [12/06 16:22:06]
Deformation Net Set aabb [14.06674004 19.73840332 36.85457611] [-12.12202358 -15.1347599 4.79065514] [12/06 16:22:06]
Voxel Plane: set aabb= Parameter containing:
tensor([[ 14.0667, 19.7384, 36.8546],
        [-12.1220, -15.1348, 4.7907]]) [12/06 16:22:06]
Number of points at initialisation : 96466 [12/06 16:22:06]

guanjunwu commented 5 months ago

Hi, can you check your CUDA version and PyTorch version? Most problems like this happen because of those versions :(
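
For anyone unsure how to collect those versions, here is a minimal sketch. It assumes PyTorch is importable as `torch` and that the CUDA toolkit's `nvcc` is on the PATH; anything missing is reported as `None`. The helper name `collect_versions` is made up for illustration and is not part of this repository.

```python
import platform
import shutil
import subprocess


def collect_versions():
    """Gather the Python, PyTorch, and CUDA versions asked about above.

    Hypothetical helper, not part of 4DGaussians. Components that cannot
    be found are left as None instead of raising.
    """
    info = {
        "python": platform.python_version(),
        "torch": None,           # PyTorch release string
        "torch_cuda": None,      # CUDA version the PyTorch binary was built with
        "cuda_available": None,  # whether PyTorch can see a CUDA device
        "nvcc": None,            # "release" line printed by `nvcc --version`
    }

    try:
        import torch
    except ImportError:
        pass  # PyTorch is not installed in this environment
    else:
        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()

    nvcc = shutil.which("nvcc")
    if nvcc is not None:
        try:
            result = subprocess.run(
                [nvcc, "--version"], capture_output=True, text=True
            )
        except OSError:
            result = None
        if result is not None:
            # Keep only the line that names the toolkit release.
            info["nvcc"] = next(
                (line.strip() for line in result.stdout.splitlines()
                 if "release" in line),
                None,
            )

    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Pasting the printed lines into the issue gives the maintainers the Python, PyTorch, and CUDA versions in one place.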

ChaerinMin commented 5 months ago

Yesss!!! I was on Python 3.7, PyTorch 1.13, and CUDA 11.7, and it produced NaN. After switching to Python 3.7, PyTorch 1.13, and CUDA 11.6, it worked. Amazing. Thanks for your help! It would be great if you could add this finding to the README to help future users. Thank you very much.
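
In case it helps others compare their setup against the one that worked here, a small sketch follows. It only encodes the single combination reported in this thread (one user's setup, not an official support matrix), and the names `KNOWN_GOOD`, `major_minor`, and `differences_from_known_good` are hypothetical. The patch versions in the example call are illustrative too.

```python
import re

# Combination reported as working in this thread. This is one user's
# environment, not a list of officially supported versions.
KNOWN_GOOD = {"python": "3.7", "torch": "1.13", "cuda": "11.6"}


def major_minor(version):
    """Reduce a version string such as '1.13.1+cu116' to '1.13'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def differences_from_known_good(python, torch, cuda):
    """Return {component: (found, expected)} for each component that differs."""
    found = {"python": python, "torch": torch, "cuda": cuda}
    return {
        name: (major_minor(value), KNOWN_GOOD[name])
        for name, value in found.items()
        if major_minor(value) != KNOWN_GOOD[name]
    }


if __name__ == "__main__":
    # The failing setup from this thread: only the CUDA version differs.
    print(differences_from_known_good("3.7.16", "1.13.1", "11.7"))
```

An empty result means the environment matches the reported working combination on major.minor versions; it does not guarantee that training will be NaN-free.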