kakaobrain / nerf-factory

An awesome PyTorch NeRF library
https://kakaobrain.github.io/NeRF-Factory
Apache License 2.0
1.27k stars · 106 forks

Loss NaN #13

Closed shawnsya closed 1 year ago

shawnsya commented 1 year ago

[screenshot: training log showing NaN loss]

While training MipNeRF360 on the nerf_360_v2 dataset, the loss turned out to be NaN. Config as follows:

```
# 360-v2 Specific Arguments
run.dataset_name = "nerf_360_v2"
run.datadir = "data/nerf_360_v2"

LitData.batch_sampler = "all_images"

# MipNeRF Standard Specific Arguments
run.model_name = "mipnerf360"
run.max_steps = 1000000
run.log_every_n_steps = 100

LitData.load_radii = True
LitData.batch_size = 4096
LitData.chunk = 4096
LitData.use_pixel_centers = True
LitData.epoch_size = 250000

LitDataNeRF360V2.near = 0.1
LitDataNeRF360V2.far = 1e6

MipNeRF360.opaque_background = True

run.grad_max_norm = 0.001
```

jeongyw12382 commented 1 year ago

Hi. Could you provide more details about the scene you used? We have run our code twice for each scene but did not observe any errors. Did you use your own custom data?

cococolorful commented 1 year ago

> Hi. Could you provide more details about the scene you used? We have run our code twice for each scene but did not observe any errors. Did you use your own custom data?

I also ran into this issue. It seems that the problem only occurs with mipnerf360; it works fine with the other models. I'm using data downloaded from http://storage.googleapis.com/gresearch/refraw360/360_v2.zip

jeongyw12382 commented 1 year ago

@cococolorful Oh, we should fix this if the error really exists. To be specific, can I ask which scenes you ran? We first need to reproduce the bug to find where the problem actually originates.

cococolorful commented 1 year ago

> @cococolorful Oh, we should fix this if the error really exists. To be specific, can I ask which scenes you ran? We first need to reproduce the bug to find where the problem actually originates.

@jeongyw12382 Thanks for your reply! In fact, for all the scenes in the nerf_360_v2 dataset, I hit the NaN problem at the very beginning of training. :sob: The dataset runs fine not only on the other models in NeRF-Factory, but also on multinerf. The following figure is a screenshot from training the garden scene; the other scenes behave the same.

[screenshot: garden training log with NaN loss]

Since I only have a 3090 graphics card, I adjusted the batch_size down to 1024.

[screenshot]

I can provide my environment configuration if this is useful.

environment.yml:

```yaml
name: nerf_factory
channels:
  - pytorch
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - blas=1.0=mkl
  - brotlipy=0.7.0=py38h0a891b7_1005
  - bzip2=1.0.8=h7f98852_4
  - ca-certificates=2022.9.24=ha878542_0
  - certifi=2022.9.24=pyhd8ed1ab_0
  - cffi=1.15.1=py38h4a40e3a_2
  - charset-normalizer=2.1.1=pyhd8ed1ab_0
  - cryptography=38.0.3=py38h80a4ca7_0
  - cudatoolkit=11.3.1=h9edb442_10
  - ffmpeg=4.3=hf484d3e_0
  - freetype=2.12.1=hca18f0e_0
  - gmp=6.2.1=h58526e2_0
  - gnutls=3.6.13=h85f3911_1
  - idna=3.4=pyhd8ed1ab_0
  - intel-openmp=2021.4.0=h06a4308_3561
  - jpeg=9e=h166bdaf_2
  - lame=3.100=h166bdaf_1003
  - lcms2=2.14=h6ed2654_0
  - ld_impl_linux-64=2.39=hc81fddc_0
  - lerc=4.0.0=h27087fc_0
  - libdeflate=1.14=h166bdaf_0
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=12.2.0=h65d4601_19
  - libgomp=12.2.0=h65d4601_19
  - libiconv=1.17=h166bdaf_0
  - libnsl=2.0.0=h7f98852_0
  - libpng=1.6.38=h753d276_0
  - libsqlite=3.40.0=h753d276_0
  - libstdcxx-ng=12.2.0=h46fd767_19
  - libtiff=4.4.0=h55922b4_4
  - libuuid=2.32.1=h7f98852_1000
  - libuv=1.44.2=h166bdaf_0
  - libwebp-base=1.2.4=h166bdaf_0
  - libxcb=1.13=h7f98852_1004
  - libzlib=1.2.13=h166bdaf_4
  - mkl=2021.4.0=h06a4308_640
  - mkl-service=2.4.0=py38h95df7f1_0
  - mkl_fft=1.3.1=py38h8666266_1
  - mkl_random=1.2.2=py38h1abd341_0
  - ncurses=6.3=h27087fc_1
  - nettle=3.6=he412f7d_0
  - numpy=1.23.4=py38h14f4228_0
  - numpy-base=1.23.4=py38h31eccc5_0
  - openh264=2.1.1=h780b84a_0
  - openjpeg=2.5.0=h7d73246_1
  - openssl=3.0.7=h166bdaf_0
  - pillow=9.2.0=py38h9eb91d8_3
  - pip=22.3.1=pyhd8ed1ab_0
  - pthread-stubs=0.4=h36c2ea0_1001
  - pycparser=2.21=pyhd8ed1ab_0
  - pyopenssl=22.1.0=pyhd8ed1ab_0
  - pysocks=1.7.1=pyha2e5f31_6
  - python=3.8.13=ha86cf86_0_cpython
  - python_abi=3.8=2_cp38
  - pytorch=1.11.0=py3.8_cuda11.3_cudnn8.2.0_0
  - pytorch-mutex=1.0=cuda
  - readline=8.1.2=h0f457ee_0
  - requests=2.28.1=pyhd8ed1ab_1
  - setuptools=65.5.1=pyhd8ed1ab_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.40.0=h4ff8645_0
  - tk=8.6.12=h27826a3_0
  - torchaudio=0.11.0=py38_cu113
  - torchvision=0.12.0=py38_cu113
  - typing_extensions=4.4.0=pyha770c72_0
  - urllib3=1.26.11=pyhd8ed1ab_0
  - wheel=0.38.4=pyhd8ed1ab_0
  - xorg-libxau=1.0.9=h7f98852_0
  - xorg-libxdmcp=1.1.3=h7f98852_0
  - xz=5.2.6=h166bdaf_0
  - zlib=1.2.13=h166bdaf_4
  - zstd=1.5.2=h6239696_4
  - pip:
    - absl-py==1.3.0
    - aiohttp==3.8.3
    - aiosignal==1.3.1
    - async-timeout==4.0.2
    - attrs==22.1.0
    - beautifulsoup4==4.11.1
    - cachetools==5.2.0
    - click==8.1.3
    - configargparse==1.5.3
    - docker-pycreds==0.4.0
    - filelock==3.8.0
    - fire==0.4.0
    - frozenlist==1.3.3
    - fsspec==2022.11.0
    - functorch==0.1.1
    - gdown==4.5.3
    - gin-config==0.5.0
    - gitdb==4.0.9
    - gitpython==3.1.29
    - google-auth==2.14.1
    - google-auth-oauthlib==0.4.6
    - grpcio==1.50.0
    - imageio==2.22.4
    - imageio-ffmpeg==0.4.7
    - importlib-metadata==5.0.0
    - lightning-utilities==0.3.0
    - markdown==3.4.1
    - markupsafe==2.1.1
    - multidict==6.0.2
    - networkx==2.8.8
    - ninja==1.11.1
    - oauthlib==3.2.2
    - opencv-python==4.6.0.66
    - packaging==21.3
    - pathtools==0.1.2
    - piqa==1.2.2
    - promise==2.3
    - protobuf==3.20.3
    - psutil==5.9.4
    - pyasn1==0.4.8
    - pyasn1-modules==0.2.8
    - pyparsing==3.0.9
    - pytorch-lightning==1.8.2
    - pywavelets==1.4.1
    - pyyaml==6.0
    - requests-oauthlib==1.3.1
    - rsa==4.9
    - scikit-image==0.19.3
    - scipy==1.9.3
    - sentry-sdk==1.11.0
    - setproctitle==1.3.2
    - shortuuid==1.0.11
    - smmap==5.0.0
    - soupsieve==2.3.2.post1
    - tensorboard==2.11.0
    - tensorboard-data-server==0.6.1
    - tensorboard-plugin-wit==1.8.1
    - termcolor==2.1.1
    - tifffile==2022.10.10
    - torch-efficient-distloss==0.1.3
    - torch-scatter==2.0.9
    - torchmetrics==0.10.3
    - tqdm==4.64.1
    - wandb==0.13.5
    - werkzeug==2.2.2
    - yarl==1.8.1
    - zipp==3.10.0
prefix: /home/hgx/miniconda3/envs/nerf_factory
```

In fact, I have been trying to set up the environment directly with `conda env create --file nerf_factory.yml`, but it has not succeeded on my side due to network issues. :dizzy_face: I'll keep trying.

Finally, thank you for your excellent work! The clear code logic helped me understand the paper. Thank you for your contributions! :cupid:

zongwave commented 1 year ago

Hi, I also ran into the "loss/psnr" NaN issue when training mipnerf360 on 360_v2. I'm using a 3070 Ti with 8 GB of memory.

https://github.com/kakaobrain/NeRF-Factory/blob/main/src/model/mipnerf360/model.py#L126

```python
raw_density = self.density_layer(x)[..., 0]
```

The raw_density values returned from the MLP are NaN:

```
MLP predict_density raw_density=tensor([[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0', grad_fn=)
```
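One generic way to localize which layer first produces the NaNs (this helper is not part of NeRF-Factory; `model` below stands for the MipNeRF360 `nn.Module`) is to register forward hooks that flag the first non-finite output, or to run a few steps under `torch.autograd.set_detect_anomaly(True)`:

```python
import torch
import torch.nn as nn

def install_nan_hooks(model: nn.Module) -> None:
    """Attach forward hooks that report the first module whose output
    contains NaN/Inf values, to narrow down where training blows up."""
    def make_hook(name: str):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    print(f"[nan-check] non-finite output from module: {name}")
                    break
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage sketch: call install_nan_hooks(model) once before training starts.
# Alternatively, wrap a few training steps in
#   with torch.autograd.set_detect_anomaly(True): ...
# to get a traceback pointing at the op that produced the first NaN gradient.
```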

The memory usage is about 86% during training:

```
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 47%   54C    P2   208W / 290W |   5152MiB /  8192MiB |     86%      Default |
|                               |                      |                  N/A |
```
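If it helps to compare the nvidia-smi numbers with what PyTorch itself has allocated (nvidia-smi also counts the CUDA context and the caching allocator's reserved pool), a small helper along these lines can be called from the training loop; this is a generic sketch, not code from the repository:

```python
import torch

def log_torch_gpu_memory(tag: str = "") -> None:
    """Print the CUDA memory currently allocated by PyTorch tensors, the peak
    so far, and the caching allocator's reserved pool, all in MiB."""
    if not torch.cuda.is_available():
        return
    allocated_mib = torch.cuda.memory_allocated() / 2**20
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"[mem]{tag} allocated={allocated_mib:.0f} MiB "
          f"peak={peak_mib:.0f} MiB reserved={reserved_mib:.0f} MiB")

# Usage sketch: log_torch_gpu_memory(" step 100") inside the training loop.
```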

If I reduce the training batch size to 512, the training runs successfully.
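For reference, the workaround amounts to lowering the batch-related value in the gin config; a minimal sketch reusing the `LitData.batch_size` parameter quoted at the top of this issue (whether other values such as `LitData.chunk` also need lowering is not established in this thread):

```
# Workaround reported in this thread for 8 GB GPUs: shrink the per-step ray batch.
LitData.batch_size = 512
```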

jeongyw12382 commented 1 year ago

Hmm.. in my opinion, if this issue is also observed in the original multinerf implementation, it might indicate that MipNeRF360 requires a large batch size to train successfully, or that MipNeRF360 is sensitive to hyperparameters such as the batch size. Since our implementation is a 're-implementation' based on the original implementation, we cannot address this on our side. According to the reply by @zongwave, adjusting the hyperparameters could address this issue.
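As a generic mitigation beyond hyperparameter tuning (this is not NeRF-Factory's actual training loop; `model`, `optimizer`, and `loss` are placeholders), one can clip gradients, as the `run.grad_max_norm = 0.001` setting quoted above already does, and additionally skip any optimizer step whose loss is already non-finite, so that a single bad batch does not poison the weights:

```python
import torch

def guarded_step(model, optimizer, loss, max_grad_norm: float = 0.001) -> bool:
    """Sketch of a NaN-guarded optimizer step: skip the update when the loss
    is non-finite, otherwise clip gradients and step as usual."""
    if not torch.isfinite(loss):
        print("[warn] non-finite loss, skipping this optimizer step")
        return False
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return True
```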

jeongyw12382 commented 1 year ago

Let us know if you need more help with this issue. If so, please re-open it.

19991105 commented 4 months ago

Hi! I adjusted the batch_size in all the config files to 512 and it solved the loss=NaN problem.