nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

NeRFStudio + Volinga: Wrong ckpt format with NeRFStudio latest commit. #2060

Open smallfly opened 1 year ago

smallfly commented 1 year ago

Describe the bug I have just updated NeRFStudio to the latest commit, and since then the ckpt files (trained using ns-train volinga...) no longer seem to work with the Volinga converter. After a successful upload I get error code 607 with the message "Wrong ckpt format".

To Reproduce Steps to reproduce the behavior:

  1. Train a NeRF using ns-train volinga...
  2. Upload the generated ckpt on Volinga's website
  3. A few seconds after processing starts, it fails

Additional context Even though there is a thread on Volinga's Discord server about this, I thought it could be useful to open an issue here too. It seems that the ckpt structure has changed.

machenmusik commented 1 year ago

Can you try from commit https://github.com/nerfstudio-project/nerfstudio/commit/d4b04376abd46d9bddf8c299d1687177fe027951 and see if that works?

marvo737 commented 1 year ago

As a workaround, I used v0.3.0, and with it the file could be read without failure.

Frivas97 commented 1 year ago

It should be solved now :)

AFMagnon commented 11 months ago

I ran into the same "Wrong ckpt format" error on Volinga.ai.

My environment is as follows:

I took the following steps:

  1. installed anaconda3

  2. created a virtual env with python==3.10.13

  3. installed pip packages

    python -m pip install --upgrade pip
    conda install -c "nvidia/label/cuda-11.8.0" cudatoolkit=11.8
    conda install -c "nvidia/label/cuda-11.8.0" cuda-nvcc
    python -m pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --extra-index-url https://download.pytorch.org/whl/cu118
  4. installed ninja and skipped tiny-cuda-nn. I gave up on tiny-cuda-nn due to an installation error.

  5. installed nerfstudio with python -m pip install nerfstudio

  6. installed volinga-model

    git clone https://github.com/Volinga/volinga-model
    cd volinga-model
    python -m pip install -e . --user

    then I checked that "volinga" appears in the console output after executing ns-train -h

  7. created volinga ckpt file

    ns-train volinga --data data/nerfstudio/poster --vis viewer

    and it successfully created a ckpt file.

  8. uploaded ckpt file to https://volinga.ai/main

  9. upload failed. The upload progress reached 100%, but processing failed. The error information displayed on volinga.ai is shown below:

    Error code: 607
    Info: Wrong ckpt format
    

I would appreciate it if anybody could help me find a solution.

machenmusik commented 11 months ago

I think you may need to install a specific version of nerfstudio to train volinga? I saw this from a quick look... https://github.com/Volinga/volinga-model#1-install-nerfstudio--v032
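
For reference, pinning the nerfstudio release that the Volinga README points to would look roughly like this (a minimal sketch, assuming a fresh environment; the version number comes from the linked README and the volinga-model steps come from the earlier comment in this thread):

    python -m pip install nerfstudio==0.3.2
    git clone https://github.com/Volinga/volinga-model
    cd volinga-model
    python -m pip install -e .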

AFMagnon commented 11 months ago

Thank you, machenmusik.

I re-installed all the packages and pinned nerfstudio==0.3.2. However, ns-train volinga --data data/nerfstudio/poster --vis viewer did not work properly. I think this failure has a different cause (a version mismatch? or a source code problem?).

I would be grateful for any clue about how to solve this problem.

The error message from running ns-train volinga --data data/nerfstudio/poster --vis viewer is pasted below:

[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled tensorboard/wandb event writers
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 4.0080
VanillaPipeline.get_train_loss_dict: 3.9890
Traceback (most recent call last):
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\Scripts\ns-train.exe\__main__.py", line 7, in <module>
    sys.exit(entrypoint())
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\scripts\train.py", line 261, in entrypoint
    main(
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\scripts\train.py", line 246, in main
    launch(
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\scripts\train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\scripts\train.py", line 100, in train_loop
    trainer.train()
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\engine\trainer.py", line 255, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\utils\profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\engine\trainer.py", line 468, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\utils\profiler.py", line 127, in inner
    out = func(*args, **kwargs)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\pipelines\base_pipeline.py", line 281, in get_train_loss_dict
    model_outputs = self._model(ray_bundle)  # train distributed data parallel model if world_size > 1
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\models\base_model.py", line 142, in forward
    return self.get_outputs(ray_bundle)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\models\nerfacto.py", line 278, in get_outputs
    field_outputs = self.field.forward(ray_samples, compute_normals=self.config.predict_normals)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\fields\base_field.py", line 124, in forward
    density, density_embedding = self.get_density(ray_samples)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\fields\nerfacto_field.py", line 216, in get_density
    h = self.mlp_base(positions_flat).view(*ray_samples.frustums.shape, -1)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
    input = module(input)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\field_components\mlp.py", line 178, in forward
    return self.pytorch_fwd(in_tensor)
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\nerfstudio\field_components\mlp.py", line 164, in pytorch_fwd
    for i, layer in enumerate(self.layers):
  File "D:\anaconda3\envs\nerfstudio032_py310_pip\lib\site-packages\torch\nn\modules\module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'MLP' object has no attribute 'layers'

Frivas97 commented 11 months ago

Hello @AFMagnon! If I understood correctly, you are not using tiny-cuda-nn to train your models, right?

AFMagnon commented 11 months ago

Thank you for your comment, Frivas97. That's right, I did not install tiny-cuda-nn. After installing ninja, I failed to install tiny-cuda-nn with the following error message:

 Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
      Cuda compilation tools, release 11.8, V11.8.89
      Build cuda_11.8.r11.8/compiler.31833905_0
      Detected CUDA version 11.8
      Targeting C++ standard 17
      running bdist_wheel
      D:\anaconda3\envs\nerfstudio_py310\lib\site-packages\torch\utils\cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
        warnings.warn(msg.format('we could not find ninja.'))

Without this package, I was still able to run ns-train nerfacto --data data/nerfstudio/poster, so I thought it was not necessary...

Frivas97 commented 11 months ago

If tiny-cuda-nn is not available, NeRFStudio falls back to vanilla PyTorch to train the model. The problem is that the way the model is stored in the ckpt differs depending on whether tiny-cuda-nn is used or not. At the moment, the Volinga exporter only supports the tiny-cuda-nn "format". This is probably the reason for the failure.
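
For reference, one quick way to see which format a given checkpoint ended up in is to look at its state-dict keys (a minimal sketch; the checkpoint path is hypothetical and the key layout is an assumption based on nerfstudio's nerfacto field):

    import torch

    # hypothetical path to a checkpoint written by ns-train volinga
    ckpt_path = "outputs/poster/volinga/<timestamp>/nerfstudio_models/step-000029999.ckpt"
    ckpt = torch.load(ckpt_path, map_location="cpu")

    # nerfstudio checkpoints keep the model weights under the "pipeline" key
    state = ckpt.get("pipeline", ckpt)

    # with tiny-cuda-nn the fused encoding/MLP is stored as a single flat "params" tensor;
    # with the PyTorch fallback you instead see per-layer "layers.N.weight" entries
    for name, tensor in state.items():
        if "mlp_base" in name:
            print(name, tuple(tensor.shape))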

AFMagnon commented 11 months ago

Thank you! To sum up, I should build the following environment, right? nerfstudio==0.3.2, plus a working tiny-cuda-nn installation.

So, I will focus on installing tiny-cuda-nn alongside nerfstudio 0.3.2. I found a useful link for the installation, which reads:

git clone https://github.com/NVlabs/tiny-cuda-nn.git
cd tiny-cuda-nn
git submodule update --init --recursive
python -m pip install ./bindings/torch

but it failed... I'll make an effort somehow...

Frivas97 commented 11 months ago

That's right, that is the setup you need.

machenmusik commented 11 months ago

There have been various periods where the latest version of tiny-cuda-nn was broken and so new installs would fail.

(@Frivas97 if it's not already there, you may want to add the tiny-cuda-nn requirement to the docs and maybe even to the implementation of your method, as it differs from the others...)

machenmusik commented 11 months ago

(@AFMagnon do you have CUDA libraries and Visual Studio installed? CUDA version 11.8 is known to work well, and IIRC community versions of VS may suffice... not sure whether just windows-build-tools does)
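
As a quick sanity check along those lines: from the same prompt used for the install (ideally the x64 Native Tools Command Prompt with the nerfstudio environment activated), each of these should succeed before attempting the tiny-cuda-nn build. This is only a rough checklist under those assumptions, not an official procedure:

    nvcc --version
    where cl
    where ninja
    python -c "import torch; print(torch.version.cuda)"

nvcc should report release 11.8, where cl should find the MSVC compiler (the VS developer prompt puts it on PATH), where ninja should find ninja (otherwise the build falls back to the slow distutils backend), and the last command should print the CUDA version PyTorch was built against (11.8 here).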

AFMagnon commented 11 months ago

@machenmusik, thank you for your suggestion. I have tried to build an environment that satisfies those requirements:

After building the environment, I re-tried the installation:

#install pip packages
python -m pip install --upgrade pip
python -m pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --extra-index-url https://download.pytorch.org/whl/cu118
#tiny-cuda installation
git clone https://github.com/NVlabs/tiny-cuda-nn.git
cd tiny-cuda-nn
git submodule update --init --recursive
python -m pip install ./bindings/torch

But python -m pip install ./bindings/torch failed...

The error message is as follows:

Processing c:\users\UserName\tiny-cuda-nn\bindings\torch
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      <string>:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      Traceback (most recent call last):
        File "C:\Users\UserName\AppData\Local\Programs\Python\Python310\lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
          main()
        File "C:\Users\UserName\AppData\Local\Programs\Python\Python310\lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "C:\Users\UserName\AppData\Local\Programs\Python\Python310\lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "C:\Users\UserName\AppData\Local\Temp\pip-build-env-0c0t_z5q\overlay\Lib\site-packages\setuptools\build_meta.py", line 355, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "C:\Users\UserName\AppData\Local\Temp\pip-build-env-0c0t_z5q\overlay\Lib\site-packages\setuptools\build_meta.py", line 325, in _get_build_requires
          self.run_setup()
        File "C:\Users\UserName\AppData\Local\Temp\pip-build-env-0c0t_z5q\overlay\Lib\site-packages\setuptools\build_meta.py", line 507, in run_setup
          super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
        File "C:\Users\UserName\AppData\Local\Temp\pip-build-env-0c0t_z5q\overlay\Lib\site-packages\setuptools\build_meta.py", line 341, in run_setup
          exec(code, locals())
        File "<string>", line 9, in <module>
      ModuleNotFoundError: No module named 'torch'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

That says "ModuleNotFoundError: No module named 'torch'". I think the torch installation itself was successful, because I entered the Python interpreter and import torch worked.
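
One possible explanation, for what it's worth: pip builds the wheel in an isolated environment that cannot see the torch you already installed, and the traceback shows tiny-cuda-nn's setup.py importing torch, which would produce exactly this error. A workaround worth trying (this is an assumption about the cause, not a confirmed fix) is to disable build isolation:

    cd tiny-cuda-nn
    python -m pip install --no-build-isolation ./bindings/torch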

machenmusik commented 11 months ago

Maybe silly question, but:

Have you tried launching the x64 Native Tools developer command prompt, and then activating the Python virtual environment (or equivalent) that you use for nerfstudio?

To install tiny-cuda-nn, I have generally used what is in the nerfstudio readme... pip install ninja git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

AFMagnon commented 11 months ago

@machenmusik, yes. When I installed tiny-cuda-nn, I tried both pip install ninja git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch and

git clone https://github.com/NVlabs/tiny-cuda-nn.git
cd tiny-cuda-nn
git submodule update --init --recursive
python -m pip install ./bindings/torch

The former command can fail because the fmt and cutlass submodule folders are empty (https://github.com/NVlabs/tiny-cuda-nn/issues/208), and it failed for me as well... The latter procedure does fetch the fmt and cutlass components, but it also failed...

I also tried both installation procedures from the x64 native developer command prompt, but they failed too...

The failures show two kinds of error messages. One is https://github.com/nerfstudio-project/nerfstudio/issues/2060#issuecomment-1782595635 and the other is below:

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [32 lines of output]
      C:\Users\UserName\tiny-cuda-nn\bindings\torch\setup.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
        from pkg_resources import parse_version
      Building PyTorch extension for tiny-cuda-nn version 1.7
      Obtained compute capability 86 from PyTorch
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2022 NVIDIA Corporation
      Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
      Cuda compilation tools, release 11.8, V11.8.89
      Build cuda_11.8.r11.8/compiler.31833905_0
      Detected CUDA version 11.8
      Targeting C++ standard 17
      running bdist_wheel
      D:\anaconda3\envs\nerfstudio_py310\lib\site-packages\torch\utils\cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
        warnings.warn(msg.format('we could not find ninja.'))
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-310
      creating build\lib.win-amd64-cpython-310\tinycudann
      copying tinycudann\modules.py -> build\lib.win-amd64-cpython-310\tinycudann
      copying tinycudann\__init__.py -> build\lib.win-amd64-cpython-310\tinycudann
      running egg_info
      creating tinycudann.egg-info
      writing tinycudann.egg-info\PKG-INFO
      writing dependency_links to tinycudann.egg-info\dependency_links.txt
      writing top-level names to tinycudann.egg-info\top_level.txt
      writing manifest file 'tinycudann.egg-info\SOURCES.txt'
      reading manifest file 'tinycudann.egg-info\SOURCES.txt'
      writing manifest file 'tinycudann.egg-info\SOURCES.txt'
      copying tinycudann\bindings.cpp -> build\lib.win-amd64-cpython-310\tinycudann
      running build_ext
      error: [WinError 2] The system cannot find the file specified.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tinycudann
  Running setup.py clean for tinycudann
Failed to build tinycudann
ERROR: Could not build wheels for tinycudann, which is required to install pyproject.toml-based projects

I'm so depressed...

I would appreciate it if you could share your installation procedure and environment. How did you install tiny-cuda-nn and use volinga?

machenmusik commented 10 months ago

I followed the instructions as listed in the README. I haven't had to reinstall tiny-cuda-nn in a while, though. I am still using Python 3.9.x rather than 3.10.x, with miniconda.
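
For concreteness, that kind of setup looks roughly like the following (a sketch only, combining the README-style one-liner with the package versions mentioned earlier in this thread, not an exact record of my environment):

    conda create -n nerfstudio -y python=3.9
    conda activate nerfstudio
    python -m pip install --upgrade pip
    python -m pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
    python -m pip install ninja git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
    python -m pip install nerfstudio==0.3.2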

AFMagnon commented 10 months ago

I tried with a miniconda3 virtual env using Python 3.9.18. However, the installation of tiny-cuda-nn failed with the same error as above... Could you tell me the version or commit SHA-1 of the tiny-cuda-nn you used?

AFMagnon commented 4 months ago

Thank you!! I re-installed the OS and other software, and then I succeeded in installing tiny-cuda-nn!! The conversion from ckpt file to nvol file was successful!!

Maybe software or library dependencies were making things complicated.