neekolas commented 1 year ago

I am so close to getting this working, but am getting an error when running ns-train.

Steps To Reproduce

This is all being run on a Lambda Cloud VM with an A10 GPU

Upload the Linux install script for OptiX 7.5 to the VM (I could only find 7.5 and 7.7 on NVIDIA's site, not 7.6), run it, and move the outputs to $HOME/optix
Put my source images in the nerf-data folder

Start a NerfStudio Docker container with

sudo docker run --rm -it --gpus all -v $HOME/nerf-data:/workspace -v $HOME/optix:/opt/optix dromni/nerfstudio:0.2.2

(all future commands are run inside the container)


export OPTIX_PATH=/opt/optix
git clone https://github.com/jkulhanek/tetra-nerf.git
python3.10 -m pip install -e tetra-nerf
cd ./tetra-nerf
# Pretty sure these next two lines aren't necessary if you are installing from source, but can't hurt
cmake .
make
export FOLDER=/workspace/inputs/bedroom-gimbal-3
# Works great, albeit slow
python3.10 -m tetranerf.scripts.process_images --path $FOLDER
# Works great
python3.10 -m tetranerf.scripts.triangulate --pointcloud $FOLDER/sparse.ply --output $FOLDER/sparse.th

export NERFSTUDIO_METHOD_CONFIGS="tetra-nerf=tetranerf.nerfstudio.registration:tetranerf"

This is where things fail

ns-train tetra-nerf --pipeline.model.tetrahedra-path $FOLDER/sparse.th minimal-parser --data $FOLDER


## Command output
```shell
user@c3a2197f7437:/workspace/tetra-nerf$ ns-train tetra-nerf --pipeline.model.tetrahedra-path $FOLDER/sparse.th minimal-parser --data $FOLDER
JAX not installed, skipping Mip-NeRF SSIM
Info: Loading method tetra-nerf from environment variable
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
    _target=<class 'nerfstudio.engine.trainer.Trainer'>,
    output_dir=PosixPath('outputs'),
    method_name='tetra-nerf',
    experiment_name=None,
    timestamp='2023-05-05_214214',
    machine=MachineConfig(seed=42, num_gpus=1, num_machines=1, machine_rank=0, dist_url='auto'),
    logging=LoggingConfig(
        relative_log_dir=PosixPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
                <EventName.ETA: 'ETA (time)'>
            ),
            max_log_size=10
        ),
        profiler='basic'
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        websocket_port=None,
        websocket_port_default=7007,
        websocket_host='0.0.0.0',
        num_rays_per_chunk=32768,
        max_num_display_images=512,
        quit_on_train_completion=False,
        image_format='jpeg',
        jpeg_quality=90
    ),
    pipeline=VanillaPipelineConfig(
        _target=<class 'tetranerf.nerfstudio.pipeline.TetrahedraNerfPipeline'>,
        datamanager=VanillaDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.base_datamanager.VanillaDataManager'>,
            data=None,
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='off',
                position_noise_std=0.0,
                orientation_noise_std=0.0,
                optimizer=AdamOptimizerConfig(
                    _target=<class 'torch.optim.adam.Adam'>,
                    lr=0.0006,
                    eps=1e-15,
                    max_norm=None,
                    weight_decay=0
                ),
                scheduler=ExponentialDecaySchedulerConfig(
                    _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                    lr_pre_warmup=1e-08,
                    lr_final=None,
                    warmup_steps=0,
                    max_steps=10000,
                    ramp='cosine'
                ),
                param_group='camera_opt'
            ),
            dataparser=MinimalDataParserConfig(
                _target=<class 'nerfstudio.data.dataparsers.minimal_dataparser.MinimalDataParser'>,
                data=PosixPath('/workspace/inputs/bedroom-gimbal-3')
            ),
            train_num_rays_per_batch=4096,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=4096,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            camera_res_scale_factor=1.0,
            patch_size=1
        ),
        model=TetrahedraNerfConfig(
            _target=<class 'tetranerf.nerfstudio.model.TetrahedraNerf'>,
            enable_collider=True,
            collider_params={'near_plane': 2.0, 'far_plane': 6.0},
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=4096,
            tetrahedra_path=PosixPath('/workspace/inputs/bedroom-gimbal-3/sparse.th'),
            num_tetrahedra_vertices=245069,
            num_tetrahedra_cells=1525505,
            max_intersected_triangles=512,
            num_samples=128,
            num_fine_samples=128,
            use_biased_sampler=True,
            field_dim=64,
            num_color_layers=1,
            num_density_layers=3,
            hidden_size=128,
            input_fourier_frequencies=0,
            initialize_colors=True
        )
    ),
    optimizers={
        'fields': {
            'optimizer': RAdamOptimizerConfig(
                _target=<class 'torch.optim.radam.RAdam'>,
                lr=0.001,
                eps=1e-08,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': ExponentialDecaySchedulerConfig(
                _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                lr_pre_warmup=1e-08,
                lr_final=0.0001,
                warmup_steps=0,
                max_steps=300000,
                ramp='cosine'
            )
        }
    },
    vis='wandb',
    data=None,
    relative_model_dir=PosixPath('nerfstudio_models'),
    steps_per_save=25000,
    steps_per_eval_batch=1000,
    steps_per_eval_image=2000,
    steps_per_eval_all_images=50000,
    max_num_iterations=300000,
    mixed_precision=False,
    use_grad_scaler=False,
    save_only_latest_checkpoint=True,
    load_dir=None,
    load_step=None,
    load_config=None,
    log_gradients=False
)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[21:42:14] Saving config to: outputs/unnamed/tetra-nerf/2023-05-05_214214/config.yml            experiment_config.py:129
[21:42:14] Saving checkpoints to: outputs/unnamed/tetra-nerf/2023-05-05_214214/nerfstudio_models          trainer.py:139
Setting up training dataset...
Caching all 531 images.
Warning: If you run out of memory, try reducing the number of images to sample from.
Setting up evaluation dataset...
Caching all 76 images.
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /home/user/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████| 233M/233M [00:00<00:00, 430MB/s]
No checkpoints to load, training from scratch
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.1
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
logging events to: outputs/unnamed/tetra-nerf/2023-05-05_214214
Tetrahedra initialized from file /workspace/inputs/bedroom-gimbal-3/sparse.th:
    Num points: 245069
    Num tetrahedra: 1525505
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 0.7833              
VanillaPipeline.get_train_loss_dict: 0.7832              
Traceback (most recent call last):
  File "/home/user/.local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/user/nerfstudio/scripts/train.py", line 247, in entrypoint
    main(
  File "/home/user/nerfstudio/scripts/train.py", line 233, in main
    launch(
  File "/home/user/nerfstudio/scripts/train.py", line 172, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/user/nerfstudio/scripts/train.py", line 87, in train_loop
    trainer.train()
  File "/home/user/nerfstudio/nerfstudio/engine/trainer.py", line 239, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/home/user/nerfstudio/nerfstudio/utils/profiler.py", line 93, in inner
    out = func(*args, **kwargs)
  File "/home/user/nerfstudio/nerfstudio/engine/trainer.py", line 433, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/home/user/nerfstudio/nerfstudio/utils/profiler.py", line 93, in inner
    out = func(*args, **kwargs)
  File "/home/user/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 278, in get_train_loss_dict
    model_outputs = self.model(ray_bundle)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/nerfstudio/nerfstudio/models/base_model.py", line 140, in forward
    return self.get_outputs(ray_bundle)
  File "/workspace/tetra-nerf/tetranerf/nerfstudio/model.py", line 422, in get_outputs
    tracer = self.get_tetrahedra_tracer()
  File "/workspace/tetra-nerf/tetranerf/nerfstudio/model.py", line 320, in get_tetrahedra_tracer
    self._tetrahedra_tracer = TetrahedraTracer(device)
RuntimeError: OPTIX_ERROR_LIBRARY_NOT_FOUND: Optix call 'optixInit()' failed: /workspace/tetra-nerf/src/tetrahedra_tracer.cpp:148)
```

jkulhanek commented 1 year ago

This could be an old NVIDIA driver. What is the driver version on the host system? Is it supported by OptiX 7.5?

jkulhanek commented 1 year ago

I also believe you need to add "-e NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility" to your docker run command. Can you try it and see if it helps?

neekolas commented 1 year ago

Driver Version: 525.85.12, which looks fine for Optix 7.5.

I just tried setting the NVIDIA_DRIVER_CAPABILITIES environment variable and it didn't have any effect.

Any other ideas?
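A minimal sketch of the driver check discussed above: compare the installed driver version against the minimum a given OptiX release needs. The minimum of 515 used for OptiX 7.5 here is an assumption and should be confirmed against the OptiX 7.5 release notes; the function names are hypothetical.

```python
def parse_driver_version(text: str) -> tuple[int, ...]:
    """Turn a version string such as '525.85.12' into (525, 85, 12)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_at_least(installed: str, minimum: str) -> bool:
    """True if the installed driver version is >= the required minimum."""
    return parse_driver_version(installed) >= parse_driver_version(minimum)


# ASSUMPTION: OptiX 7.5 needs an R515 or newer driver. Verify this value
# against the release notes of the OptiX version you actually installed.
ASSUMED_MIN_DRIVER_FOR_OPTIX_7_5 = "515"

# Driver version reported in this thread.
print(driver_is_at_least("525.85.12", ASSUMED_MIN_DRIVER_FOR_OPTIX_7_5))
```

The installed version can be read on the host with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. Note that a new enough driver version is not sufficient on its own: the driver's libnvoptix library must also be present on the host.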

jkulhanek commented 1 year ago

Can you check whether you get the same error with the Dockerfile in this repo?

neekolas commented 1 year ago

Same result after cloning this repo and building the image with Optix 7.5 via

sudo docker build -t tetra-nerf:latest --build-context optix=$HOME/optix .

[+] Building 41.9s (17/17) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                                                                                                               0.0s
 => => transferring dockerfile: 2.38kB                                                                                                                                                                                                                                             0.0s
 => [internal] load .dockerignore                                                                                                                                                                                                                                                  0.0s
 => => transferring context: 2B                                                                                                                                                                                                                                                    0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.1-devel-ubuntu22.04                                                                                                                                                                                                    0.7s
 => [context optix] load .dockerignore                                                                                                                                                                                                                                             0.0s
 => => transferring optix: 2B                                                                                                                                                                                                                                                      0.0s
 => [auth] nvidia/cuda:pull token for registry-1.docker.io                                                                                                                                                                                                                         0.0s
 => [context optix] load from client                                                                                                                                                                                                                                               0.0s
 => => transferring optix: 35.18kB                                                                                                                                                                                                                                                 0.0s
 => [internal] load build context                                                                                                                                                                                                                                                  0.0s
 => => transferring context: 1.91MB                                                                                                                                                                                                                                                0.0s
 => [stage-0 1/9] FROM docker.io/nvidia/cuda:11.7.1-devel-ubuntu22.04@sha256:247e6d7676f8af28ed87f343620505d823dc86c22570ead2ac59049a2583534f                                                                                                                                      0.0s
 => CACHED [stage-0 2/9] COPY --from=optix . /opt/optix                                                                                                                                                                                                                            0.0s
 => CACHED [stage-0 3/9] RUN if [ ! -e /opt/optix/include/optix.h ]; then echo "Could not find the OptiX library. Please install the Optix SDK and add the following argument to the buildx command: --build-context optix=/path/to/the/SDK"; exit 1; fi &&     apt-get update &&  0.0s
 => CACHED [stage-0 4/9] RUN export PIP_ROOT_USER_ACTION=ignore &&     pip install --upgrade pip &&     pip uninstall -y functorch &&     pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html &&     pip install nerfst  0.0s
 => CACHED [stage-0 5/9] RUN pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch                                                                                                                                                                   0.0s
 => CACHED [stage-0 6/9] RUN adduser --disabled-password user --gecos "First Last,RoomNumber,WorkPhone,HomePhone"                                                                                                                                                                  0.0s
 => CACHED [stage-0 7/9] WORKDIR /home/user                                                                                                                                                                                                                                        0.0s
 => [stage-0 8/9] COPY --chown=user . /home/user/tetra-nerf                                                                                                                                                                                                                        0.0s
 => [stage-0 9/9] RUN pip install -e tetra-nerf                                                                                                                                                                                                                                   41.1s
 => exporting to image                                                                                                                                                                                                                                                             0.2s
 => => exporting layers                                                                                                                                                                                                                                                            0.2s
 => => writing image sha256:e44e6a6b2810dec484893b0014e462e08e1a634c5e76a2558f3e81b861372a3f                                                                                                                                                                                       0.0s
 => => naming to docker.io/library/tetra-nerf:latest

sudo docker run --rm -it -v $HOME/nerf-data:/workspace -e NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility --gpus all tetra-nerf:latest

==========
== CUDA ==
==========

CUDA Version 11.7.1

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

user@dbe3cc25cd7c:~$ export NERFSTUDIO_METHOD_CONFIGS="tetra-nerf=tetranerf.nerfstudio.registration:tetranerf"
user@dbe3cc25cd7c:~$ export FOLDER=/workspace/inputs/bedroom-gimbal-3
user@dbe3cc25cd7c:~$ ns-train tetra-nerf --pipeline.model.tetrahedra-path $FOLDER/sparse.th minimal-parser --data $FOLDER
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
    _target=<class 'nerfstudio.engine.trainer.Trainer'>,
    output_dir=PosixPath('outputs'),
    method_name='tetra-nerf',
    experiment_name=None,
    timestamp='2023-05-06_173412',
    machine=MachineConfig(seed=42, num_gpus=1, num_machines=1, machine_rank=0, dist_url='auto'),
    logging=LoggingConfig(
        relative_log_dir=PosixPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
                <EventName.ETA: 'ETA (time)'>
            ),
            max_log_size=10
        ),
        enable_profiler=True
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        websocket_port=None,
        websocket_port_default=7007,
        num_rays_per_chunk=32768,
        max_num_display_images=512,
        quit_on_train_completion=False,
        image_format='jpeg',
        jpeg_quality=90
    ),
    pipeline=VanillaPipelineConfig(
        _target=<class 'tetranerf.nerfstudio.pipeline.TetrahedraNerfPipeline'>,
        datamanager=VanillaDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.base_datamanager.VanillaDataManager'>,
            data=None,
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='off',
                position_noise_std=0.0,
                orientation_noise_std=0.0,
                optimizer=AdamOptimizerConfig(
                    _target=<class 'torch.optim.adam.Adam'>,
                    lr=0.0006,
                    eps=1e-15,
                    max_norm=None,
                    weight_decay=0
                ),
                scheduler=ExponentialDecaySchedulerConfig(
                    _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                    lr_pre_warmup=1e-08,
                    lr_final=None,
                    warmup_steps=0,
                    max_steps=10000,
                    ramp='cosine'
                ),
                param_group='camera_opt'
            ),
            dataparser=MinimalDataParserConfig(
                _target=<class 'nerfstudio.data.dataparsers.minimal_dataparser.MinimalDataParser'>,
                data=PosixPath('/workspace/inputs/bedroom-gimbal-3')
            ),
            train_num_rays_per_batch=4096,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=4096,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            camera_res_scale_factor=1.0,
            patch_size=1
        ),
        model=TetrahedraNerfConfig(
            _target=<class 'tetranerf.nerfstudio.model.TetrahedraNerf'>,
            enable_collider=True,
            collider_params={'near_plane': 2.0, 'far_plane': 6.0},
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=4096,
            tetrahedra_path=PosixPath('/workspace/inputs/bedroom-gimbal-3/sparse.th'),
            num_tetrahedra_vertices=245069,
            num_tetrahedra_cells=1525505,
            max_intersected_triangles=512,
            num_samples=128,
            num_fine_samples=128,
            use_biased_sampler=True,
            field_dim=64,
            num_color_layers=1,
            num_density_layers=3,
            hidden_size=128,
            input_fourier_frequencies=0,
            initialize_colors=True
        )
    ),
    optimizers={
        'fields': {
            'optimizer': RAdamOptimizerConfig(
                _target=<class 'torch.optim.radam.RAdam'>,
                lr=0.001,
                eps=1e-08,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': ExponentialDecaySchedulerConfig(
                _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                lr_pre_warmup=1e-08,
                lr_final=0.0001,
                warmup_steps=0,
                max_steps=300000,
                ramp='cosine'
            )
        }
    },
    vis='wandb',
    data=None,
    relative_model_dir=PosixPath('nerfstudio_models'),
    steps_per_save=25000,
    steps_per_eval_batch=1000,
    steps_per_eval_image=2000,
    steps_per_eval_all_images=50000,
    max_num_iterations=300000,
    mixed_precision=False,
    save_only_latest_checkpoint=True,
    load_dir=None,
    load_step=None,
    load_config=None,
    log_gradients=False
)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[17:34:12] Saving config to: outputs/unnamed/tetra-nerf/2023-05-06_173412/config.yml            experiment_config.py:129
[17:34:12] Saving checkpoints to: outputs/unnamed/tetra-nerf/2023-05-06_173412/nerfstudio_models          trainer.py:132
Setting up training dataset...
Caching all 531 images.
Warning: If you run out of memory, try reducing the number of images to sample from.
Setting up evaluation dataset...
Caching all 76 images.
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /home/user/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 233M/233M [00:02<00:00, 86.9MB/s]
No checkpoints to load, training from scratch
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.1
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
logging events to: outputs/unnamed/tetra-nerf/2023-05-06_173412
Tetrahedra initialized from file /workspace/inputs/bedroom-gimbal-3/sparse.th:
    Num points: 245069
    Num tetrahedra: 1525505
Printing profiling stats, from longest to shortest duration in seconds
Traceback (most recent call last):
  File "/usr/local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.10/dist-packages/scripts/train.py", line 247, in entrypoint
    main(
  File "/usr/local/lib/python3.10/dist-packages/scripts/train.py", line 233, in main
    launch(
  File "/usr/local/lib/python3.10/dist-packages/scripts/train.py", line 172, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/usr/local/lib/python3.10/dist-packages/scripts/train.py", line 87, in train_loop
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 232, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 43, in wrapper
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 406, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 43, in wrapper
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/pipelines/base_pipeline.py", line 278, in get_train_loss_dict
    model_outputs = self.model(ray_bundle)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/models/base_model.py", line 140, in forward
    return self.get_outputs(ray_bundle)
  File "/home/user/tetra-nerf/tetranerf/nerfstudio/model.py", line 422, in get_outputs
    tracer = self.get_tetrahedra_tracer()
  File "/home/user/tetra-nerf/tetranerf/nerfstudio/model.py", line 320, in get_tetrahedra_tracer
    self._tetrahedra_tracer = TetrahedraTracer(device)
RuntimeError: OPTIX_ERROR_LIBRARY_NOT_FOUND: Optix call 'optixInit()' failed: /home/user/tetra-nerf/src/tetrahedra_tracer.cpp:148)

wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync outputs/unnamed/tetra-nerf/2023-05-06_173412/wandb/offline-run-20230506_173422-vtpyj9uj
wandb: Find logs at: outputs/unnamed/tetra-nerf/2023-05-06_173412/wandb/offline-run-20230506_173422-vtpyj9uj/logs
user@dbe3cc25cd7c:~$ ls /opt/optix
SDK  doc  include
user@dbe3cc25cd7c:~$

jkulhanek commented 1 year ago

Ok, can you post the output of running with strace as suggested here? https://forums.developer.nvidia.com/t/optix-error-failed-to-load-optix-library/70671/21 Also, what gpu do you use?

jkulhanek commented 1 year ago

What cuda_compute does your GPU support?

neekolas commented 1 year ago

My GPU supports cuda_compute 8.6 (NVIDIA A10).

I ran ns-train tetra-nerf ... with strace. These were the most interesting log lines I could find.

futex(0x7f4c1339b518, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/usr/local/lib/python3.10/dist-packages/torch/lib/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 34
newfstatat(34, "", {st_mode=S_IFREG|0644, st_size=42549, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 42549, PROT_READ, MAP_PRIVATE, 34, 0) = 0x7f4c3243f000
close(34)                               = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
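Scanning a long strace log by eye is tedious. The following is a rough sketch of how the relevant lines could be picked out automatically: it lists shared libraries for which every attempted `openat` call failed. The regular expression and the function name are illustrative assumptions and only cover `openat` lines shaped like the ones above.

```python
import os
import re
from collections import defaultdict

# Matches lines like: openat(AT_FDCWD, "/path/lib.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT ...
OPENAT = re.compile(r'openat\(\w+, "([^"]+)", [^)]*\)\s*=\s*(-?\d+)')


def missing_libraries(strace_lines):
    """Map each shared library that was never opened successfully
    to the list of paths the dynamic loader tried."""
    tried = defaultdict(list)
    found = set()
    for line in strace_lines:
        match = OPENAT.search(line)
        if match is None:
            continue
        path, result = match.group(1), int(match.group(2))
        name = os.path.basename(path)
        if ".so" not in name:
            continue  # not a shared library
        if result >= 0:
            found.add(name)  # a non-negative file descriptor means success
        else:
            tried[name].append(path)
    return {name: paths for name, paths in tried.items() if name not in found}


sample = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 34',
    'openat(AT_FDCWD, "/lib/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)',
    'openat(AT_FDCWD, "/usr/lib/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)',
]
for name, paths in missing_libraries(sample).items():
    print(name, "was not found in:", ", ".join(paths))
```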

jkulhanek commented 1 year ago

Yes! It is missing the optix library libnvoptix (part of the nvidia driver, not the SDK). Same as here: https://github.com/NVIDIA/nvidia-container-toolkit/issues/187 What docker setup do you use? What version of docker and what version of nvidia container toolkit?
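A quick way to test for this condition without running a full training job is to ask the dynamic loader directly. This is a minimal sketch, assuming Linux and Python's standard `ctypes` module; the helper name `can_load` is hypothetical. It can be run both on the host and inside the container to see where the library goes missing.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can find and open the library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


# libcuda.so.1 and libnvoptix.so.1 both ship with the NVIDIA driver,
# not with the CUDA toolkit or the OptiX SDK.
for name in ("libcuda.so.1", "libnvoptix.so.1"):
    print(name, "loadable" if can_load(name) else "NOT loadable")
```

If libcuda.so.1 loads but libnvoptix.so.1 does not, the GPU is visible but the OptiX part of the driver is unavailable, which matches the strace output above.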

neekolas commented 1 year ago

I'm using Docker 23.0.1

nvidia-container-cli -V
cli-version: 1.12.0
lib-version: 1.12.0
build date: 2023-02-13T22:52+00:00
build revision:
build compiler: x86_64-linux-gnu-gcc-9 9.4.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -Wdate-time -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -g -O2 -fdebug-prefix-map=/build/libnvidia-container-QG7FJq/libnvidia-container-1.12.0+dfsg=. -fstack-protector-strong -Wformat -Werror=format-security -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections -Wl,-Bsymbolic-functions -Wl,-z,relro

jkulhanek commented 1 year ago

Can you check if the library exists on your host system? It should have been a part of the nvidia driver installation.

neekolas commented 1 year ago

Hmmm. find / -name libnvoptix.so.1 doesn't come up with any results. I'll try and reinstall/update the nvidia driver and see if that helps. The drivers are just whatever comes standard in Lambda Cloud instances.

neekolas commented 1 year ago

After re-installing the Nvidia drivers on the host via apt-get, it seems to be working!

Thanks so much for all the help. I'll let you know how the results look in 13 hours or so.

jkulhanek commented 1 year ago

I am glad you were able to find the source of the error. Thank you for investing your time into debugging.

liuxiaozhu01 commented 1 year ago

I ran into this problem too. Does libnvoptix.so.1 appear once the NVIDIA driver is reinstalled? I don't have permission to modify the driver 😭 Are there any other solutions?

jkulhanek commented 1 year ago

I am sorry, but I believe the only solution is fixing the driver. Manually copying the libnvoptix library (matching version) and adding it to the library path could also work, but that is unlikely.
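For anyone attempting the manual route, here is a small sketch that looks for a copy of the library whose version suffix matches the installed driver. It assumes the usual driver packaging, where the real file is named `libnvoptix.so.<driver version>` and `libnvoptix.so.1` is a symlink to it; that naming and the helper name are assumptions, so check them against your own driver installation.

```python
from pathlib import Path


def find_versioned_libnvoptix(search_dirs, driver_version):
    """Return paths of libnvoptix files whose suffix equals driver_version.

    ASSUMPTION: the driver installs the library as
    libnvoptix.so.<driver version>, e.g. libnvoptix.so.525.85.12.
    """
    wanted = f"libnvoptix.so.{driver_version}"
    return [
        Path(directory) / wanted
        for directory in search_dirs
        if (Path(directory) / wanted).is_file()
    ]


# Directories taken from the loader search shown in the strace output.
candidates = ["/usr/lib/x86_64-linux-gnu", "/lib/x86_64-linux-gnu", "/usr/lib"]
print(find_versioned_libnvoptix(candidates, "525.85.12"))
```

An empty list means no copy matching that driver version was found in those directories. A copy taken from a different driver version is unlikely to work.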

liuxiaozhu01 commented 1 year ago

Thank you for your reply. I will keep trying.

ThomasWarn commented 1 year ago

Same issue with Docker. Here's how I fixed it, for anyone looking here:

Installed OptiX manually from https://developer.nvidia.com/designworks/optix/download. This downloads NVIDIA-OptiX-SDK-7.7.0-linux64-x86_64.sh, which extracts into the folder "NVIDIA-OptiX-SDK-7.7.0-linux64-x86_64".
Uninstalled and then reinstalled the NVIDIA drivers.
Pointed the container at the location of my data and set the OptiX install path:

docker run -v C:/Users/Thomas/Desktop/PythonProjects/Nerf/Nerfstudio/Colmap_Data/TempDocker:/workspace/mydata --rm -it --gpus all -p 7007:7007 kulhanek/tetra-nerf:latest 
export OPTIX_PATH=/workspace/mydata/NVIDIA-OptiX-SDK-7.7.0-linux64-x86_64
export NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility

jkulhanek / tetra-nerf

OPTIX_ERROR_LIBRARY_NOT_FOUND when running NS Train #5
