Closed neekolas closed 1 year ago
This could be caused by an old NVIDIA driver. What is the driver version on the host system? Is it supported by OptiX 7.5?
I also believe you need to add "-e NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility" to your run command. Can you try it and see if it helps?
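A minimal sketch for checking, from inside the container, whether the variable actually carries the capabilities suggested above. The function name and the idea of treating these three capabilities as "required" are assumptions made for illustration. When the variable is unset the sketch reports everything as missing, because it cannot see which defaults the container runtime applies.

```python
import os

# Capabilities suggested in this thread; treating them as "required" is an assumption.
REQUIRED_CAPABILITIES = {"graphics", "compute", "utility"}


def missing_capabilities(value, required=REQUIRED_CAPABILITIES):
    """Return the required capabilities absent from an NVIDIA_DRIVER_CAPABILITIES value."""
    if value is None:
        # Variable not set: report everything, since runtime defaults are unknown here.
        return set(required)
    present = {item.strip() for item in value.split(",") if item.strip()}
    if "all" in present:
        # "all" enables every driver capability.
        return set()
    return set(required) - present


if __name__ == "__main__":
    missing = missing_capabilities(os.environ.get("NVIDIA_DRIVER_CAPABILITIES"))
    print("missing capabilities:", sorted(missing) or "none")
```

An empty result only shows that the variable is set as intended; it does not prove that the driver libraries were actually mounted into the container.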
Driver Version: 525.85.12, which looks fine for OptiX 7.5.
I just tried setting the NVIDIA_DRIVER_CAPABILITIES environment variable and it didn't have any effect.
Any other ideas?
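For anyone who wants to script the driver check, here is a rough sketch. The minimum version used (an R515-series driver for OptiX 7.5) is an assumption taken from memory of the SDK release notes and should be confirmed there. The helper names are hypothetical.

```python
import subprocess

# Assumption: OptiX 7.5 needs an R515-series driver or newer; confirm against
# the release notes of the SDK version you actually build with.
MIN_DRIVER_FOR_OPTIX_7_5 = (515,)


def parse_driver_version(text):
    """Turn a string like '525.85.12' into a tuple of ints: (525, 85, 12)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(version_text, minimum=MIN_DRIVER_FOR_OPTIX_7_5):
    """Compare the parsed driver version against the minimum, component by component."""
    return parse_driver_version(version_text) >= minimum


def query_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()[0].strip()
```

Calling `driver_is_new_enough(query_driver_version())` on the host gives a quick yes/no. A sufficiently new version number does not by itself guarantee that every driver library is installed.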
Can you check if you get the same error with the Dockerfile in this repo?
Same result after cloning this repo and building the image with Optix 7.5 via
sudo docker build -t tetra-nerf:latest --build-context optix=$HOME/optix .
[+] Building 41.9s (17/17) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 2.38kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/nvidia/cuda:11.7.1-devel-ubuntu22.04 0.7s
=> [context optix] load .dockerignore 0.0s
=> => transferring optix: 2B 0.0s
=> [auth] nvidia/cuda:pull token for registry-1.docker.io 0.0s
=> [context optix] load from client 0.0s
=> => transferring optix: 35.18kB 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 1.91MB 0.0s
=> [stage-0 1/9] FROM docker.io/nvidia/cuda:11.7.1-devel-ubuntu22.04@sha256:247e6d7676f8af28ed87f343620505d823dc86c22570ead2ac59049a2583534f 0.0s
=> CACHED [stage-0 2/9] COPY --from=optix . /opt/optix 0.0s
=> CACHED [stage-0 3/9] RUN if [ ! -e /opt/optix/include/optix.h ]; then echo "Could not find the OptiX library. Please install the Optix SDK and add the following argument to the buildx command: --build-context optix=/path/to/the/SDK"; exit 1; fi && apt-get update && 0.0s
=> CACHED [stage-0 4/9] RUN export PIP_ROOT_USER_ACTION=ignore && pip install --upgrade pip && pip uninstall -y functorch && pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html && pip install nerfst 0.0s
=> CACHED [stage-0 5/9] RUN pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch 0.0s
=> CACHED [stage-0 6/9] RUN adduser --disabled-password user --gecos "First Last,RoomNumber,WorkPhone,HomePhone" 0.0s
=> CACHED [stage-0 7/9] WORKDIR /home/user 0.0s
=> [stage-0 8/9] COPY --chown=user . /home/user/tetra-nerf 0.0s
=> [stage-0 9/9] RUN pip install -e tetra-nerf 41.1s
=> exporting to image 0.2s
=> => exporting layers 0.2s
=> => writing image sha256:e44e6a6b2810dec484893b0014e462e08e1a634c5e76a2558f3e81b861372a3f 0.0s
=> => naming to docker.io/library/tetra-nerf:latest
sudo docker run --rm -it -v $HOME/nerf-data:/workspace -e NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility --gpus all tetra-nerf:latest
==========
== CUDA ==
==========
CUDA Version 11.7.1
Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
user@dbe3cc25cd7c:~$ export NERFSTUDIO_METHOD_CONFIGS="tetra-nerf=tetranerf.nerfstudio.registration:tetranerf"
user@dbe3cc25cd7c:~$ export FOLDER=/workspace/inputs/bedroom-gimbal-3
user@dbe3cc25cd7c:~$ ns-train tetra-nerf --pipeline.model.tetrahedra-path $FOLDER/sparse.th minimal-parser --data $FOLDER
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
_target=<class 'nerfstudio.engine.trainer.Trainer'>,
output_dir=PosixPath('outputs'),
method_name='tetra-nerf',
experiment_name=None,
timestamp='2023-05-06_173412',
machine=MachineConfig(seed=42, num_gpus=1, num_machines=1, machine_rank=0, dist_url='auto'),
logging=LoggingConfig(
relative_log_dir=PosixPath('.'),
steps_per_log=10,
max_buffer_size=20,
local_writer=LocalWriterConfig(
_target=<class 'nerfstudio.utils.writer.LocalWriter'>,
enable=True,
stats_to_track=(
<EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
<EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
<EventName.CURR_TEST_PSNR: 'Test PSNR'>,
<EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
<EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
<EventName.ETA: 'ETA (time)'>
),
max_log_size=10
),
enable_profiler=True
),
viewer=ViewerConfig(
relative_log_filename='viewer_log_filename.txt',
websocket_port=None,
websocket_port_default=7007,
num_rays_per_chunk=32768,
max_num_display_images=512,
quit_on_train_completion=False,
image_format='jpeg',
jpeg_quality=90
),
pipeline=VanillaPipelineConfig(
_target=<class 'tetranerf.nerfstudio.pipeline.TetrahedraNerfPipeline'>,
datamanager=VanillaDataManagerConfig(
_target=<class 'nerfstudio.data.datamanagers.base_datamanager.VanillaDataManager'>,
data=None,
camera_optimizer=CameraOptimizerConfig(
_target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
mode='off',
position_noise_std=0.0,
orientation_noise_std=0.0,
optimizer=AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.0006,
eps=1e-15,
max_norm=None,
weight_decay=0
),
scheduler=ExponentialDecaySchedulerConfig(
_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
lr_pre_warmup=1e-08,
lr_final=None,
warmup_steps=0,
max_steps=10000,
ramp='cosine'
),
param_group='camera_opt'
),
dataparser=MinimalDataParserConfig(
_target=<class 'nerfstudio.data.dataparsers.minimal_dataparser.MinimalDataParser'>,
data=PosixPath('/workspace/inputs/bedroom-gimbal-3')
),
train_num_rays_per_batch=4096,
train_num_images_to_sample_from=-1,
train_num_times_to_repeat_images=-1,
eval_num_rays_per_batch=4096,
eval_num_images_to_sample_from=-1,
eval_num_times_to_repeat_images=-1,
eval_image_indices=(0,),
camera_res_scale_factor=1.0,
patch_size=1
),
model=TetrahedraNerfConfig(
_target=<class 'tetranerf.nerfstudio.model.TetrahedraNerf'>,
enable_collider=True,
collider_params={'near_plane': 2.0, 'far_plane': 6.0},
loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
eval_num_rays_per_chunk=4096,
tetrahedra_path=PosixPath('/workspace/inputs/bedroom-gimbal-3/sparse.th'),
num_tetrahedra_vertices=245069,
num_tetrahedra_cells=1525505,
max_intersected_triangles=512,
num_samples=128,
num_fine_samples=128,
use_biased_sampler=True,
field_dim=64,
num_color_layers=1,
num_density_layers=3,
hidden_size=128,
input_fourier_frequencies=0,
initialize_colors=True
)
),
optimizers={
'fields': {
'optimizer': RAdamOptimizerConfig(
_target=<class 'torch.optim.radam.RAdam'>,
lr=0.001,
eps=1e-08,
max_norm=None,
weight_decay=0
),
'scheduler': ExponentialDecaySchedulerConfig(
_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
lr_pre_warmup=1e-08,
lr_final=0.0001,
warmup_steps=0,
max_steps=300000,
ramp='cosine'
)
}
},
vis='wandb',
data=None,
relative_model_dir=PosixPath('nerfstudio_models'),
steps_per_save=25000,
steps_per_eval_batch=1000,
steps_per_eval_image=2000,
steps_per_eval_all_images=50000,
max_num_iterations=300000,
mixed_precision=False,
save_only_latest_checkpoint=True,
load_dir=None,
load_step=None,
load_config=None,
log_gradients=False
)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[17:34:12] Saving config to: outputs/unnamed/tetra-nerf/2023-05-06_173412/config.yml experiment_config.py:129
[17:34:12] Saving checkpoints to: outputs/unnamed/tetra-nerf/2023-05-06_173412/nerfstudio_models trainer.py:132
Setting up training dataset...
Caching all 531 images.
Warning: If you run out of memory, try reducing the number of images to sample from.
Setting up evaluation dataset...
Caching all 76 images.
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /home/user/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 233M/233M [00:02<00:00, 86.9MB/s]
No checkpoints to load, training from scratch
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.1
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
logging events to: outputs/unnamed/tetra-nerf/2023-05-06_173412
Tetrahedra initialized from file /workspace/inputs/bedroom-gimbal-3/sparse.th:
Num points: 245069
Num tetrahedra: 1525505
Printing profiling stats, from longest to shortest duration in seconds
Traceback (most recent call last):
File "/usr/local/bin/ns-train", line 8, in <module>
sys.exit(entrypoint())
File "/usr/local/lib/python3.10/dist-packages/scripts/train.py", line 247, in entrypoint
main(
File "/usr/local/lib/python3.10/dist-packages/scripts/train.py", line 233, in main
launch(
File "/usr/local/lib/python3.10/dist-packages/scripts/train.py", line 172, in launch
main_func(local_rank=0, world_size=world_size, config=config)
File "/usr/local/lib/python3.10/dist-packages/scripts/train.py", line 87, in train_loop
trainer.train()
File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 232, in train
loss, loss_dict, metrics_dict = self.train_iteration(step)
File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 43, in wrapper
ret = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 406, in train_iteration
_, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
File "/usr/local/lib/python3.10/dist-packages/nerfstudio/utils/profiler.py", line 43, in wrapper
ret = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nerfstudio/pipelines/base_pipeline.py", line 278, in get_train_loss_dict
model_outputs = self.model(ray_bundle)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nerfstudio/models/base_model.py", line 140, in forward
return self.get_outputs(ray_bundle)
File "/home/user/tetra-nerf/tetranerf/nerfstudio/model.py", line 422, in get_outputs
tracer = self.get_tetrahedra_tracer()
File "/home/user/tetra-nerf/tetranerf/nerfstudio/model.py", line 320, in get_tetrahedra_tracer
self._tetrahedra_tracer = TetrahedraTracer(device)
RuntimeError: OPTIX_ERROR_LIBRARY_NOT_FOUND: Optix call 'optixInit()' failed: /home/user/tetra-nerf/src/tetrahedra_tracer.cpp:148)
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync outputs/unnamed/tetra-nerf/2023-05-06_173412/wandb/offline-run-20230506_173422-vtpyj9uj
wandb: Find logs at: outputs/unnamed/tetra-nerf/2023-05-06_173412/wandb/offline-run-20230506_173422-vtpyj9uj/logs
user@dbe3cc25cd7c:~$ ls /opt/optix
SDK doc include
user@dbe3cc25cd7c:~$
Ok, can you post the output of running with strace as suggested here? https://forums.developer.nvidia.com/t/optix-error-failed-to-load-optix-library/70671/21 Also, what gpu do you use?
What cuda_compute does your GPU support?
My GPU supports cuda_compute 8.6 (NVIDIA A10).
I ran ns-train tetra-nerf ... with strace. These were the most interesting log lines I could find.
futex(0x7f4c1339b518, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/usr/local/lib/python3.10/dist-packages/torch/lib/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 34
newfstatat(34, "", {st_mode=S_IFREG|0644, st_size=42549, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 42549, PROT_READ, MAP_PRIVATE, 34, 0) = 0x7f4c3243f000
close(34) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
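Picking these lines out of a long strace log by hand is tedious. The sketch below filters a log for failed openat() calls on a given library name. The function name and the regular expression are illustrative assumptions, and the pattern is tuned to the line format shown above.

```python
import re

# Matches failed openat() calls such as:
# openat(AT_FDCWD, "/usr/lib/libnvoptix.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (...)
FAILED_OPEN = re.compile(r'openat\([^,]+,\s*"([^"]+)".*=\s*-1\s+ENOENT')


def missing_paths(strace_lines, needle="libnvoptix"):
    """Return paths containing `needle` that openat() failed to find."""
    paths = []
    for line in strace_lines:
        match = FAILED_OPEN.search(line)
        if match and needle in match.group(1):
            paths.append(match.group(1))
    return paths
```

If the trace was written to a file (for example with strace's `-o` option), passing the open file object as `strace_lines` works, because the function only iterates over lines.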
Yes! It is missing the optix library libnvoptix (part of the nvidia driver, not the SDK). Same as here: https://github.com/NVIDIA/nvidia-container-toolkit/issues/187 What docker setup do you use? What version of docker and what version of nvidia container toolkit?
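A quicker pre-check than a full training run is to ask the dynamic loader for the library directly. This sketch assumes that optixInit() looks up the same name the strace output shows being searched for, libnvoptix.so.1. The helper name is hypothetical.

```python
import ctypes


def can_load(library_name="libnvoptix.so.1"):
    """Return (True, None) if the dynamic loader can open the library,
    otherwise (False, error message)."""
    try:
        ctypes.CDLL(library_name)
    except OSError as error:
        return False, str(error)
    return True, None


if __name__ == "__main__":
    ok, error = can_load()
    print("libnvoptix.so.1 loadable" if ok else f"cannot load libnvoptix.so.1: {error}")
```

Running it both on the host and inside the container helps tell a driver installation problem apart from a container toolkit problem.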
I'm using Docker 23.0.1
nvidia-container-cli -V
cli-version: 1.12.0
lib-version: 1.12.0
build date: 2023-02-13T22:52+00:00
build revision:
build compiler: x86_64-linux-gnu-gcc-9 9.4.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -Wdate-time -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -g -O2 -fdebug-prefix-map=/build/libnvidia-container-QG7FJq/libnvidia-container-1.12.0+dfsg=. -fstack-protector-strong -Wformat -Werror=format-security -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections -Wl,-Bsymbolic-functions -Wl,-z,relro
Can you check if the library exists on your host system? It should have been a part of the nvidia driver installation.
Hmmm. find / -name libnvoptix.so.1 doesn't come up with any results. I'll try and reinstall/update the nvidia driver and see if that helps. The drivers are just whatever comes standard in Lambda Cloud instances.
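A narrower alternative to searching the whole filesystem is to look only in the usual library directories. The directory list below is an assumption for Debian/Ubuntu-style hosts, and the function name is illustrative.

```python
import glob
import os

# Typical library directories on Debian/Ubuntu-style systems; this list is an
# assumption and may need extending for other distributions.
DEFAULT_DIRS = (
    "/usr/lib/x86_64-linux-gnu",
    "/usr/lib64",
    "/usr/lib",
    "/lib/x86_64-linux-gnu",
)


def find_optix_driver_libs(directories=DEFAULT_DIRS):
    """Return every libnvoptix.so* file found in the given directories."""
    found = []
    for directory in directories:
        found.extend(sorted(glob.glob(os.path.join(directory, "libnvoptix.so*"))))
    return found


if __name__ == "__main__":
    print(find_optix_driver_libs() or "no libnvoptix library found")
```

On systems where /lib is a symlink to /usr/lib, the same file can show up under two paths.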
After re-installing the Nvidia drivers on the host via apt-get, it seems to be working!
Thanks so much for all the help. I'll let you know how the results look in 13 hours or so.
I am glad you were able to find the source of the error. Thank you for investing your time into debugging.
I ran into this problem too. Does libnvoptix.so.1 appear once the NVIDIA driver is re-installed? I have no permission to modify the driver 😭 Are there any other solutions?
I am sorry, but I believe the only solution is fixing the driver. Manually copying the libnvoptix lib (matching version) and adding it to the library path could also work, but that is unlikely.
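If someone does try the manual copy, the "matching version" part can be checked mechanically. This sketch assumes the driver ships the library under a name of the form libnvoptix.so.<driver version>, with libnvoptix.so.1 as a symlink to it. That naming is an assumption to verify on a machine with a working driver, and both helper names are hypothetical.

```python
import re


def library_version(filename):
    """Extract '525.85.12' from a name like 'libnvoptix.so.525.85.12'.

    Returns None for names without a dotted driver version, e.g. the
    'libnvoptix.so.1' symlink.
    """
    match = re.search(r"libnvoptix\.so\.(\d+(?:\.\d+)+)$", filename)
    return match.group(1) if match else None


def matches_driver(filename, driver_version):
    """True only if the version embedded in the file name equals the driver version."""
    return library_version(filename) == driver_version
```

A mismatch here would be a strong hint that the copied library will not work with the installed kernel driver.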
Thank you for your reply. I will go ahead and try it.
Same issue with Docker. Here's how I fixed it, for anyone looking here:
Installed OptiX manually from https://developer.nvidia.com/designworks/optix/download (this downloads NVIDIA-OptiX-SDK-7.7.0-linux64-x86_64.sh, which then extracts into the folder "NVIDIA-OptiX-SDK-7.7.0-linux64-x86_64").
Uninstalled and then re-installed the NVIDIA drivers.
Pointed to the location of my data and to the OptiX install path:
docker run -v C:/Users/Thomas/Desktop/PythonProjects/Nerf/Nerfstudio/Colmap_Data/TempDocker:/workspace/mydata --rm -it --gpus all -p 7007:7007 kulhanek/tetra-nerf:latest
export OPTIX_PATH=/workspace/mydata/NVIDIA-OptiX-SDK-7.7.0-linux64-x86_64
export NVIDIA_DRIVER_CAPABILITIES=graphics,compute,utility
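Before launching training it can help to confirm that OPTIX_PATH really points at an extracted SDK. The sketch below looks for include/optix.h, the same header the Dockerfile build step earlier in this thread checks for under /opt/optix. The function name is hypothetical.

```python
import os


def optix_sdk_problem(optix_path):
    """Return a description of what is wrong with the SDK path, or None if it looks usable."""
    if not optix_path:
        return "OPTIX_PATH is not set"
    header = os.path.join(optix_path, "include", "optix.h")
    if not os.path.isfile(header):
        return f"{header} does not exist"
    return None


if __name__ == "__main__":
    problem = optix_sdk_problem(os.environ.get("OPTIX_PATH"))
    print(problem or "OptiX SDK headers found")
```

This only validates the SDK headers. The libnvoptix.so.1 runtime library still has to come from the NVIDIA driver on the host.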
I am so close to getting this working, but am getting an error when running ns-train.
Steps To Reproduce
This is all being run on a Lambda Cloud VM with an A10 GPU.
nerf-data folder (all future commands are run inside the container)
export NERFSTUDIO_METHOD_CONFIGS="tetra-nerf=tetranerf.nerfstudio.registration:tetranerf"
This is where things fail
ns-train tetra-nerf --pipeline.model.tetrahedra-path $FOLDER/sparse.th minimal-parser --data $FOLDER