nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Custom training is not seeing cuda, colmap and cuda issue? #2200

Open gschian0 opened 1 year ago

gschian0 commented 1 year ago

The viewer works for playing back the supplied scenes, so I do have access to the GPU from the Docker container ... this is my docker run command:

sudo docker run --gpus all -v `pwd`/datasets:/datasets -v `pwd`:/workspace/ -v `pwd`/.cache:/home/user/.cache/ -p 7007:7007 --rm -it 0043277ff719
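For comparison, the nerfstudio Docker instructions suggest an invocation along the lines below; the mount paths, user mapping, and shared-memory size here are assumptions to adapt to this machine, not the exact setup above:

    docker run --gpus all \
        -u $(id -u) \
        -v $(pwd)/data:/workspace/ \
        -v $HOME/.cache/:/home/user/.cache/ \
        -p 7007:7007 \
        --rm -it \
        --shm-size=12gb \
        dromni/nerfstudio:0.3.2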

I am running this image:

    REPOSITORY          TAG     IMAGE ID       CREATED       SIZE
    dromni/nerfstudio   0.3.2   0043277ff719   2 weeks ago   21.2GB

I am using NVIDIA NGC on VULTR

vCPU/s: 12 vCPUs
RAM: 131072.00 MB
Storage: 700 GB NVMe
Bandwidth: 1.23 GB
Label: zen-boom
OS: Ubuntu 22.04 LTS
Application: NVIDIA NGC (https://www.vultr.com/marketplace/apps/nvidia-ngc/)

Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A16-16Q       On  | 00000000:06:00.0 Off |                    0 |
| N/A   N/A    P8    N/A /  N/A |      0MiB / 16384MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A16-16Q       On  | 00000000:07:00.0 Off |                    0 |
| N/A   N/A    P8    N/A /  N/A |      0MiB / 16384MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Describe the bug

────────────────────────── 💀 💀 💀 ERROR 💀 💀 💀 ──────────────────────────
Error running command: colmap feature_extractor --database_path data/colmap/database.db --image_path data/images --ImageReader.single_camera 1 --ImageReader.camera_model OPENCV --SiftExtraction.use_gpu 1
──────────────────────────────────────────────────────────────────────────────
CUDA error at /colmap/src/util/cuda.cc:59 - no CUDA-capable device is detected
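Before digging into COLMAP itself, it may help to confirm the running container can see a CUDA device at all; a quick sanity check, assuming torch is importable inside the image (it should be for nerfstudio):

    # run inside the running container
    nvidia-smi
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

If torch.cuda.is_available() prints False here, the problem is the container/driver setup rather than COLMAP.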

To Reproduce
Steps to reproduce the behavior: run this command

    ns-process-data video --data data/g3dvid.mp4 --output-dir ./data/

Expected behavior
Process the video and create a NeRF training.
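If the GPU stays invisible to COLMAP, one possible workaround is to fall back to CPU feature extraction. The --no-gpu flag is an assumption about this nerfstudio version's CLI (it should map to COLMAP's --SiftExtraction.use_gpu 0); check ns-process-data video --help before relying on it:

    # slower, but avoids the CUDA check inside COLMAP
    ns-process-data video --data data/g3dvid.mp4 --output-dir ./data/ --no-gpu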

gschian0 commented 1 year ago

I think this error is related to whatever the problem is ...

[NOTE] Not running eval iterations since only viewer is enabled. Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled tensorboard/wandb event writers
GPUMemoryArena: Warning: GPU 0 does not support virtual memory. Falling back to regular allocations, which will be larger and can cause occasional stutter.
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 3.4315
VanillaPipeline.get_train_loss_dict: 3.4051

Traceback (most recent call last):
  /home/user/.local/bin/ns-train:8 in <module>
      sys.exit(entrypoint())
  /home/user/nerfstudio/nerfstudio/scripts/train.py:261 in entrypoint
      main(tyro.cli(AnnotatedBaseConfigUnion, description=convert_markup_to_ansi(__doc__), ...
  /home/user/nerfstudio/nerfstudio/scripts/train.py:246 in main
      launch(main_func=train_loop, num_devices_per_machine=config.machine.num_devices, device_type=config.machine.device_type, ...
  /home/user/nerfstudio/nerfstudio/scripts/train.py:189 in launch
      main_func(local_rank=0, world_size=world_size, config=config)
  /home/user/nerfstudio/nerfstudio/scripts/train.py:100 in train_loop
      trainer.train()
  /home/user/nerfstudio/nerfstudio/engine/trainer.py:255 in train
      loss, loss_dict, metrics_dict = self.train_iteration(step)
  /home/user/nerfstudio/nerfstudio/utils/profiler.py:127 in inner
      out = func(*args, **kwargs)
  /home/user/nerfstudio/nerfstudio/engine/trainer.py:471 in train_iteration
      self.grad_scaler.scale(loss).backward()  # type: ignore
  /home/user/.local/lib/python3.10/site-packages/torch/_tensor.py:487 in backward
      torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  /home/user/.local/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in backward
      Variable._execution_engine.run_backward(tensors, grad_tensors_, retain_graph, create_graph, inputs, allow_unreachable=True, accumulate_grad=True)
  /home/user/.local/lib/python3.10/site-packages/torch/autograd/function.py:274 in apply
      return user_fn(self, *args)
  /home/user/.local/lib/python3.10/site-packages/tinycudann/modules.py:107 in backward
      input_grad, params_grad = _module_function_backward.apply(ctx, doutput, input, p...
  /home/user/.local/lib/python3.10/site-packages/torch/autograd/function.py:506 in apply
      return super().apply(*args, **kwargs)  # type: ignore[misc]
  /home/user/.local/lib/python3.10/site-packages/tinycudann/modules.py:118 in forward
      input_grad, params_grad = ctx_fwd.native_tcnn_module.bwd(ctx_fwd.native_ctx, ...

RuntimeError: Could not allocate memory: /tmp/pip-req-build-vhymsuw/include/tiny-cuda-nn/gpu_memory.h:123 cudaMalloc(&rawptr, n_bytes+DEBUG_GUARD_SIZE*2) failed with error out of memory
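The final RuntimeError is tiny-cuda-nn failing a cudaMalloc during the backward pass, i.e. the GPU ran out of memory mid-training. One common knob, assuming the default nerfacto config, is to shrink the ray batch; the values below are illustrative, not tuned:

    ns-train nerfacto --data data/nerfstudio/poster \
        --pipeline.datamanager.train-num-rays-per-batch 1024 \
        --pipeline.model.eval-num-rays-per-chunk 4096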

gschian0 commented 1 year ago

I'm using the 2 GPUs for training with the command ns-train nerfacto --data data/nerfstudio/poster --machine.num-devices 2. I'm not sure how to do that with the custom data...
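For custom data the same flags should apply once ns-process-data has written a dataset; a sketch, assuming the output directory from the earlier command (./data/ containing transforms.json and images/):

    ns-train nerfacto --data ./data --machine.num-devices 2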

tancik commented 1 year ago

Multi-gpu doesn't work well with the nerfacto model.
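Given that, restricting the run to a single GPU may be the safer path; one way to pin the process to one device (a standard CUDA environment variable, not nerfstudio-specific):

    CUDA_VISIBLE_DEVICES=0 ns-train nerfacto --data ./data --machine.num-devices 1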