nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

instant-ngp freezes #1225

Open faad3 opened 1 year ago

faad3 commented 1 year ago

I'm trying to train instant-ngp, but the process seems to hang: nothing happens either in the terminal or in the GUI.

Running from docker dromni/nerfstudio:0.1.14

The command: ns-train instant-ngp --data data/nerfstudio/poster

Output:

[15:04:54] Using --data alias for --data.pipeline.datamanager.dataparser.data        train.py:223
──────────────────────────────── Config ────────────────────────────────
ExperimentConfig(
    output_dir=PosixPath('outputs'),
    method_name='instant-ngp',
    experiment_name=None,
    timestamp='2023-01-12_150454',
    machine=MachineConfig(seed=42, num_gpus=1, num_machines=1, machine_rank=0, dist_url='auto'),
    logging=LoggingConfig(
        relative_log_dir=PosixPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>
            ),
            max_log_size=10
        ),
        enable_profiler=True
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        start_train=True,
        zmq_port=None,
        launch_bridge_server=True,
        websocket_port=7007,
        ip_address='127.0.0.1',
        num_rays_per_chunk=64000,
        max_num_display_images=512,
        quit_on_train_completion=False,
        skip_openrelay=False
    ),
    trainer=TrainerConfig(
        steps_per_save=2000,
        steps_per_eval_batch=500,
        steps_per_eval_image=500,
        steps_per_eval_all_images=25000,
        max_num_iterations=30000,
        mixed_precision=True,
        relative_model_dir=PosixPath('nerfstudio_models'),
        save_only_latest_checkpoint=True,
        load_dir=None,
        load_step=None,
        load_config=None
    ),
    pipeline=DynamicBatchPipelineConfig(
        _target=<class 'nerfstudio.pipelines.dynamic_batch.DynamicBatchPipeline'>,
        datamanager=VanillaDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.base_datamanager.VanillaDataManager'>,
            dataparser=NerfstudioDataParserConfig(
                _target=<class 'nerfstudio.data.dataparsers.nerfstudio_dataparser.Nerfstudio'>,
                data=PosixPath('data/nerfstudio/poster'),
                scale_factor=1.0,
                downscale_factor=None,
                scene_scale=1.0,
                orientation_method='up',
                center_poses=True,
                auto_scale_poses=True,
                train_split_percentage=0.9
            ),
            train_num_rays_per_batch=8192,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=1024,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='off',
                position_noise_std=0.0,
                orientation_noise_std=0.0,
                optimizer=AdamOptimizerConfig(_target=<class 'torch.optim.adam.Adam'>, lr=0.0006, eps=1e-15, weight_decay=0),
                scheduler=SchedulerConfig(_target=<class 'nerfstudio.engine.schedulers.ExponentialDecaySchedule'>, lr_final=5e-06, max_steps=10000),
                param_group='camera_opt'
            ),
            camera_res_scale_factor=1.0
        ),
        model=InstantNGPModelConfig(
            _target=<class 'nerfstudio.models.instant_ngp.NGPModel'>,
            enable_collider=False,
            collider_params=None,
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=8192,
            max_num_samples_per_ray=24,
            grid_resolution=128,
            contraction_type=<ContractionType.UN_BOUNDED_SPHERE: 2>,
            cone_angle=0.004,
            render_step_size=0.01,
            near_plane=0.05,
            far_plane=1000.0,
            use_appearance_embedding=False,
            background_color='random'
        ),
        target_num_samples=262144,
        max_num_samples_per_ray=1024
    ),
    optimizers={
        'fields': {
            'optimizer': AdamOptimizerConfig(_target=<class 'torch.optim.adam.Adam'>, lr=0.01, eps=1e-15, weight_decay=0),
            'scheduler': None
        }
    },
    vis='tensorboard',
    data=PosixPath('data/nerfstudio/poster')
)
─────────────────────────────────────────────────────────────────────────
[15:04:54] Saving config to: outputs/data-nerfstudio-poster/instant-ngp/2023-01-12_150454/config.yml        experiment_config.py:122
[15:04:54] Saving checkpoints to: outputs/data-nerfstudio-poster/instant-ngp/2023-01-12_150454/nerfstudio_models        trainer.py:90
logging events to: outputs/data-nerfstudio-poster/instant-ngp/2023-01-12_150454
[15:04:54] Auto image downscale factor of 2        nerfstudio_dataparser.py:294
[15:04:55] Skipping 0 files in dataset split train.        nerfstudio_dataparser.py:156
[15:04:56] Skipping 0 files in dataset split val.        nerfstudio_dataparser.py:156
Setting up training dataset...
Caching all 204 images.
Setting up evaluation dataset...
Caching all 22 images.
None
No checkpoints to load, training from scratch

And after that, nothing happens. Am I doing something wrong? Thank you in advance.

ccysway commented 1 year ago

I had the same issue running the docker image of version 0.1.14. cmd output:

Sending ping to the viewer Bridge Server... Successfully connected.
Sending ping to the viewer Bridge Server... Successfully connected.
[NOTE] Not running eval iterations since only viewer is enabled. Use --vis wandb or --vis tensorboard to run with eval instead.
Disabled tensorboard/wandb event writers
[02:36:54] Auto image downscale factor of 2        nerfstudio_dataparser.py:294
[02:36:55] Skipping 0 files in dataset split train.        nerfstudio_dataparser.py:156
Skipping 0 files in dataset split val.        nerfstudio_dataparser.py:156
Setting up training dataset...
Caching all 204 images.
Setting up evaluation dataset...
Caching all 22 images.
None
No checkpoints to load, training from scratch
( ● ) NerfAcc: Setting up CUDA (This may take a few minutes the first time)
Killed

This "NerfAcc: Setting up CUDA" will excute When using instant-ngp for the first time. But each time the process is automatically killed.Run the command “ns-train instant-ngp --data data/nerfstudio/poster” again with unsuccessful compilation, it would feezes. Maybe you could delete the nerfacc cache and try again,the cache is located in ~/.cache/torch_extensions/py310_cu116 Good luck.

shuimoo commented 1 year ago

I ran into the same issue too.

meneldil12555 commented 1 year ago

> Maybe you could delete the nerfacc cache and try again; the cache is located in ~/.cache/torch_extensions/py310_cu116.

Same problem. This solution is helpful.

Vathys commented 1 year ago

I deleted the cache and reran the command, but it gets killed again and the cycle repeats. Why is the process getting killed in the first place?

machenmusik commented 1 year ago

Probably the first-run CUDA setup, e.g. nerfacc? (There is supposed to be a message to that effect in the server console.) You need to let it finish the first time; after that it should cache the result for future runs. If you cancel out, it will redo the setup on the next try.
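If the first-time setup is being killed rather than cancelled, the usual suspect is the kernel's OOM killer terminating the compile jobs that nerfacc's JIT build (via torch.utils.cpp_extension and ninja) spawns. A hedged sketch of how to check for that and to lower the build's memory footprint — MAX_JOBS is the standard PyTorch extension-builder knob for limiting parallel compile jobs, and the training command is the one from this thread:

```bash
# On the host, see whether the kernel OOM killer stopped the build (the "Killed" above).
dmesg | grep -iE "killed process|out of memory" | tail

# Cap ninja's parallel compile jobs so the one-time nerfacc build needs less memory,
# then let it run to completion once; the compiled extension is cached afterwards.
MAX_JOBS=2 ns-train instant-ngp --data data/nerfstudio/poster
```

If this runs inside a container with a memory limit (or Docker Desktop's VM memory is small), raising that limit is the other obvious lever.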

AndreeInCodeLand commented 10 months ago

> Maybe you could delete the nerfacc cache and try again; the cache is located in ~/.cache/torch_extensions/py310_cu116.

Same problem. This solution is helpful.

For those using Windows, I found the temp files in C:\Users\\AppData\Local\torch_extensions\py38_cu118. After deleting them, I had to rebuild nerfacc with pip using: pip install git+https://github.com/KAIR-BAIR/nerfacc.git