nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Deadlock situation in xterm from Colab demo #3411

Closed: luh-j closed this issue 1 month ago

luh-j commented 2 months ago

I ran into a problem when running the Colab demo, with either custom data or the data downloaded in the Training section. After xterm launches, training always gets stuck at a RuntimeWarning about os.fork():

/content# ns-train nerfacto --viewer.websocket-port 7007 --viewer.make-share-url True nerfstudio-data --data data/nerfstudio/poster --downscale-factor 4
2024-09-04 23:42:57.343180: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-04 23:42:57.377353: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-04 23:42:57.387574: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-04 23:42:57.410580: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-04 23:42:59.152349: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
    _target=<class 'nerfstudio.engine.trainer.Trainer'>,
    output_dir=PosixPath('outputs'),
    method_name='nerfacto',
    experiment_name=None,
    project_name='nerfstudio-project',
    timestamp='2024-09-04_234307',
    machine=MachineConfig(seed=42, num_devices=1, num_machines=1, machine_rank=0, dist_url='auto', device_type='cuda'),
    logging=LoggingConfig(
        relative_log_dir=PosixPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
                <EventName.ETA: 'ETA (time)'>
            ),
            max_log_size=10
        ),
        profiler='basic'
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        websocket_port=7007,
        websocket_port_default=7007,
        websocket_host='0.0.0.0',
        num_rays_per_chunk=32768,
        max_num_display_images=512,
        quit_on_train_completion=False,
        image_format='jpeg',
        jpeg_quality=75,
        make_share_url=True,
        camera_frustum_scale=0.1,
        default_composite_depth=True
    ),
    pipeline=VanillaPipelineConfig(
        _target=<class 'nerfstudio.pipelines.base_pipeline.VanillaPipeline'>,
        datamanager=ParallelDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.parallel_datamanager.ParallelDataManager'>,
            data=None,
            masks_on_gpu=False,
            images_on_gpu=False,
            dataparser=NerfstudioDataParserConfig(
                _target=<class 'nerfstudio.data.dataparsers.nerfstudio_dataparser.Nerfstudio'>,
                data=PosixPath('data/nerfstudio/poster'),
                scale_factor=1.0,
                downscale_factor=4,
                scene_scale=1.0,
                orientation_method='up',
                center_method='poses',
                auto_scale_poses=True,
                eval_mode='fraction',
                train_split_fraction=0.9,
                eval_interval=8,
                depth_unit_scale_factor=0.001,
                mask_color=None,
                load_3D_points=False
            ),
            train_num_rays_per_batch=4096,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=4096,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            collate_fn=<function nerfstudio_collate at 0x7a3b2104e7a0>,
            camera_res_scale_factor=1.0,
            patch_size=1,
            camera_optimizer=None,
            pixel_sampler=PixelSamplerConfig(
                _target=<class 'nerfstudio.data.pixel_samplers.PixelSampler'>,
                num_rays_per_batch=4096,
                keep_full_image=False,
                is_equirectangular=False,
                ignore_mask=False,
                fisheye_crop_radius=None,
                rejection_sample_mask=True,
                max_num_iterations=100
            ),
            num_processes=1,
            queue_size=2,
            max_thread_workers=None
        ),
        model=NerfactoModelConfig(
            _target=<class 'nerfstudio.models.nerfacto.NerfactoModel'>,
            enable_collider=True,
            collider_params={'near_plane': 2.0, 'far_plane': 6.0},
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=32768,
            prompt=None,
            near_plane=0.05,
            far_plane=1000.0,
            background_color='last_sample',
            hidden_dim=64,
            hidden_dim_color=64,
            hidden_dim_transient=64,
            num_levels=16,
            base_res=16,
            max_res=2048,
            log2_hashmap_size=19,
            features_per_level=2,
            num_proposal_samples_per_ray=(256, 96),
            num_nerf_samples_per_ray=48,
            proposal_update_every=5,
            proposal_warmup=5000,
            num_proposal_iterations=2,
            use_same_proposal_network=False,
            proposal_net_args_list=[
                {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 5, 'max_res': 128, 'use_linear': False},
                {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 5, 'max_res': 256, 'use_linear': False}
            ],
            proposal_initial_sampler='piecewise',
            interlevel_loss_mult=1.0,
            distortion_loss_mult=0.002,
            orientation_loss_mult=0.0001,
            pred_normal_loss_mult=0.001,
            use_proposal_weight_anneal=True,
            use_appearance_embedding=True,
            use_average_appearance_embedding=True,
            proposal_weights_anneal_slope=10.0,
            proposal_weights_anneal_max_num_iters=1000,
            use_single_jitter=True,
            predict_normals=False,
            disable_scene_contraction=False,
            use_gradient_scaling=False,
            implementation='tcnn',
            appearance_embed_dim=32,
            average_init_density=0.01,
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='SO3xR3',
                trans_l2_penalty=0.01,
                rot_l2_penalty=0.001,
                optimizer=None,
                scheduler=None
            )
        )
    ),
    optimizers={
        'proposal_networks': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.01,
                eps=1e-15,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': ExponentialDecaySchedulerConfig(
                _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                lr_pre_warmup=1e-08,
                lr_final=0.0001,
                warmup_steps=0,
                max_steps=200000,
                ramp='cosine'
            )
        },
        'fields': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.01,
                eps=1e-15,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': ExponentialDecaySchedulerConfig(
                _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                lr_pre_warmup=1e-08,
                lr_final=0.0001,
                warmup_steps=0,
                max_steps=200000,
                ramp='cosine'
            )
        },
        'camera_opt': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.001,
                eps=1e-15,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': ExponentialDecaySchedulerConfig(
                _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                lr_pre_warmup=1e-08,
                lr_final=0.0001,
                warmup_steps=0,
                max_steps=5000,
                ramp='cosine'
            )
        }
    },
    vis='viewer',
    data=None,
    prompt=None,
    relative_model_dir=PosixPath('nerfstudio_models'),
    load_scheduler=True,
    steps_per_save=2000,
    steps_per_eval_batch=500,
    steps_per_eval_image=500,
    steps_per_eval_all_images=25000,
    max_num_iterations=30000,
    mixed_precision=True,
    use_grad_scaler=False,
    save_only_latest_checkpoint=True,
    load_dir=None,
    load_step=None,
    load_config=None,
    load_checkpoint=None,
    log_gradients=False,
    gradient_accumulation_steps={}
)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[23:43:08] Saving config to: outputs/unnamed/nerfacto/2024-09-04_234307/config.yml experiment_config.py:136
Saving checkpoints to: outputs/unnamed/nerfacto/2024-09-04_234307/nerfstudio_models trainer.py:138
2024-09-04 23:43:12.232279: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-04 23:43:12.268906: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-04 23:43:12.279597: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-04 23:43:14.233643: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Started threads
Setting up evaluation dataset...
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Caching all 22 images.
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01
╭─────────────── viser ───────────────╮
│             ╷                       │
│   HTTP      │ http://0.0.0.0:7008   │
│   Websocket │ ws://0.0.0.0:7008     │
│             ╵                       │
╰─────────────────────────────────────╯
(viser) Share URL requested!
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called.
os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()

For the Colab demo, I copied the COLMAP setup code into the original demo code and changed nothing else. The output directory contains only config.yml; the nerfstudio_models directory is missing.
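For anyone checking whether a run got past the hang, here is a rough sketch that reports what a run directory contains. The helper name summarize_run_dir is hypothetical, and it assumes checkpoints are files ending in .ckpt inside nerfstudio_models; adjust if your version names them differently.

```python
from pathlib import Path


def summarize_run_dir(run_dir):
    """Report whether a run directory holds a config and any checkpoint files."""
    run_dir = Path(run_dir)
    models_dir = run_dir / "nerfstudio_models"
    # Assumption: checkpoints are stored as "*.ckpt" files in nerfstudio_models.
    if models_dir.is_dir():
        checkpoints = sorted(p.name for p in models_dir.glob("*.ckpt"))
    else:
        checkpoints = []
    return {
        "has_config": (run_dir / "config.yml").is_file(),
        "has_models_dir": models_dir.is_dir(),
        "checkpoints": checkpoints,
    }


if __name__ == "__main__":
    # Run directory taken from the log above.
    print(summarize_run_dir("outputs/unnamed/nerfacto/2024-09-04_234307"))
```

A result with has_config true, has_models_dir false, and no checkpoints matches the state described here: the config was written, but training never reached its first save.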

AntonioMacaronio commented 2 months ago

I was not able to reproduce this issue. I made only one change to the notebook: I added the cell !pip install numpy==1.24.3, because NumPy 2 has been causing problems. With that change, the deadlock did not occur and I was able to train successfully.
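If you want to confirm which NumPy the runtime actually has before training, something like the sketch below might help. The helper names are made up for illustration, and the ">= 2" threshold simply mirrors the pin mentioned above; it is not an official compatibility check.

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a version string such as '1.24.3'."""
    return int(version.split(".")[0])


def numpy_needs_pin(version):
    """True when the version string is NumPy 2 or newer."""
    return major_version(version) >= 2


if __name__ == "__main__":
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        print("numpy is not installed")
    else:
        if numpy_needs_pin(installed):
            print(f"numpy {installed} found; consider: pip install numpy==1.24.3")
        else:
            print(f"numpy {installed} found; no pin needed")
```

In Colab, the runtime usually needs a restart after changing the NumPy version so the new one is picked up.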

If you are trying to run COLMAP, try this PR: https://github.com/nerfstudio-project/nerfstudio/pull/2877

luh-j commented 1 month ago

that works for me! ty