nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Training is not working #1519

Closed danielchiu615 closed 1 year ago

danielchiu615 commented 1 year ago

The program exits and returns to the "(nerfstudio) C:\Users\polyu>" prompt when I run "ns-train nerfacto --data data/nerfstudio/poster". Is anyone else facing the same problem? On my own computer I can complete this step and training gets through its first two lines, but the PC then crashes and shuts down every time (I think the cooling system is at fault). I repeated this step on a more powerful PC provided by the university, and it does not work there either.

Versions: Windows 11 Enterprise, Python 3.9, Git (latest), Visual Studio 2022 Community with C++ development tools, CUDA 11.7, PyTorch 1.13.1

Log:

(nerfstudio) C:\Users\polyu>ns-train nerfacto --data data/nerfstudio/poster --viewer.websocket-port 7011
[18:23:26] Using --data alias for --data.pipeline.datamanager.dataparser.data    train.py:222
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
    _target=<class 'nerfstudio.engine.trainer.Trainer'>,
    output_dir=WindowsPath('outputs'),
    method_name='nerfacto',
    experiment_name=None,
    timestamp='2023-02-27_182326',
    machine=MachineConfig(seed=42, num_gpus=1, num_machines=1, machine_rank=0, dist_url='auto'),
    logging=LoggingConfig(
        relative_log_dir=WindowsPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
                <EventName.ETA: 'ETA (time)'>
            ),
            max_log_size=10
        ),
        enable_profiler=True
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        start_train=True,
        zmq_port=None,
        launch_bridge_server=True,
        websocket_port=7011,
        ip_address='127.0.0.1',
        num_rays_per_chunk=32768,
        max_num_display_images=512,
        quit_on_train_completion=False,
        skip_openrelay=False,
        codec='VP8',
        local=False
    ),
    pipeline=VanillaPipelineConfig(
        _target=<class 'nerfstudio.pipelines.base_pipeline.VanillaPipeline'>,
        datamanager=VanillaDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.base_datamanager.VanillaDataManager'>,
            dataparser=NerfstudioDataParserConfig(
                _target=<class 'nerfstudio.data.dataparsers.nerfstudio_dataparser.Nerfstudio'>,
                data=WindowsPath('data/nerfstudio/poster'),
                scale_factor=1.0,
                downscale_factor=None,
                scene_scale=1.0,
                orientation_method='up',
                center_poses=True,
                auto_scale_poses=True,
                train_split_percentage=0.9,
                depth_unit_scale_factor=0.001
            ),
            train_num_rays_per_batch=4096,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=4096,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='SO3xR3',
                position_noise_std=0.0,
                orientation_noise_std=0.0,
                optimizer=AdamOptimizerConfig(
                    _target=<class 'torch.optim.adam.Adam'>,
                    lr=0.0006,
                    eps=1e-08,
                    max_norm=None,
                    weight_decay=0.01
                ),
                scheduler=SchedulerConfig(
                    _target=<class 'nerfstudio.engine.schedulers.ExponentialDecaySchedule'>,
                    lr_final=5e-06,
                    max_steps=10000
                ),
                param_group='camera_opt'
            ),
            camera_res_scale_factor=1.0
        ),
        model=NerfactoModelConfig(
            _target=<class 'nerfstudio.models.nerfacto.NerfactoModel'>,
            enable_collider=True,
            collider_params={'near_plane': 2.0, 'far_plane': 6.0},
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=32768,
            near_plane=0.05,
            far_plane=1000.0,
            background_color='last_sample',
            num_levels=16,
            max_res=2048,
            log2_hashmap_size=19,
            num_proposal_samples_per_ray=(256, 96),
            num_nerf_samples_per_ray=48,
            proposal_update_every=5,
            proposal_warmup=5000,
            num_proposal_iterations=2,
            use_same_proposal_network=False,
            proposal_net_args_list=[
                {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 5, 'max_res': 128},
                {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 5, 'max_res': 256}
            ],
            interlevel_loss_mult=1.0,
            distortion_loss_mult=0.002,
            orientation_loss_mult=0.0001,
            pred_normal_loss_mult=0.001,
            use_proposal_weight_anneal=True,
            use_average_appearance_embedding=True,
            proposal_weights_anneal_slope=10.0,
            proposal_weights_anneal_max_num_iters=1000,
            use_single_jitter=True,
            predict_normals=False
        )
    ),
    optimizers={
        'proposal_networks': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.01,
                eps=1e-15,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': None
        },
        'fields': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.01,
                eps=1e-15,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': None
        }
    },
    vis='viewer',
    data=WindowsPath('data/nerfstudio/poster'),
    relative_model_dir=WindowsPath('nerfstudio_models'),
    steps_per_save=2000,
    steps_per_eval_batch=500,
    steps_per_eval_image=500,
    steps_per_eval_all_images=25000,
    max_num_iterations=30000,
    mixed_precision=True,
    save_only_latest_checkpoint=True,
    load_dir=None,
    load_step=None,
    load_config=None,
    log_gradients=False
)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[18:23:26] Saving config to:    experiment_config.py:124
           outputs\data\nerfstudio\poster\nerfacto\2023-02-27_182326\config.yml
[18:23:26] Saving checkpoints to:    trainer.py:123
           outputs\data\nerfstudio\poster\nerfacto\2023-02-27_182326\nerfstudio_models
Using ZMQ port: 61414

========================================================================================================================
[Public] Open the viewer at https://viewer.nerf.studio/versions/23-02-3-0/?websocket_url=ws://localhost:7011

Sending ping to the viewer Bridge Server...
Successfully connected.
Sending ping to the viewer Bridge Server...
Successfully connected.
[NOTE] Not running eval iterations since only viewer is enabled. Use --vis wandb or --vis tensorboard to run with eval instead.
Disabled tensorboard/wandb event writers
[18:23:27] Auto image downscale factor of 2    nerfstudio_dataparser.py:314
Skipping 0 files in dataset split train.    nerfstudio_dataparser.py:165
Skipping 0 files in dataset split val.    nerfstudio_dataparser.py:165
Setting up training dataset...
Caching all 204 images.
Setting up evaluation dataset...
Caching all 22 images.
No checkpoints to load, training from scratch

(nerfstudio) C:\Users\polyu>

image

danielchiu615 commented 1 year ago

image

The images were uploaded, but they are not rendering.

Thanks, tancik, for your reply!

I deleted the related files directly from C:/Users/User, reinstalled everything (including Git, CUDA, Python, etc.; I also tried many versions of the dependencies), and repeated the steps in the nerfstudio installation guide, and the issue still happened. Could it be caused by an incomplete or faulty uninstallation, or by a version conflict? It now happens on both my school PC (where I started from a clean installation) and my own PC.

tancik commented 1 year ago

It looks like the training isn't starting. It should print training steps as it trains (instead it looks like it just completes). I am unable to replicate this locally. Is anyone else having this issue?

cyhhhhhhit commented 1 year ago

Hi, a similar issue happened to me. When I try to run "ns-train nerfacto --data data/***", it ends with "Aborted (core dumped)" after "No checkpoints to load, training from scratch", and "nvidia-smi" outputs "GPU Detected Critical Xid Error". Some versions: GPU: 3090, CUDA: 12.0 (system) and 11.7 (Anaconda), PyTorch: 1.13.1

tancik commented 1 year ago

@cyhhhhhhit Is this a new error? Have you been able to run nerfstudio successfully in the past? The nvidia-smi error makes me think that it is unrelated to nerfstudio.

cyhhhhhhit commented 1 year ago

@tancik This is my first time training a nerfstudio model, and I am not sure whether the installation was successful, but I can run other PyTorch models in this environment.

LessThan12Parsecs commented 1 year ago

Hi, I'm having the same issue. It gets to "No checkpoints to load, training from scratch" and stops. The logs do not record any error either.

tancik commented 1 year ago

Given that there are no logged errors, debugging is a bit more tricky. Can you try installing a previous version of nerfstudio to see if it works?

danielchiu615 commented 1 year ago

> Given that there are no logged errors, debugging is a bit more tricky. Can you try installing a previous version of nerfstudio to see if it works?

How do I install an older version? I have tried reinstalling everything (including switching to VS 2019 and CUDA 11.3), and it still won't work.

tancik commented 1 year ago

Can you try the following and let me know if it fixes the issue?

pip uninstall tinycudann
pip install git+https://github.com/NVlabs/tiny-cuda-nn.git@e4c147e36893bef3a065b1e5db69706951ea4d56#subdirectory=bindings/torch
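Because the failure in this thread leaves no traceback, one way to narrow it down is to import each native extension in a separate child process and look at the exit code. The sketch below is an assumption-laden diagnostic, not part of nerfstudio: `probe_import` is a hypothetical helper name, and it only tests the import itself, so a crash that happens later (for example when a network is first created) would not be caught.

```python
import subprocess
import sys


def probe_import(module_name: str, timeout: float = 120.0) -> int:
    """Import `module_name` in a child interpreter and return its exit code.

    0 means the import succeeded. A non-zero code with a Python traceback
    points to an ordinary import error; a non-zero code with no traceback
    suggests the child interpreter was killed by a native crash.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # Show whatever the child managed to print before it exited.
        print(f"{module_name}: exit code {result.returncode}")
        print(result.stderr.strip() or "(no traceback captured)")
    return result.returncode
```

Calling `probe_import("torch")` and then `probe_import("tinycudann")` inside the nerfstudio environment would show which of the two, if either, fails to load.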

zhangguochang commented 1 year ago

> Can you try the following and let me know if it fixes the issue?
> pip uninstall tinycudann
> pip install git+https://github.com/NVlabs/tiny-cuda-nn.git@e4c147e36893bef3a065b1e5db69706951ea4d56#subdirectory=bindings/torch

This works.

danielchiu615 commented 1 year ago

> Can you try and let me know if this fixes the issue.
> pip uninstall tinycudann
> pip install git+https://github.com/NVlabs/tiny-cuda-nn.git@e4c147e36893bef3a065b1e5db69706951ea4d56#subdirectory=bindings/torch

Thanks!!! It is working now. May I ask why that is?

By the way, I am now facing a new problem: the renderer is disconnected from the viewer. I am waiting for the training to complete and looking for solutions in other posts. [image]

danielchiu615 commented 1 year ago

I solved the "render disconnected" with the following steps:

  1. pip uninstall -y cryptography, then pip install cryptography==38

  2. ns-train nerfacto --data DATA --viewer.skip-openrelay True

  3. Pause the training, refresh the browser, and wait

  4. Open only one tab in Chrome

Thank you, tancik and everyone. I hope this will be helpful for others.
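To confirm that the first step took effect in the active environment, the installed version can be read back from package metadata. This is a minimal sketch using only the standard library; `installed_version` and `major_version` are hypothetical helper names, and the expectation of major version 38 comes solely from the step above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version: Optional[str]) -> Optional[int]:
    """Extract the leading integer of a version string such as '38.0.4'."""
    if not version:
        return None
    head = version.split(".", 1)[0]
    return int(head) if head.isdigit() else None
```

For example, `major_version(installed_version("cryptography"))` should give 38 after the downgrade, and None if the package is not installed at all.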

iakasoyr commented 1 year ago

@danielchiu615 Hello, thank you for your useful comments. I also had the same issue and solved the "render disconnected" problem. However, I could not render nerf results. Did you render them?

My current web viewer is shown below. (I expected to be able to render results like the GIF at the top of the official documentation, https://docs.nerf.studio/en/latest/.) [image]

tancik commented 1 year ago

Refer to https://github.com/nerfstudio-project/nerfstudio/issues/765

iakasoyr commented 1 year ago

@tancik Thank you for the reply. I will try using nerfstudio in my local environment. (Currently, I use a remote server with SSH port forwarding.)

cyhhhhhhit commented 1 year ago

@tancik Hi, thank you for your useful suggestion. I fixed this issue (no logged errors, and training stops at "No checkpoints to load, training from scratch") by installing an older version (tinycudann 1.4).

iakasoyr commented 1 year ago

@tancik I confirmed that the viewer can render a trained (and also a still-training) scene in my local environment.

OS version: Ubuntu 18.04
nerfstudio version: 0.1.18
Browser type: google-chrome (version: 110.0.5481.177, Official Build, 64-bit)
Local or remote compute: local
Internet setup: Corporate internet

Command used: ns-train nerfacto --data data/nerfstudio/poster --viewer.skip-openrelay True
[image]

Command used: ns-train nerfacto --data data/nerfstudio/bww_entrance --viewer.skip-openrelay True
[image]

However, I could not render it through the remote environment with SSH port forwarding.

Setup: local PC (Windows 10) <-> remote PC (Ubuntu 18.04, the environment above)
Command (in my local PC terminal): ssh my_remote_pc -L 7007:my_remote_pc:7007
Browser type on the local PC: google-chrome (version: 110.0.5481.178, Official Build, 64-bit)

If I fix this issue, I will post the solution. (I expect it will be difficult...)

tancik commented 1 year ago

The port is likely being blocked by a firewall, either on one of the machines or on the network connection. My guess is that it is due to the corporate network.
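A quick way to test the firewall hypothesis is to check whether a plain TCP connection to the forwarded port succeeds from the local machine. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper name, and port 7007 is taken from the ssh command quoted above. It checks TCP reachability only, so a successful result does not guarantee that the viewer's websocket traffic gets through.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

With the ssh tunnel running, `port_is_open("localhost", 7007)` on the local PC returning False would suggest the tunnel or a firewall is blocking the connection before it reaches the viewer.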