nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Not possible to train when using `--machine.num-gpus 2` #2128

Open jayhsu0627 opened 1 year ago

jayhsu0627 commented 1 year ago

Describe the bug

GPU: NVIDIA RTX A6000

Using the "first nerf" tutorial, I am unable to start training with either of these commands:

# 1 GPU (8192 rays per GPU per batch)
export CUDA_VISIBLE_DEVICES=0
ns-train nerfacto-big --vis viewer+wandb --machine.num-gpus 1 --pipeline.datamanager.train-num-rays-per-batch 4096 --data data/nerfstudio/aspen

or

# 2 GPUs (4096 rays per GPU per batch, effectively 8192 rays per batch)
export CUDA_VISIBLE_DEVICES=0,1
ns-train nerfacto --vis viewer+wandb --machine.num-gpus 2 --pipeline.datamanager.train-num-rays-per-batch 4096 --data data/nerfstudio/aspen

Both fail with this error:

usage: ns-train nerfacto-big [-h] [--output-dir PATH] [--method-name {None}|STR] [--experiment-name {None}|STR] [--project-name {None}|STR] [--timestamp STR]
                             [--machine.seed INT] [--machine.num-devices INT] [--machine.num-machines INT] [--machine.machine-rank INT] [--machine.dist-url STR]
                             [--machine.device-type {cpu,cuda,mps}] [--logging.relative-log-dir PATH] [--logging.steps-per-log INT] [--logging.max-buffer-size INT]
                             [--logging.local-writer.enable {True,False}]
                             [--logging.local-writer.stats-to-track {ITER_TRAIN_TIME,TOTAL_TRAIN_TIME,ETA,TRAIN_RAYS_PER_SEC,TEST_RAYS_PER_SEC,VIS_RAYS_PER_SEC,CURR_TEST_PSNR} [...]]
                             [--logging.local-writer.max-log-size INT] [--logging.profiler {none,basic,pytorch}] [--viewer.relative-log-filename STR]
                             [--viewer.websocket-port {None}|INT] [--viewer.websocket-port-default INT] [--viewer.websocket-host STR]
                             [--viewer.num-rays-per-chunk INT] [--viewer.max-num-display-images INT] [--viewer.quit-on-train-completion {True,False}]
                             [--viewer.image-format {jpeg,png}] [--viewer.jpeg-quality INT] [--pipeline.datamanager.data {None}|PATH]
                             [--pipeline.datamanager.camera-optimizer.mode {off,SO3xR3,SE3}] [--pipeline.datamanager.camera-optimizer.position-noise-std FLOAT]
                             [--pipeline.datamanager.camera-optimizer.orientation-noise-std FLOAT] [--pipeline.datamanager.camera-optimizer.optimizer.lr FLOAT]
                             [--pipeline.datamanager.camera-optimizer.optimizer.eps FLOAT] [--pipeline.datamanager.camera-optimizer.optimizer.max-norm {None}|FLOAT]
                             [--pipeline.datamanager.camera-optimizer.optimizer.weight-decay FLOAT]
                             [--pipeline.datamanager.camera-optimizer.scheduler.lr-pre-warmup FLOAT]
                             [--pipeline.datamanager.camera-optimizer.scheduler.lr-final {None}|FLOAT]
                             [--pipeline.datamanager.camera-optimizer.scheduler.warmup-steps INT] [--pipeline.datamanager.camera-optimizer.scheduler.max-steps INT]
                             [--pipeline.datamanager.camera-optimizer.scheduler.ramp {linear,cosine}] [--pipeline.datamanager.masks-on-gpu {None,True,False}]
                             [--pipeline.datamanager.train-num-rays-per-batch INT] [--pipeline.datamanager.train-num-images-to-sample-from INT]
                             [--pipeline.datamanager.train-num-times-to-repeat-images INT] [--pipeline.datamanager.eval-num-rays-per-batch INT]
                             [--pipeline.datamanager.eval-num-images-to-sample-from INT] [--pipeline.datamanager.eval-num-times-to-repeat-images INT]
                             [--pipeline.datamanager.eval-image-indices {None}|{INT [INT ...]}] [--pipeline.datamanager.camera-res-scale-factor FLOAT]
                             [--pipeline.datamanager.patch-size INT] [--pipeline.model.enable-collider {True,False}]
                             [--pipeline.model.collider-params {None}|{STR FLOAT [STR FLOAT ...]}] [--pipeline.model.loss-coefficients.rgb-loss-coarse FLOAT]
                             [--pipeline.model.loss-coefficients.rgb-loss-fine FLOAT] [--pipeline.model.eval-num-rays-per-chunk INT]
                             [--pipeline.model.prompt {None}|STR] [--pipeline.model.near-plane FLOAT] [--pipeline.model.far-plane FLOAT]
                             [--pipeline.model.background-color {random,last_sample,black,white}] [--pipeline.model.hidden-dim INT]
                             [--pipeline.model.hidden-dim-color INT] [--pipeline.model.hidden-dim-transient INT] [--pipeline.model.num-levels INT]
                             [--pipeline.model.base-res INT] [--pipeline.model.max-res INT] [--pipeline.model.log2-hashmap-size INT]
                             [--pipeline.model.features-per-level INT] [--pipeline.model.num-proposal-samples-per-ray INT [INT ...]]
                             [--pipeline.model.num-nerf-samples-per-ray INT] [--pipeline.model.proposal-update-every INT] [--pipeline.model.proposal-warmup INT]
                             [--pipeline.model.num-proposal-iterations INT] [--pipeline.model.use-same-proposal-network {True,False}]
                             [--pipeline.model.proposal-net-args-list.0.hidden-dim INT] [--pipeline.model.proposal-net-args-list.0.log2-hashmap-size INT]
                             [--pipeline.model.proposal-net-args-list.0.num-levels INT] [--pipeline.model.proposal-net-args-list.0.max-res INT]
                             [--pipeline.model.proposal-net-args-list.0.use-linear {True,False}] [--pipeline.model.proposal-net-args-list.1.hidden-dim INT]
                             [--pipeline.model.proposal-net-args-list.1.log2-hashmap-size INT] [--pipeline.model.proposal-net-args-list.1.num-levels INT]
                             [--pipeline.model.proposal-net-args-list.1.max-res INT] [--pipeline.model.proposal-net-args-list.1.use-linear {True,False}]
                             [--pipeline.model.proposal-initial-sampler {piecewise,uniform}] [--pipeline.model.interlevel-loss-mult FLOAT]
                             [--pipeline.model.distortion-loss-mult FLOAT] [--pipeline.model.orientation-loss-mult FLOAT]
                             [--pipeline.model.pred-normal-loss-mult FLOAT] [--pipeline.model.use-proposal-weight-anneal {True,False}]
                             [--pipeline.model.use-average-appearance-embedding {True,False}] [--pipeline.model.proposal-weights-anneal-slope FLOAT]
                             [--pipeline.model.proposal-weights-anneal-max-num-iters INT] [--pipeline.model.use-single-jitter {True,False}]
                             [--pipeline.model.predict-normals {True,False}] [--pipeline.model.disable-scene-contraction {True,False}]
                             [--pipeline.model.use-gradient-scaling {True,False}] [--pipeline.model.implementation {tcnn,torch}]
                             [--pipeline.model.appearance-embed-dim INT] [--optimizers.proposal-networks.optimizer.lr FLOAT]
                             [--optimizers.proposal-networks.optimizer.eps FLOAT] [--optimizers.proposal-networks.optimizer.max-norm {None}|FLOAT]
                             [--optimizers.proposal-networks.optimizer.weight-decay FLOAT] [--optimizers.proposal-networks.scheduler {None}]
                             [--optimizers.fields.optimizer.lr FLOAT] [--optimizers.fields.optimizer.eps FLOAT] [--optimizers.fields.optimizer.max-norm {None}|FLOAT]
                             [--optimizers.fields.optimizer.weight-decay FLOAT] [--optimizers.fields.scheduler.lr-pre-warmup FLOAT]
                             [--optimizers.fields.scheduler.lr-final {None}|FLOAT] [--optimizers.fields.scheduler.warmup-steps INT]
                             [--optimizers.fields.scheduler.max-steps INT] [--optimizers.fields.scheduler.ramp {linear,cosine}]
                             [--vis {viewer,wandb,tensorboard,viewer+wandb,viewer+tensorboard,viewer_beta}] [--data {None}|PATH] [--prompt {None}|STR]
                             [--relative-model-dir PATH] [--steps-per-save INT] [--steps-per-eval-batch INT] [--steps-per-eval-image INT]
                             [--steps-per-eval-all-images INT] [--max-num-iterations INT] [--mixed-precision {True,False}] [--use-grad-scaler {True,False}]
                             [--save-only-latest-checkpoint {True,False}] [--load-dir {None}|PATH] [--load-step {None}|INT] [--load-config {None}|PATH]
                             [--load-checkpoint {None}|PATH] [--log-gradients {True,False}] [--gradient-accumulation-steps INT]
                             [{nerfstudio-data,minimal-parser,arkit-data,blender-data,instant-ngp-data,nuscenes-data,dnerf-data,phototourism-data,dycheck-data,scannet-data,sdfstudio-data,nerfosr-data,sitcoms3d-data}]

ns-train nerfacto-big: error: argument [{nerfstudio-data,minimal-parser,arkit-data,blender-data,instant-ngp-data,nuscenes-data,dnerf-data,phototourism-data,dycheck-data,scannet-data,sdfstudio-data,nerfosr-data,sitcoms3d-data}]: invalid choice: '1' (choose from 'nerfstudio-data', 'minimal-parser', 'arkit-data', 'blender-data', 'instant-ngp-data', 'nuscenes-data', 'dnerf-data', 'phototourism-data', 'dycheck-data', 'scannet-data', 'sdfstudio-data', 'nerfosr-data', 'sitcoms3d-data')

To Reproduce

Build the nightly version from source:

git clone https://github.com/nerfstudio-project/nerfstudio.git
cd nerfstudio
pip install --upgrade pip setuptools
pip install -e .

then follow the multi-GPU part of the tutorial.
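A quick way to confirm which flag name the installed build actually exposes (assuming a POSIX shell; the grep pattern is only illustrative) is to search the help output:

ns-train nerfacto --help | grep -E "num-gpus|num-devices"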

Possible solution

Changing `--machine.num-gpus` to `--machine.num-devices` might work, since the usage output above lists `--machine.num-devices` but no `--machine.num-gpus` (which is presumably why the stray `1`/`2` ends up being parsed as the positional dataparser argument and rejected as an invalid choice):

ns-train nerfacto --vis viewer --machine.num-devices 2 --data data/nerfstudio/poster
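If only the flag name changed, the original two-GPU command from the tutorial would presumably become the following (untested sketch, keeping the same ray budget and data path as before):

export CUDA_VISIBLE_DEVICES=0,1
ns-train nerfacto --vis viewer+wandb --machine.num-devices 2 --pipeline.datamanager.train-num-rays-per-batch 4096 --data data/nerfstudio/aspen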

However, even with `--machine.num-devices`, training still stops after the dataloader and the viewer never shows up:

           Saving config to: outputs/poster/nerfacto/2023-06-25_012759/config.yml               experiment_config.py:134
[01:28:02] Saving checkpoints to: outputs/poster/nerfacto/2023-06-25_012759/nerfstudio_models             trainer.py:135
[01:28:02] Saving checkpoints to: outputs/poster/nerfacto/2023-06-25_012759/nerfstudio_models             trainer.py:135
           Auto image downscale factor of 2                                                 nerfstudio_dataparser.py:324
           Auto image downscale factor of 2                                                 nerfstudio_dataparser.py:324
Setting up training dataset...
Caching all 204 images.
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--Setting up training dataset...
Caching all 204 images.
Setting up evaluation dataset...
Caching all 22 images.
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--Setting up evaluation dataset...
Caching all 22 images.
AX-I commented 1 year ago

Yes, it seems like the option was changed to `num-devices`. After the dataloader it does take a while to initialize. If you check CPU usage, the process should be running at 100%; just wait.
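To verify the processes are indeed still busy while you wait, generic Linux commands work (not nerfstudio-specific; `ns-train` in the pgrep pattern is just the process name to match):

# CPU usage of the ns-train processes
top -p $(pgrep -d',' -f ns-train)

# GPU utilization across the visible devices
nvidia-smi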