kakaxi314 / BP-Net

Implementation of our paper 'Bilateral Propagation Network for Depth Completion'
MIT License
52 stars, 4 forks

Training process issue #7

Open stw2021 opened 1 month ago

stw2021 commented 1 month ago

Dear Authors, thanks for the amazing work! When I run:

torchrun --nproc_per_node=4 --master_port 4321 train.py gpus=[0] num_workers=4 name=BP_KITTI net=PMP data=KITTI lr=1e-3 train_batch_size=2 test_batch_size=2 sched/lr=NoiseOneCycleCosMo sched.lr.policy.max_momentum=0.90 nepoch=30 test_epoch=25 ++net.sbn=true

I get the following output:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Error executing job with overrides: ['gpus=[0]', 'num_workers=4', 'name=BP_KITTI', 'net=PMP', 'data=KITTI', 'lr=1e-3', 'train_batch_size=2', 'test_batch_size=2', 'sched/lr=NoiseOneCycleCosMo', 'sched.lr.policy.max_momentum=0.90', 'nepoch=30', 'test_epoch=25', '++net.sbn=true']
Error executing job with overrides: ['gpus=[0]', 'num_workers=4', 'name=BP_KITTI', 'net=PMP', 'data=KITTI', 'lr=1e-3', 'train_batch_size=2', 'test_batch_size=2', 'sched/lr=NoiseOneCycleCosMo', 'sched.lr.policy.max_momentum=0.90', 'nepoch=30', 'test_epoch=25', '++net.sbn=true']
Error executing job with overrides: ['gpus=[0]', 'num_workers=4', 'name=BP_KITTI', 'net=PMP', 'data=KITTI', 'lr=1e-3', 'train_batch_size=2', 'test_batch_size=2', 'sched/lr=NoiseOneCycleCosMo', 'sched.lr.policy.max_momentum=0.90', 'nepoch=30', 'test_epoch=25', '++net.sbn=true']
[2024-06-03 15:41:57,958][BP_KITTI][INFO] - device is 0
[2024-06-03 15:41:57,958][BP_KITTI][INFO] - Random Seed: 0001
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "train.py", line 72, in main
    with Trainer(cfg) as run:
  File "train.py", line 72, in main
    with Trainer(cfg) as run:
  File "train.py", line 72, in main
    with Trainer(cfg) as run:
  File "/home/wsy/python_ws/BP-Net/utils.py", line 52, in __init__
    self.cfg.gpu_id = self.cfg.gpus[self.rank]
  File "/home/wsy/python_ws/BP-Net/utils.py", line 52, in __init__
    self.cfg.gpu_id = self.cfg.gpus[self.rank]
  File "/home/wsy/python_ws/BP-Net/utils.py", line 52, in __init__
    self.cfg.gpu_id = self.cfg.gpus[self.rank]
omegaconf.errors.ConfigIndexError: list index out of range
    full_key: gpus[2]
    object_type=list
omegaconf.errors.ConfigIndexError: list index out of range
    full_key: gpus[3]
    object_type=list
omegaconf.errors.ConfigIndexError: list index out of range
    full_key: gpus[1]
    object_type=list

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
2024-06-03 15:41:58.034063: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2024-06-03 15:41:58,921][BP_KITTI][INFO] - num_train = 42949, num_test = 500
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 188296 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 188297) of binary: /home/wsy/anaconda3/envs/bevnet/bin/python
Traceback (most recent call last):
  File "/home/wsy/anaconda3/envs/bevnet/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wsy/anaconda3/envs/bevnet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/wsy/anaconda3/envs/bevnet/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/wsy/anaconda3/envs/bevnet/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/wsy/anaconda3/envs/bevnet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wsy/anaconda3/envs/bevnet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-03_15:42:02
  host      : wsy-OMEN-by-HP-Gaming-Laptop-16-wf0xxx
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 188298)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-03_15:42:02
  host      : wsy-OMEN-by-HP-Gaming-Laptop-16-wf0xxx
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 188299)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-03_15:42:02
  host      : wsy-OMEN-by-HP-Gaming-Laptop-16-wf0xxx
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 188297)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

May I ask how I should solve this? Thanks!

kakaxi314 commented 1 month ago

We can see from your command torchrun --nproc_per_node=4 --master_port 4321 train.py gpus=[0] ... that you launch 4 processes but provide only 1 GPU id. You should give 4 GPU ids (e.g. gpus=[0,1,2,3]) for the 4 processes. If you only have 1 GPU, you can directly use python train.py gpus=[0] ... to run the code.
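For reference, a minimal Python sketch (illustrative only, not code from the repo) of why the launch fails: torchrun --nproc_per_node=4 starts four workers with ranks 0-3, and each worker indexes cfg.gpus with its rank at utils.py line 52, so a single-element gpus list breaks ranks 1-3.

# Illustrative reproduction of the ConfigIndexError seen in the log above.
gpus = [0]  # gpus=[0] as passed on the command line
for rank in range(4):  # the four processes started by --nproc_per_node=4
    try:
        print(f"rank {rank} -> GPU {gpus[rank]}")  # mirrors cfg.gpus[self.rank]
    except IndexError:
        print(f"rank {rank} -> IndexError: list index out of range")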

stw2021 commented 1 month ago

Thank you very much for your reply! I have solved this problem. But when I test with the provided checkpoints, the result is different from the paper (attached screenshot: screenshot-20240603-171507).

kakaxi314 commented 1 month ago

Please show me your command.

stw2021 commented 1 month ago

The command is:

HYDRA_FULL_ERROR=1 python test.py gpus=[0] name=BP_KITTI ++chpt=BP_KITTI net=PMP num_workers=4 data=KITTI data.testset.mode=test data.testset.height=352 test_batch_size=1 metric=RMSE ++save=true

kakaxi314 commented 1 month ago

You evaluated the test set. Since we cannot obtain the ground truth of the KITTI test set, I simply used the input sparse depth map as the ground truth.

Thus, the displayed RMSE on the test set is meaningless. You need to submit the saved results to the KITTI server for a fair evaluation.
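To make that concrete, here is a minimal sketch (a toy helper, not code from this repo) of the masked RMSE you get when the sparse input stands in for the ground truth; it only measures agreement at the sparse input points, not the quality of the dense completion that the KITTI server evaluates against its withheld dense ground truth.

import torch

def masked_rmse(pred, gt):
    # Only pixels with valid (non-zero) depth in the "ground truth" count.
    mask = gt > 0
    return torch.sqrt(((pred[mask] - gt[mask]) ** 2).mean())

# Toy example: a dense prediction scored against a sparse map that is mostly
# zeros. If that sparse map is also the network's input, this number says
# nothing about the accuracy of the dense completion.
pred = torch.rand(1, 352, 1216) * 80.0
sparse_input = torch.where(torch.rand(1, 352, 1216) < 0.05,   # ~5% valid pixels
                           torch.rand(1, 352, 1216) * 80.0,   # depths up to ~80 m
                           torch.zeros(1, 352, 1216))
print(f"masked RMSE on sparse points: {masked_rmse(pred, sparse_input).item():.3f}")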

stw2021 commented 1 month ago

Got it! Thank you very much for your patient reply.