nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Training #24

Closed nsaadati closed 1 year ago

nsaadati commented 1 year ago

Hi, I'm running the training step (sh scripts/train_test_cls.sh) and I'm getting the error below. Can you help me, please?

/home/exouser/.conda/envs/bdetr3d/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

FutureWarning
usage: launch.py [-h] [--nnodes NNODES] [--nproc_per_node NPROC_PER_NODE] [--rdzv_backend RDZV_BACKEND] [--rdzv_endpoint RDZV_ENDPOINT] [--rdzv_id RDZV_ID] [--rdzv_conf RDZV_CONF] [--standalone] [--max_restarts MAX_RESTARTS] [--monitor_interval MONITOR_INTERVAL] [--start_method {spawn,fork,forkserver}] [--role ROLE] [-m] [--no_python] [--run_path] [--log_dir LOG_DIR] [-r REDIRECTS] [-t TEE] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR] [--master_port MASTER_PORT] [--use_env] training_script ...
launch.py: error: argument --master_port: invalid int value: 'train_dist_mod.py'
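(Editor's note: the FutureWarning above suggests that scripts should read the local rank from the LOCAL_RANK environment variable rather than a --local_rank argument when migrating to torchrun. A minimal sketch of that change, not the repo's actual code:)

```python
import os

# torchrun exports LOCAL_RANK for each worker process;
# fall back to 0 so the script also runs as a plain single process
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(local_rank)
```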

ayushjain1144 commented 1 year ago

Hi @nsaadati ,

We did not face any such issues; make sure you are using the same PyTorch version that we mention in the README. (Also, try bash scripts/train_test_cls.sh instead of sh scripts/train_test_cls.sh. On one of our machines, we get errors with master_port when we run the script with sh. But I think an incorrect PyTorch version is more likely.)

nsaadati commented 1 year ago

Thanks for your response. The other problem is that the first training command (sh scripts/train_test_det.sh) has been running for a week and is now on epoch 198. Is that normal?

ayushjain1144 commented 1 year ago

If you are running the model for sr3d (i.e. the original train_test_det.sh), it should converge to a good number in ~30 epochs (which takes about a day). So 198 epochs in a week is normal, but you don't need to run it for that many.

Also, if you are able to run train_test_det.sh successfully, it's very weird that you are facing issues with train_test_cls.sh.

nsaadati commented 1 year ago

Yeah, that's weird for me too. Can I kill the code while it's running? I did not change the epoch number, so I don't know why it runs for so many epochs. Is it okay to kill it in the middle of a run?

ayushjain1144 commented 1 year ago

Yes, it saves checkpoints every 5 epochs, so you can just kill it and evaluate the checkpoint from around the ~30th epoch. The code is set up to run for a lot of epochs; we usually do early stopping manually when the validation accuracy starts dropping. If you want it to stop automatically, you can change the --max_epochs argument.
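(Editor's note: the manual early stopping described above could be automated along these lines. This is a hypothetical sketch, not code from this repo; the function name and patience value are illustrative.)

```python
def should_stop(val_accs, patience=2):
    """Stop once validation accuracy has not improved for `patience` epochs.

    val_accs: per-epoch validation accuracies, oldest first.
    """
    if len(val_accs) <= patience:
        return False
    best_so_far = max(val_accs[:-patience])
    # stop only if none of the last `patience` epochs beat the earlier best
    return all(acc <= best_so_far for acc in val_accs[-patience:])
```

For example, `should_stop([0.50, 0.60, 0.55, 0.54])` returns True (accuracy has dropped for two epochs), while `should_stop([0.50, 0.60, 0.65])` returns False.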

ayushjain1144 commented 1 year ago

Closing for now; feel free to reopen if you face any further issues!