rosinality / ocr-pytorch

Object-Contextual Representations for Semantic Segmentation in PyTorch
MIT License
63 stars 14 forks

Train.py #1

Open ZshahRA opened 4 years ago

ZshahRA commented 4 years ago

Hello all, I would like to run train.py using python -m torch.distributed.launch --nproc_per_node=4 --master_porp=8890 train.py --batch 4 [ADE20K PATH]. However, I get the following error. Could you please share your suggestions?

usage: launch.py [-h] [--nnodes NNODES] [--node_rank NODE_RANK] [--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR] [--master_port MASTER_PORT] [--use_env] [-m] training_script ...
launch.py: error: unrecognized arguments: --master_porp=8890

rosinality commented 4 years ago

Sorry, that was a typo in README.md. The correct command is python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 [ADE20K PATH]

ZshahRA commented 4 years ago

ocr-pytorch-master> python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 [ADE20K PATH]


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Hello rosinality, thank you for your prompt response. I tried again using your new suggestion. Could you please check the following error? Your response would be highly appreciated.

usage: train.py [-h] [--epoch EPOCH] [--batch BATCH] [--size SIZE] [--arch ARCH] [--aux_weight AUX_WEIGHT] [--n_class N_CLASS] [--lr LR] [--l2 L2] [--local_rank LOCAL_RANK] PATH
train.py: error: unrecognized arguments: PATH]
(the same usage/error pair is printed once per spawned process)
Traceback (most recent call last):
  File "D:\Anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\Anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda\lib\site-packages\torch\distributed\launch.py", line 253, in <module>
    main()
  File "D:\Anaconda\lib\site-packages\torch\distributed\launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['D:\Anaconda\python.exe', '-u', 'train.py', '--local_rank=3', '--batch', '4', '[ADE20K', 'PATH]']' returned non-zero exit status 2.
PS D:\PhD_Rahmat\OCR code\ocr-pytorch-master\ocr-pytorch-master>

rosinality commented 4 years ago

You should set [ADE20K PATH] to your dataset path.

ZshahRA commented 4 years ago

Hi rosinality, did you mean setting the path in dataset.py? If yes, where? Could you please provide more detail? Or do you mean to set it like this: python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 [D:\PD_Rt\OCR code\ocr-pytorch-master\ocr-pytorch-master\data1]? I tried this one and it shows the same error.

usage: train.py [-h] [--epoch EPOCH] [--batch BATCH] [--size SIZE] [--arch ARCH] [--aux_weight AUX_WEIGHT] [--n_class N_CLASS] [--lr LR] [--l2 L2] [--local_rank LOCAL_RANK] PATH
train.py: error: unrecognized arguments: code\ocr-pytorch-master\ocr-pytorch-master\data1]
(the same usage/error pair is printed once per spawned process)
Traceback (most recent call last):
  File "D:\Anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\Anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda\lib\site-packages\torch\distributed\launch.py", line 253, in <module>
    main()
  File "D:\Anaconda\lib\site-packages\torch\distributed\launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['D:\Anaconda\python.exe', '-u', 'train.py', '--local_rank=3', '--batch', '4', '[D:\PhD_Rahmat\OCR', 'code\ocr-pytorch-master\ocr-pytorch-master\data1]']' returned non-zero exit status 2.

rosinality commented 4 years ago

You can use it like this:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 D:\PD_Rt\OCR code\ocr-pytorch-master\ocr-pytorch-master\data1

ZshahRA commented 4 years ago

Hello rosinality, I tried this one as you suggested.

python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 D:\PhD_Rahmat\OCR code\ocr-pytorch-master\ocr-pytorch-master\data1

However, I get the following error. Could you please share your suggestions? Thanks in advance.

PS D:\PhD_Rahmat\OCR code\ocr-pytorch-master\ocr-pytorch-master> python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 D:\PhD_Rahmat\OCR code\ocr-pytorch-master\ocr-pytorch-master\data1\images




usage: train.py [-h] [--epoch EPOCH] [--batch BATCH] [--size SIZE] [--arch ARCH] [--aux_weight AUX_WEIGHT] [--n_class N_CLASS] [--lr LR] [--l2 L2] [--local_rank LOCAL_RANK] PATH
train.py: error: unrecognized arguments: code\ocr-pytorch-master\ocr-pytorch-master\data1\images
(the same usage/error pair is printed once per spawned process)
Traceback (most recent call last):
  File "D:\Anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\Anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda\lib\site-packages\torch\distributed\launch.py", line 253, in <module>
    main()
  File "D:\Anaconda\lib\site-packages\torch\distributed\launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['D:\Anaconda\python.exe', '-u', 'train.py', '--local_rank=3', '--batch', '4', 'D:\PhD_Rahmat\OCR', 'code\ocr-pytorch-master\ocr-pytorch-master\data1\images']' returned non-zero exit status 2.

rosinality commented 4 years ago

Ah, I didn't see that there were spaces in the path. You can use this:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 "D:\PD_Rt\OCR code\ocr-pytorch-master\ocr-pytorch-master\data1"
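The underlying issue is shell tokenization: train.py declares a single positional PATH argument, so an unquoted path containing a space reaches argparse as two separate tokens and the second one is rejected. Here is a minimal sketch of the parser, reconstructed from the usage string printed above (not the repository's exact code):

```python
import argparse

# Reconstructed from train.py's usage string; only the arguments relevant
# to the error are included here.
parser = argparse.ArgumentParser(prog="train.py")
parser.add_argument("--batch", type=int, default=16)
parser.add_argument("path", metavar="PATH")  # one positional dataset path

# Quoted in the shell, a path containing a space arrives as a single token:
args = parser.parse_args(["--batch", "4", r"D:\PhD_Rahmat\OCR code\data1"])
print(args.path)  # D:\PhD_Rahmat\OCR code\data1

# Unquoted, the shell splits it into two tokens and argparse bails out:
try:
    parser.parse_args(["--batch", "4", r"D:\PhD_Rahmat\OCR", r"code\data1"])
except SystemExit:
    print("unrecognized arguments")  # argparse prints usage + error, then exits
```

Double quotes around the whole path keep it as one argument in both PowerShell and cmd.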

ZshahRA commented 4 years ago

Hello rosinality, Thanks for your suggestions. It seems this one would work. python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 "D:\PhD_Rahmat\OCR code\ocr-pytorch-master\ocr-pytorch-master\data1"

But I guess my GPU driver is very old. I get the following error. Any suggestions?

PS D:\PhD_Rahmat\OCR code\ocr-pytorch-master\ocr-pytorch-master> python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 "D:\PhD_Rahmat\OCR code\ocr-pytorch-master\ocr-pytorch-master\data1"




Traceback (most recent call last):
  File "train.py", line 148, in <module>
    torch.cuda.set_device(args.local_rank)
  File "D:\Anaconda\lib\site-packages\torch\cuda\__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
  File "D:\Anaconda\lib\site-packages\torch\cuda\__init__.py", line 192, in _lazy_init
    _check_driver()
  File "D:\Anaconda\lib\site-packages\torch\cuda\__init__.py", line 111, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError: The NVIDIA driver on your system is too old (found version 9010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
(the same traceback is printed once per spawned process)
Traceback (most recent call last):
  File "D:\Anaconda\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\Anaconda\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\Anaconda\lib\site-packages\torch\distributed\launch.py", line 253, in <module>
    main()
  File "D:\Anaconda\lib\site-packages\torch\distributed\launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['D:\Anaconda\python.exe', '-u', 'train.py', '--local_rank=3', '--batch', '4', 'D:\PhD_Rahmat\OCR code\ocr-pytorch-master\ocr-pytorch-master\data1']' returned non-zero exit status 1.

rosinality commented 4 years ago

You could either upgrade your GPU drivers or downgrade PyTorch. But I think this code will not work on older PyTorch without many modifications...

ZshahRA commented 4 years ago

Thank you so much rosinality, I will use another GPU computer soon.

ZshahRA commented 4 years ago

Hello rosinality, I am using another GPU now. Could you please check the following error again? Thanks.

(alir) D:\ocr-pytorch-master>python -m torch.distributed.launch --nproc_per_node=4 --master_port=8890 train.py --batch 4 "D:\ocr-pytorch-master\data1s"




THCudaCheck FAIL file=..\torch\csrc\cuda\Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 148, in <module>
    torch.cuda.set_device(args.local_rank)
  File "C:\Users\alir3459\.conda\envs\alir\lib\site-packages\torch\cuda\__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at ..\torch\csrc\cuda\Module.cpp:37
(several of the spawned processes fail this way; another one instead fails with:)
Traceback (most recent call last):
  File "train.py", line 149, in <module>
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
Traceback (most recent call last):
  File "C:\Users\alir3459\.conda\envs\alir\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\alir3459\.conda\envs\alir\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\alir3459\.conda\envs\alir\lib\site-packages\torch\distributed\launch.py", line 253, in <module>
    main()
  File "C:\Users\alir3459\.conda\envs\alir\lib\site-packages\torch\distributed\launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\alir3459\.conda\envs\alir\python.exe', '-u', 'train.py', '--local_rank=3', '--batch', '4', 'D:\ocr-pytorch-master\data1s']' returned non-zero exit status 1.

rosinality commented 4 years ago

Using --nproc_per_node=4 requires 4 GPUs. I think that could be the problem.
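Some background on why this fails: torch.distributed.launch spawns --nproc_per_node worker processes and hands each a distinct --local_rank (0, 1, 2, 3), and train.py calls torch.cuda.set_device(local_rank). Any rank without a matching GPU raises the "invalid device ordinal" error seen above. A small illustrative helper (hypothetical, not part of the repo) that captures the constraint:

```python
def max_procs_per_node(requested: int, available: int) -> int:
    """Clamp --nproc_per_node to the number of visible GPUs.

    torch.distributed.launch passes ranks 0..nproc-1 to the workers, and each
    worker calls torch.cuda.set_device(rank); any rank >= available fails
    with 'invalid device ordinal'.
    """
    if available == 0:
        raise RuntimeError("no CUDA devices visible")
    return min(requested, available)


# Asking for 4 processes on a 2-GPU machine must be clamped to 2:
print(max_procs_per_node(4, 2))  # 2
# In a real script you would pass torch.cuda.device_count() as `available`.
```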

ZshahRA commented 4 years ago

Hello, I have 2 GPUs, so I tried the following command, but I get the same error. Is there any other place where I have to make a change? Thanks.

python -m torch.distributed.launch --nproc_per_node=2 --master_port=8890 train.py --batch 2 "D:\ocr-pytorch-master\data1\train"

(alir) D:\ocr-pytorch-master>python -m torch.distributed.launch --nproc_per_node=2 --master_port=8890 train.py --batch 2 "D:\ocr-pytorch-master\data1\train"




Traceback (most recent call last):
  File "train.py", line 149, in <module>
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
THCudaCheck FAIL file=..\torch\csrc\cuda\Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 148, in <module>
    torch.cuda.set_device(args.local_rank)
  File "C:\Users\alir3459\.conda\envs\alir\lib\site-packages\torch\cuda\__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at ..\torch\csrc\cuda\Module.cpp:37
Traceback (most recent call last):
  File "C:\Users\alir3459\.conda\envs\alir\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\alir3459\.conda\envs\alir\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\alir3459\.conda\envs\alir\lib\site-packages\torch\distributed\launch.py", line 253, in <module>
    main()
  File "C:\Users\alir3459\.conda\envs\alir\lib\site-packages\torch\distributed\launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\alir3459\.conda\envs\alir\python.exe', '-u', 'train.py', '--local_rank=1', '--batch', '2', 'D:\ocr-pytorch-master\data1\train']' returned non-zero exit status 1.

rosinality commented 4 years ago

I found that PyTorch distributed only supports Linux: https://pytorch.org/docs/stable/distributed.html Sorry.

Maybe you could use DataParallel, but you will need to change the sync batch norm.
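As a sketch of that workaround: nn.DataParallel splits each batch across GPUs inside a single process, so it needs no launcher and no init_process_group, and plain BatchNorm2d stands in for synchronized batch norm (sync BN needs a process group, which is exactly what was unavailable on Windows). The tiny model below is a stand-in, not the repository's network:

```python
import torch
from torch import nn

# Stand-in model; the real OCR network would be wrapped the same way,
# with its sync batch norm layers swapped for ordinary nn.BatchNorm2d.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),  # plain BN: per-GPU statistics, no process group needed
    nn.ReLU(),
)

if torch.cuda.is_available():
    model = model.cuda()
    if torch.cuda.device_count() > 1:
        # One process, batch split across all visible GPUs; no --local_rank,
        # no torch.distributed.launch required.
        model = nn.DataParallel(model)

x = torch.randn(2, 3, 32, 32)
if torch.cuda.is_available():
    x = x.cuda()
out = model(x)
print(tuple(out.shape))  # (2, 8, 32, 32)
```

The trade-off is that DataParallel replicates the model every forward pass and gathers outputs on GPU 0, so it is slower than the distributed launcher, but it runs on platforms where torch.distributed is unavailable.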

ZshahRA commented 4 years ago

Thank you rosinality for your help. I will try to find where I need to make the change. However, I will probably use Linux next week. Regards

welleast commented 4 years ago

The reproduction of OCR with HRNet is available at https://github.com/HRNet/HRNet-Semantic-Segmentation/tree/HRNet-OCR.

R1234A commented 3 years ago

Hi,

I am using the command for training as :

python train.py -m torch.distributed.launch --nproc_per_node 2 --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5x.pt

But I am getting the error as :

train.py: error: unrecognized arguments: -m torch.distributed.launch --nproc_per_node 2

Can you please help with this, thanks in advance.

rosinality commented 3 years ago

@R1234A You need to use python -m torch.distributed.launch train.py. The -m torch.distributed.launch part must come before the script name; everything placed after train.py is passed to train.py itself, which is why argparse reports those flags as unrecognized.