Closed by cassandra-t-ye 8 months ago
Hi, I have the same issue with train.py. Did you fix it?
I have the same issue ...
I used this thread to help me debug: https://discuss.huggingface.co/t/torch-distributed-elastic-multiprocessing-errors-childfailederror/28242/11
I am trying to train the denoising model with the provided arguments on one GPU:
python -m torch.distributed.launch --nproc_per_node=1 --master_port=4321 train.py -opt options/train/SIDD/NAFNet-width64.yml --launcher pytorch
However, I keep getting this error and I'm not sure why this is happening.
usage: train.py [-h] -opt OPT [--launcher {none,pytorch,slurm}] [--input_path INPUT_PATH] [--output_path OUTPUT_PATH] 0
train.py: error: the following arguments are required: 0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 45617) of binary: /home/gridsan/tye/.conda/envs/naf/bin/python
Traceback (most recent call last):
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
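A note for anyone reading this later: exit code 2 is the status argparse uses when it rejects the command line, so the child process most likely died while parsing its arguments, before any training code ran. One common cause with torch.distributed.launch is the local-rank argument the launcher appends to the script's arguments, which is spelled `--local_rank` in older PyTorch releases and `--local-rank` in newer ones. The log above does not confirm that this is what happened here, so treat the following as a minimal sketch under that assumption. `parse_options` is a hypothetical stand-in for the parser in train.py, rebuilt only from the options visible in the usage line, and the `--local_rank` option is an addition that is not shown there.

```python
import argparse
import os


def parse_options(argv=None):
    """Hypothetical stand-in for the argument parser in train.py."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("-opt", type=str, required=True,
                        help="path to the option YAML file")
    parser.add_argument("--launcher", choices=["none", "pytorch", "slurm"],
                        default="none")
    # Assumption: accept both spellings the launcher may inject, and fall
    # back to the LOCAL_RANK environment variable when neither is passed.
    parser.add_argument("--local_rank", "--local-rank", dest="local_rank",
                        type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    parser.add_argument("--input_path", type=str, default=None)
    parser.add_argument("--output_path", type=str, default=None)
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Simulates the argument list the script might receive from the launcher.
    args = parse_options([
        "--local-rank=0",
        "-opt", "options/train/SIDD/NAFNet-width64.yml",
        "--launcher", "pytorch",
    ])
    print(args.local_rank, args.launcher, args.opt)
```

Reading the rank from `LOCAL_RANK` as a fallback keeps the same parser usable when the launcher exports the rank through the environment instead of passing it as an argument.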