train.py: error: the following arguments are required: 0

cassandra-t-ye commented 1 year ago

I am trying to train the denoising model on one GPU with the provided command:

python -m torch.distributed.launch --nproc_per_node=1 --master_port=4321 train.py -opt options/train/SIDD/NAFNet-width64.yml --launcher pytorch

However, I keep getting this error and I'm not sure why this is happening.

usage: train.py [-h] -opt OPT [--launcher {none,pytorch,slurm}] [--input_path INPUT_PATH] [--output_path OUTPUT_PATH] 0
train.py: error: the following arguments are required: 0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 45617) of binary: /home/gridsan/tye/.conda/envs/naf/bin/python
Traceback (most recent call last):
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gridsan/tye/.conda/envs/naf/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

xl3283 commented 9 months ago

Hi, I have the same issue with train.py. Did you fix it?

Hambbuk commented 8 months ago

I have the same issue ...

cassandra-t-ye commented 8 months ago

I used this thread to help me debug: https://discuss.huggingface.co/t/torch-distributed-elastic-multiprocessing-errors-childfailederror/28242/11
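For anyone landing here with the same message, a possibly relevant detail: the usage line in the error shows a required positional argument named `0`, and `torch.distributed.launch` appends a local-rank flag (`--local_rank=N`, or `--local-rank=N` in newer PyTorch releases) to the script's arguments unless `--use_env` is given. Whether that mismatch is the actual cause here is an assumption, not something confirmed in this thread. The sketch below is not the repository's parser. It is a hypothetical, minimal argparse setup (the `build_parser` and `parse_args` names are made up for illustration) that accepts either spelling of the flag and falls back to the `LOCAL_RANK` environment variable:

```python
import argparse
import os


def build_parser():
    # Hypothetical parser mirroring the options visible in the usage line above.
    parser = argparse.ArgumentParser()
    parser.add_argument('-opt', type=str, required=True,
                        help='Path to the option YAML file.')
    parser.add_argument('--launcher', choices=['none', 'pytorch', 'slurm'],
                        default='none')
    # Accept both spellings the launcher may pass; default to the LOCAL_RANK
    # environment variable (set by torchrun / --use_env), else 0.
    parser.add_argument('--local_rank', '--local-rank', dest='local_rank',
                        type=int,
                        default=int(os.environ.get('LOCAL_RANK', 0)))
    return parser


def parse_args(argv=None):
    # parse_known_args ignores extra flags instead of exiting with code 2.
    args, _unknown = build_parser().parse_known_args(argv)
    return args


if __name__ == '__main__':
    args = parse_args()
    print(args.opt, args.launcher, args.local_rank)
```

With a parser shaped like this, the local rank is an optional flag rather than a required positional, so the command from the first post would parse regardless of which spelling the launcher injects.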

megvii-research / NAFNet

train.py: error: the following arguments are required: 0 #112