Closed: jayxsinha closed this issue 1 month ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@jayxsinha I have the same error:
raise ConnectionError(
ConnectionError: Tried to launch distributed communication on port 29500, but another process is utilizing it. Please specify a different port (such as using the --main_process_port flag or specifying a different main_process_port in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to 0.
I am trying to run with:
accelerate launch .\acc_torch\multi_gpu_accelerate.py --main_process_port 29600
I have also tried --main_process_port=0 and --main_process_port=29000, but I get the same error.
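One thing worth checking (an assumption about the cause, not something confirmed in this thread): accelerate launch treats everything after the script path as arguments for the script itself, so a --main_process_port placed after multi_gpu_accelerate.py may never reach the launcher. A minimal sketch of the invocation with the flag moved before the script:

accelerate launch --main_process_port 29600 .\acc_torch\multi_gpu_accelerate.py

and, to let accelerate pick the next open port automatically on a single node:

accelerate launch --main_process_port 0 .\acc_torch\multi_gpu_accelerate.py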
System Info
Information
Tasks
A no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Please review the attached Sweep YAML and Accelerate config YAML to get a look at this.
Expected behavior
accelerate_2_bf16 .txt - Accelerate config file
slurm_23908547-4.txt - Log from the job
sweep copy.txt - Sweep YAML
As pointed out in the exception:

ConnectionError: Tried to launch distributed communication on port 29144, but another process is utilizing it. Please specify a different port (such as using the --main_process_port flag or specifying a different main_process_port in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to 0.

I set main_process_port to 0, but it still shows this error.
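Since this job runs through SLURM as a sweep, a plausible cause (my assumption, not confirmed in this thread) is that several sweep jobs land on the same node and race for the same rendezvous port, or that the main_process_port: 0 line in the config YAML is being overridden elsewhere (for example by the sweep or the launch command). A minimal sketch of a workaround in the submit script: grab a currently free TCP port and pass it to accelerate explicitly (your_script.py is a placeholder; there is still a small window between checking the port and accelerate binding it):

# Pick a free TCP port for this job, then hand it to the launcher.
MAIN_PORT=$(python -c "import socket; s=socket.socket(); s.bind(('', 0)); print(s.getsockname()[1]); s.close()")
accelerate launch --main_process_port "$MAIN_PORT" your_script.py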