huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

How can mpirun directly load the accelerate config.yaml file? #3090

Closed kevinsummer219 closed 2 weeks ago

kevinsummer219 commented 1 month ago

System Info

I have an accelerate config.yaml and I want to submit my training code using mpirun. Thanks!

Reproduction

Can you give some examples? Thanks very much!

Expected behavior

mpirun should be able to load the accelerate config.yaml.

kevinsummer219 commented 1 month ago

I can train with the command `accelerate launch --config_file config.yaml`. I want to change it to `mpirun ........ accelerate launch --config_file config.yaml`. How can I modify this command? Or is there a solution? Thanks very much!

muellerzr commented 1 month ago

If you run `accelerate config` and select CPU, it will give you an option to configure your config.yaml to call mpirun when doing `accelerate launch`. (When selecting multi-CPU we always use mpirun, IIRC.)
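For reference, a rough sketch of the kind of multi-CPU config this path produces. The `mpirun_config` keys follow the accelerate docs on launching with MPI, but verify them against the file your version of `accelerate config` actually writes:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_CPU
mixed_precision: 'no'
machine_rank: 0
main_process_ip: xx.xx.xxx.xxx
main_process_port: 29500
# Written when you ask accelerate config to drive mpirun for you;
# key names assumed from the accelerate MPI docs -- check your version.
mpirun_config:
  mpirun_ccl: '1'
  mpirun_hostfile: /path/to/hostfile
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: true
```

With a config like this, a plain `accelerate launch --config_file config.yaml train.py` is supposed to build and run the mpirun command itself.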

kevinsummer219 commented 1 month ago

Thanks for your reply! I will use multi-GPU. Is there a solution? Thanks!

muellerzr commented 1 month ago

What's the config you're trying to use?

kevinsummer219 commented 1 month ago

I use both a single node and two nodes, thanks.

Single node config.yaml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Two node config.multi.yaml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_process_ip: xx.xx.xxx.xxx
main_process_port: 6000
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

kevinsummer219 commented 1 month ago

I am also using accelerate with DeepSpeed via a ds_config.yaml, and this parameter cannot be configured in the mpirun runtime environment either.

Single node ds_config.yaml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /deepspeed_config/zs1_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

kevinsummer219 commented 1 month ago

I use this mpirun training command:

```
mpirun --allow-run-as-root -np 8 -H xx.xx.xx.xx:8 \
  -x MASTER_ADDR=xx.xx.xx.xx -x MASTER_PORT=1234 -x PATH \
  -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib \
  python train.py
```

However, I want to add the accelerate config.yaml to the mpirun command. How can I modify it? E.g.:

```
mpirun --allow-run-as-root -np 8 -H xx.xx.xx.xx:8 \
  -x MASTER_ADDR=xx.xx.xx.xx -x MASTER_PORT=1234 -x PATH \
  -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib \
  accelerate launch --config_file config.yaml train.py
```

muellerzr commented 1 month ago

Not sure on that one as I'm not too familiar with mpirun; however, for the first one you can manually pass mixed_precision to the Accelerator(), and for the second you can manually pass an accelerate.utils.DeepSpeedPlugin to the Accelerator() as well.
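A minimal sketch of that suggestion, assuming the public `Accelerator` and `DeepSpeedPlugin` APIs; the JSON path is the one from the ds_config.yaml above:

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Multi-GPU case: set mixed precision in code instead of in config.yaml.
accelerator = Accelerator(mixed_precision="fp16")

# DeepSpeed case: point the plugin at the same JSON file that the
# deepspeed_config_file key referenced in ds_config.yaml.
ds_plugin = DeepSpeedPlugin(hf_ds_config="/deepspeed_config/zs1_config.json")
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
```

Either way the script no longer depends on `--config_file`, so it can be started by whatever launches the processes, including mpirun.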

kevinsummer219 commented 1 month ago

> Not sure on that one as I'm not too familiar with mpirun; however, for the first one you can manually pass mixed_precision to the Accelerator(), and for the second you can manually pass an accelerate.utils.DeepSpeedPlugin to the Accelerator() as well.

Thanks for your reply! That is to say, it is currently not possible to directly load the accelerate config.yaml using mpirun, correct? If I can manually pass the accelerate config to the Accelerator(), can you share more information?

Single node config.yaml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

kevinsummer219 commented 1 month ago

> Not sure on that one as I'm not too familiar with mpirun; however, for the first one you can manually pass mixed_precision to the Accelerator(), and for the second you can manually pass an accelerate.utils.DeepSpeedPlugin to the Accelerator() as well.

Can distributed_type and num_processes be passed to the Accelerator()? How are these parameters picked up and applied when submitting tasks with mpirun? Or do these parameters not need to be set by default (apart from mixed_precision and gradient_accumulation_steps)? Thanks very much!
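For reference, a minimal sketch of one way to make an mpirun-launched script look like a torch.distributed one, so Accelerator() can pick the setup up from the environment. It assumes Open MPI's `OMPI_COMM_WORLD_*` variables and the standard torch.distributed variables that accelerate reads; treat the mapping as an assumption to verify on your cluster:

```python
import os
from accelerate import Accelerator

# Under mpirun there is no `accelerate launch`, so the torch.distributed
# environment variables must be set in each process. Open MPI exposes
# rank and world size via OMPI_COMM_WORLD_* (assumption: Open MPI;
# other MPI implementations use different variable names).
os.environ.setdefault("RANK", os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
os.environ.setdefault("LOCAL_RANK", os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
# MASTER_ADDR and MASTER_PORT are already exported via `mpirun -x ...`.

# num_processes is then implied by `mpirun -np`, and the distributed type
# is inferred from the environment; code-level options such as
# mixed_precision still have to be passed explicitly.
accelerator = Accelerator(mixed_precision="fp16")
```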

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.