huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Can torchrun capture the contents of accelerate config #2519

Closed jianxing1 closed 4 months ago

jianxing1 commented 6 months ago

System Info

- `Accelerate` version: 0.27.2
- Platform: Linux-5.10.0-20-amd64-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.2.0a0+6a974be (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 377.80 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: bf16
        - use_cpu: False
        - debug: True
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 2
        - gpu_ids: all
        - main_process_ip: 172.16.128.141
        - main_process_port: 2024
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

Hello,

When I ran the official example nlp_example.py, the README.md mentioned that it can be launched with either accelerate launch or torchrun.

After going through accelerate config and setting the mixed precision to bf16, I ran accelerate launch and it reported that the precision was indeed bf16.

But when I launch the same script with torchrun, the mixed precision is no.

Does torchrun automatically read the YAML produced by accelerate config?

If not, how can I make torchrun use the configuration information from accelerate config?
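For reference, this is a minimal way to check the effective precision inside the script (a sketch of my own, not the exact code from nlp_example.py, assuming the standard accelerator.mixed_precision attribute):

```python
from accelerate import Accelerator

accelerator = Accelerator()
# Under `accelerate launch` this prints "bf16"; under plain torchrun it prints "no",
# since torchrun does not export the environment variables derived from the config.
print(accelerator.mixed_precision)
```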

This is my accelerate config process:

accelerate config
-------------------------------------------------------------------------------In which compute environment are you running?
This machine                                                                   
-------------------------------------------------------------------------------Which type of machine are you using?                                           
multi-GPU                                                                      
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2                                                                    
-------------------------------------------------------------------------------What is the rank of this machine?                                              
0                                                                              
What is the IP address of the machine that will host the main process? 172.16.128.141                                                                         
What is the port you will use to communicate with the main process? 29500
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
-------------------------------------------------------------------------------Do you wish to use FP16 or BF16 (mixed precision)?
bf16                       

Expected behavior

I would expect the torchrun command to pick up the parameters from the accelerate config, including mixed precision.

muellerzr commented 6 months ago

No, it only handles spawning the processes. You'd need to pass the other arguments into the Accelerator yourself using the related classes and arguments (DeepSpeedPlugin for DeepSpeed, mixed_precision="bf16" for your case here)
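As a minimal sketch of what that looks like in the training script (mixed precision only; a DeepSpeedPlugin or other plugin would be passed the same way):

```python
from accelerate import Accelerator

# When launching with torchrun, the accelerate config YAML is not read,
# so pass the options explicitly instead of relying on `accelerate config`:
accelerator = Accelerator(mixed_precision="bf16")
```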

muellerzr commented 6 months ago

Is there a reason to not just use accelerate launch?

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.