huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

accelerate launch stuck forever #3073

Open wd255 opened 1 month ago

wd255 commented 1 month ago

System Info

Running accelerate launch on a Linux server with 10 RTX 4090 GPUs. Environment details:
- `Accelerate` version: 0.33.0
- Platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /root/anaconda3/envs/new_magvit/bin/accelerate
- Python version: 3.12.4
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 629.62 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: True
        - num_processes: 10
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1,2,3,4,5,6,7,8,9
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

  1. conda create -n my_env
  2. conda activate my_env
  3. conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
  4. conda install -c conda-forge accelerate
  5. Run accelerate launch test.py. Whether test.py exists does not matter, because the command hangs before it reaches the point where the file is used. That said, the test.py I used is:
    
    import torch
    import torch.nn as nn
    from accelerate import Accelerator

    if __name__ == "__main__":
        accelerator = Accelerator()
        model = nn.Conv2d(10, 20, 3, 1, 1)
        model = accelerator.prepare(model)

Expected behavior

Expected behavior: if test.py exists, it gets executed.
Actual behavior: the program hangs forever after running accelerate launch, without printing anything. Ctrl-C cannot kill it, so I have to suspend it with Ctrl-Z. After that, something is still holding port 29500, so I have to kill it manually before the next run.
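
To check whether a leftover process is still holding port 29500 before relaunching, something like the following could help. This is only a minimal sketch using the Python standard library. The helper name `port_in_use` is hypothetical and not part of accelerate, and it assumes the leftover process is listening on localhost.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    port = 29500  # the port reported in this issue
    state = "still in use" if port_in_use(port) else "free"
    print(f"Port {port} is {state}")
```

If the port is reported as still in use, the suspended launcher (or one of its child processes) most likely has to be killed before the next accelerate launch can bind to it.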

github-actions[bot] commented 23 hours ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.