huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

wandb does not exit when the training process exits unexpectedly #3059

Open suixin1424 opened 3 weeks ago

suixin1424 commented 3 weeks ago

System Info

- `Accelerate` version: 0.33.0
- Platform: Linux-4.15.0-76-generic-x86_64-with-glibc2.27
- `accelerate` bash location: /home/zhuyiming/anaconda3/envs/mix/bin/accelerate
- Python version: 3.10.0
- Numpy version: 1.24.3
- PyTorch version (GPU?): 1.12.1 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 503.57 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

  1. Start a training run that logs with wandb, as in the sketch below.
  2. When the process exits unexpectedly because of an OOM error, wandb keeps printing output continuously instead of shutting down (see the attached screenshot).
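
A minimal sketch of the kind of setup that reproduces this (not the reporter's actual script; the project name, layer size, and batch size are made up solely to force an OOM under the multi-GPU config above):

```python
# repro_sketch.py -- minimal sketch, not the reporter's actual training script.
# Assumes `accelerate` (multi-GPU config as above) and `wandb` are installed;
# the layer and batch sizes are invented just to exhaust a 24 GB RTX 3090.
import torch
from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers("oom-repro")  # project name is illustrative

model = torch.nn.Linear(50_000, 50_000)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

for step in range(1_000):
    # Oversized batch so some rank eventually hits CUDA OOM and dies; the
    # wandb process it spawned keeps printing instead of shutting down.
    x = torch.randn(4096, 50_000, device=accelerator.device)
    loss = model(x).pow(2).mean()
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    accelerator.log({"loss": loss.item()}, step=step)

accelerator.end_training()  # never reached when a rank crashes
```

Launched with `accelerate launch repro_sketch.py` under the config above; once a rank is killed by the OOM, the wandb background process is left behind and keeps emitting output.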

Expected behavior

wandb should shut down (finish its run) when the training process exits unexpectedly.

hkproj commented 1 week ago

+1

muellerzr commented 1 week ago

Not entirely sure that's something we can do; it's more of a wandb issue.
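
A possible user-side mitigation in the meantime (a sketch only, not an accelerate feature; `train` here is a hypothetical training function) is to close the wandb run explicitly when the training loop raises:

```python
# Sketch of a user-side guard: finish the wandb run even if training raises.
# Note: if a rank is killed outright (e.g. SIGKILL from the OOM killer) rather
# than raising a Python exception, this except block never runs, so it only
# helps for recoverable errors such as torch's CUDA out-of-memory RuntimeError.
import wandb
from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers("my-project")  # project name is illustrative

try:
    train(accelerator)  # hypothetical training function
except Exception:
    # Mark the run as crashed instead of leaving it open and printing forever.
    if accelerator.is_main_process and wandb.run is not None:
        wandb.finish(exit_code=1)
    raise
else:
    accelerator.end_training()
```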