`accelerate launch` returns exit code `0` on error

salieri commented 1 year ago

System Info

- `Accelerate` version: 0.15.0
- Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
- Python version: 3.10.9
- Numpy version: 1.24.1
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: bf16
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 6
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no
        - tpu_name: None
        - tpu_zone: None
        - command_file: None
        - commands: None

Information

[X] The official example scripts

Tasks

[X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction

Intentionally configure a batch size that is too large to fit in your GPU's memory, e.g.:

accelerate launch train_text_to_image.py --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 --dataset_name=lambdalabs/sd-pokemon-diffusers --resolution=512 --train_batch_size=1024
echo $?

# `$?` will be '0'

Expected behavior

On error, CLI commands should return a nonzero exit code.
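
To illustrate why this matters, here is a minimal sketch of how a calling script or CI job typically relies on the exit code. The child commands are stand-in Python one-liners rather than `accelerate launch`, and the helper name `run_step` is hypothetical.

```python
import subprocess
import sys


def run_step(cmd: list[str]) -> bool:
    """Return True if the command succeeded, judged only by its exit code."""
    try:
        # check=True raises CalledProcessError on any nonzero exit code.
        subprocess.run(cmd, check=True, capture_output=True)
    except subprocess.CalledProcessError as err:
        print(f"step failed with exit code {err.returncode}")
        return False
    return True


# Stand-in commands (assumptions for illustration only):
# one fails with a nonzero exit code, the other prints an error but exits 0.
honest_failure = [sys.executable, "-c", "import sys; sys.exit(1)"]
silent_failure = [sys.executable, "-c", "print('error: out of memory')"]

honest_detected = not run_step(honest_failure)
silent_detected = not run_step(silent_failure)
```

In this sketch the caller notices only the first failure. A command that reports an error in its output but still exits with `0` looks like a success to any pipeline that checks `$?`.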

muellerzr commented 1 year ago

@salieri could you provide a little more information for me please?

I need to know:

- What kind of GPU you are running on
- Which repo has train_text_to_image.py :) (e.g. is it diffusers?)

Thanks!

SebastianEndrikat commented 1 year ago

A different example that should also exit non-zero:

Singularity> accelerate launch /not_a_file.py; echo "Exit code is: $?"
/anaconda/bin/python3.9: can't open file '/not_a_file.py': [Errno 2] No such file or directory
/anaconda/bin/python3.9: can't open file '/not_a_file.py': [Errno 2] No such file or directory
/anaconda/bin/python3.9: can't open file '/not_a_file.py': [Errno 2] No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 223625) of binary: /anaconda/bin/python3.9
Exit code is: 0

cat /anaconda/lib/python3.9/site-packages/accelerate/__init__.py | grep version
__version__ = "0.14.0"

salieri commented 1 year ago

@muellerzr Sorry, missed your question! I noticed this on A100s running the runpod/pytorch Docker image, but I believe I've seen it on my RTX4090/Windows setup too.

train_text_to_image.py

psobot commented 1 year ago

+1 to this. It looks like the code in `launch.py` uses `except Exception`, logs exceptions as they occur, but does not propagate them upwards:

https://github.com/huggingface/accelerate/blob/b34db0b98743692993288979822f9eabe6098008/src/accelerate/commands/launch.py#L680-L689
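
A minimal sketch of the pattern described here, assuming a simplified launcher. The two scripts below are hypothetical stand-ins and are not the actual code from `launch.py`. The first logs the exception and falls through, the second logs it and then exits with a nonzero code.

```python
import subprocess
import sys
import textwrap

# Hypothetical launcher that swallows the error: it is logged, never re-raised.
SWALLOWING = textwrap.dedent("""
    import logging

    def worker():
        raise RuntimeError("simulated failure")

    try:
        worker()
    except Exception as exc:
        logging.error("worker failed: %s", exc)
""")

# Hypothetical launcher that propagates the failure to the shell.
PROPAGATING = textwrap.dedent("""
    import logging
    import sys

    def worker():
        raise RuntimeError("simulated failure")

    try:
        worker()
    except Exception as exc:
        logging.error("worker failed: %s", exc)
        sys.exit(1)
""")


def exit_code(source: str) -> int:
    """Run the given source in a fresh interpreter and return its exit code."""
    result = subprocess.run([sys.executable, "-c", source], capture_output=True)
    return result.returncode


swallowed = exit_code(SWALLOWING)
propagated = exit_code(PROPAGATING)
print(swallowed, propagated)  # prints "0 1"
```

Both variants log the same error message. Only the second one lets the shell see the failure, which is the behavior requested in this issue.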
