huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Support for Torch XLA Dynamo Backend #2870

Closed · johnsutor closed this 14 minutes ago

johnsutor commented 2 weeks ago

System Info

litepod 8-core single host TPU VM running Ubuntu 22.04
Accelerate 0.31.0

Reproduction

I would like to use the torchxla_trace_once backend as discussed here. However, when I attempt to run the official training scripts for causal language modeling with the following command

accelerate launch --tpu --num_processes 8 \
--main_training_function main \
--mixed_precision bf16 \
--dynamo_backend aot_torchxla_trace_once \
training/src/run_clm.py \
--from_huggingface \
....

the launcher does not accept aot_torchxla_trace_once as a supported backend:

accelerate <command> [<args>] launch: error: argument --dynamo_backend/--dynamo-backend: invalid choice: 'aot_torchxla_trace_once' (choose from 'no', 'eager', 'aot_eager', 'inductor', 'aot_ts_nvfuser', 'nvprims_nvfuser', 'cudagraphs', 'ofi', 'fx2trt', 'onnxrt', 'tensorrt', 'ipex', 'tvm')
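The rejection happens at argument-parsing time: the launcher only accepts values drawn from its fixed list of dynamo backends. As a minimal, hedged reproduction (this is a trimmed-down stand-in, not Accelerate's actual source; the enum members and parser shown here are illustrative), the behavior looks like this:

```python
import argparse
from enum import Enum

# Hypothetical, trimmed-down stand-in for the library's list of
# supported dynamo backends; the real list lives in the Accelerate source.
class DynamoBackend(Enum):
    NO = "no"
    EAGER = "eager"
    INDUCTOR = "inductor"

parser = argparse.ArgumentParser(prog="accelerate launch")
parser.add_argument(
    "--dynamo_backend",
    "--dynamo-backend",
    choices=[b.value for b in DynamoBackend],
    default="no",
)

# Any value missing from the enum is rejected before launch even starts,
# which is exactly the error reported above.
try:
    parser.parse_args(["--dynamo_backend", "aot_torchxla_trace_once"])
except SystemExit:
    print("rejected: 'aot_torchxla_trace_once' is not in the choices list")
```

So supporting the new backend is not a TPU-runtime question at this point; the string simply has to exist in the backend list before the launcher will pass it through.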

I would be happy to contribute this feature if it is something that is desired in Accelerate or already on the roadmap.

Expected behavior

Support for the aot_torchxla_trace_once XLA backend.

SunMarc commented 1 week ago

Hi @johnsutor, thanks for raising the issue. Feel free to submit a PR to add this new backend here and here after testing it!
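For a sense of scale, the change being suggested largely amounts to adding one member to the backend enum, which then flows into the CLI's choice list. A minimal sketch follows; the enum shown is a stand-in and the member name and value are assumptions for illustration, not taken from any merged PR:

```python
from enum import Enum

# Sketch of extending a DynamoBackend-style enum with the XLA backend.
# Member names/values here are illustrative assumptions.
class DynamoBackend(Enum):
    NO = "no"
    EAGER = "eager"
    INDUCTOR = "inductor"
    # The new entry proposed in this issue:
    AOT_TORCHXLA_TRACE_ONCE = "aot_torchxla_trace_once"

# The launcher's choice list is derived from the enum, so once the
# member exists, --dynamo_backend aot_torchxla_trace_once would pass
# argument validation.
choices = [b.value for b in DynamoBackend]
print(choices)
```

The remaining work would be verifying on a TPU host that torch_xla actually registers the backend with dynamo, which is the testing SunMarc asks for before merging.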