huggingface / Google-Cloud-Containers

Including Hugging Face Deep Learning Containers for Google Cloud
Apache License 2.0
112 stars, 10 forks

Add example with PyTorch_XLA TPU DLC #17

Closed: shub-kris closed this 6 months ago

shub-kris commented 7 months ago

This PR adds an example for our PyTorch TPU container. The README will be updated once the DLCs are released. For now, it lists the steps I followed to build and test the container.

The example trains BERT for emotion classification. It is based on a pytorch-xla test.

shub-kris commented 6 months ago

I have added TRL and PEFT, and used the Dolly-15k dataset.

With the setup described in the README, I was able to run the training in 2 minutes and 30 seconds:
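As a rough illustration of what using Dolly-15k for causal language modeling involves, the sketch below turns one record into a single training string. The field names `instruction`, `context`, and `response` are those of the public Dolly-15k dataset. The function name `format_dolly_example` and the prompt template are assumptions made for illustration, and are not necessarily what the script in this PR does.

```python
def format_dolly_example(example: dict) -> str:
    """Join one Dolly-15k record into a single prompt string.

    The "### ..." section headers are a hypothetical template chosen
    for this sketch; the PR's script may format records differently.
    """
    parts = [f"### Instruction\n{example['instruction']}"]

    # Many Dolly-15k records have an empty context; skip the section then.
    context = example.get("context", "")
    if context:
        parts.append(f"### Context\n{context}")

    parts.append(f"### Response\n{example['response']}")
    return "\n\n".join(parts)


# Made-up record, used only to show the shape of the output.
sample = {
    "instruction": "Name a primary color.",
    "context": "",
    "response": "Red.",
}
print(format_dolly_example(sample))
```

A function of this shape is typically what gets passed to a supervised fine-tuning trainer as its formatting step, so that each record becomes one text sequence.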

cd /workspace
python google-partnership/Google-Cloud-Containers/examples/google-cloud-tpu-vm/causal-language-modeling/peft_lora_trl_dolly_clm.py \
    --model_id facebook/opt-350m \
    --num_epochs 3 \
    --train_batch_size 8 \
    --num_cores 8 \
    --lr 3e-4
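For readers who want to see how flags like these are usually consumed, here is a minimal sketch of an argument parser that accepts the same five options. The flag names come from the command above. The types, defaults, and the `build_parser` helper are assumptions for illustration, not the actual contents of `peft_lora_trl_dolly_clm.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the command above."""
    parser = argparse.ArgumentParser(
        description="Illustrative CLI for a LoRA fine-tuning script"
    )
    # Defaults below simply repeat the values from the command; they are assumed.
    parser.add_argument("--model_id", type=str, default="facebook/opt-350m")
    parser.add_argument("--num_epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--num_cores", type=int, default=8)
    parser.add_argument("--lr", type=float, default=3e-4)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# behaves the same no matter how it is launched.
args = build_parser().parse_args(
    ["--model_id", "facebook/opt-350m", "--num_epochs", "3", "--lr", "3e-4"]
)
print(args.model_id, args.num_epochs, args.train_batch_size, args.num_cores, args.lr)
```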

@philschmid running with Llama 7B will require a bigger machine, which I am currently testing, because it runs out of memory (OOM) on a v5-litepod-8 TPU.

So for now, we can merge this PR along with the Dockerfile from the other PR: https://github.com/huggingface/Google-Cloud-Containers/pull/14

I will open a separate PR with a Llama 7B example, since it requires a multi-host TPU VM (v5-litepod-16) and the steps to run it are different.
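A back-of-the-envelope estimate helps show why a 7B-parameter model is tight on a single-host setup. The sketch below only computes the memory needed to hold the weights. It assumes a nominal 7e9 parameters, 16 GiB of HBM per chip, and one full copy of the weights per core. All three are assumptions made for this illustration, not figures from this thread, and the result is not a diagnosis of the OOM reported here.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


NUM_PARAMS = 7e9          # nominal "7B" parameter count (assumed)
HBM_PER_CHIP_GIB = 16.0   # assumed per-chip HBM; check your accelerator's spec

for dtype_name, nbytes in [("bfloat16", 2), ("float32", 4)]:
    needed = weight_memory_gib(NUM_PARAMS, nbytes)
    left = HBM_PER_CHIP_GIB - needed
    # Activations, gradients, optimizer state, and adapter weights
    # all have to fit in whatever is left over.
    print(f"{dtype_name}: weights {needed:.1f} GiB, remaining {left:.1f} GiB per chip")
```

Under these assumptions, an unsharded copy of the weights leaves little or no headroom per chip, which is consistent with needing either a larger slice or a setup that shards the model across chips.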

shub-kris commented 6 months ago

Merged into PR #14