pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Select a model to train and run on TPUs #7190

Open duncantech opened 1 month ago

duncantech commented 1 month ago

📚 Documentation

Using PyTorch/OpenXLA, select a model, get it training and running on Cloud TPUs, and create a tutorial on how you went about doing it.

sitamgithub-MSIT commented 1 month ago

/assigntome

sitamgithub-MSIT commented 1 month ago

@duncantech I am thinking of training the latest Gemma model with PyTorch/XLA. Is that okay?

JackCaoG commented 1 month ago

I think the Gemma model should work out of the box. Take a look at https://github.com/google/gemma_pytorch#try-it-out-with-pytorchxla. Feel free to give it a try and see if we can improve anything.
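For anyone following along, a minimal sketch of what a single-core PyTorch/XLA training step looks like; the tiny linear model, random data, and hyperparameters are placeholders, not the actual Gemma code from the repo linked above:

```python
# Minimal single-core PyTorch/XLA training loop; the model and data are
# stand-ins for Gemma and a real dataset.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # the XLA (TPU) device

model = nn.Linear(128, 2).to(device)          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    inputs = torch.randn(8, 128, device=device)       # placeholder batch
    labels = torch.randint(0, 2, (8,), device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    xm.optimizer_step(optimizer)  # applies the update (syncs replicas if any)
    xm.mark_step()                # materializes the lazily-traced XLA graph
```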

sitamgithub-MSIT commented 1 month ago

> I think the Gemma model should work out of the box. Take a look at https://github.com/google/gemma_pytorch#try-it-out-with-pytorchxla. Feel free to give it a try and see if we can improve anything.

OK, I will look into the Gemma part.

For the other model I am trying, a few things I need to know: do I need to use a free Cloud TPU provider such as Kaggle or Colab, or is it necessary to do it on a v5 TPU in Google Cloud?
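Whichever runtime is chosen, a quick check like the one below (assuming torch_xla is installed in the notebook) confirms that a TPU is actually visible to PyTorch/XLA:

```python
# Sanity check that the runtime (Colab, Kaggle, or a Cloud TPU VM) exposes
# a TPU to PyTorch/XLA. Assumes torch_xla is already installed.
import torch_xla.core.xla_model as xm

device = xm.xla_device()
print(device)                     # e.g. xla:0
print(xm.xla_device_hw(device))   # 'TPU' when a TPU backend is attached
```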

JackCaoG commented 1 month ago

That part I think @duncantech can answer.

duncantech commented 1 month ago

You can work with a free TPU provider if you'd like to get things started.

We should also be able to provide a small amount of v5e capacity to try with too.

duncantech commented 3 weeks ago

@sitamgithub-MSIT we haven't heard an update in a bit; just wondering if you're still working on the issue?

sitamgithub-MSIT commented 3 weeks ago

> @sitamgithub-MSIT we haven't heard an update in a bit; just wondering if you're still working on the issue?

Yes, I am working on it. I am checking this Hugging Face example for Gemma. I am thinking about reproducing the same for CodeGemma, though.
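A rough sketch of that loading path swapped over to CodeGemma; the checkpoint id `google/codegemma-7b` and the bfloat16 choice are assumptions, and the gated weights require a Hugging Face access token:

```python
# Sketch: load CodeGemma through transformers the same way the Gemma
# example does. The checkpoint id and dtype are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/codegemma-7b"  # assumed checkpoint id (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # bf16 is the natural dtype on TPUs
)
```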

sitamgithub-MSIT commented 2 weeks ago

@duncantech I am preparing a script to run on TPUs. Since I am using CodeGemma, which comes with 7B parameters, it will not fit in Colab unless we use a 4-bit version. So should I use the bitsandbytes configuration for that? Or should I just train it in the cloud and see if everything works?
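For reference, a sketch of the bitsandbytes configuration being asked about. One caveat worth flagging: the bitsandbytes 4-bit kernels target CUDA GPUs, so this path suits a Colab GPU runtime rather than a TPU, where loading in bfloat16 is the usual way to cut memory:

```python
# Sketch of a 4-bit load via bitsandbytes (CUDA GPUs only; on TPU, load in
# bfloat16 instead). The checkpoint id is an assumption.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # normal-float 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,   # dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "google/codegemma-7b",                   # assumed checkpoint id
    quantization_config=bnb_config,
)
```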

duncantech commented 2 weeks ago

You can try with the 4-bit version and see what the performance is like, since that would be easier for others to run in the future!