open-mmlab / mmengine

OpenMMLab Foundational Library for Training Deep Learning Models
https://mmengine.readthedocs.io/
Apache License 2.0
1.09k stars 329 forks

[Feature] Trainium support #777

Open austinmw opened 1 year ago

austinmw commented 1 year ago

What is the feature?

Support mmlab training on the AWS Trainium device

Any other context?

I saw in https://github.com/open-mmlab/mmengine/issues/524 that TPU support is planned, so I thought it would make sense to also support AWS's latest AI chips.

C1rN09 commented 1 year ago

Hi, @austinmw! I'll look into it soon. Generally we are willing to support more devices, but we don't have access to Trainium. Are you willing to open a PR for this feature so that we can review it? If you have any questions, we can discuss them either in this issue or in our discussion forum.

C1rN09 commented 1 year ago

I looked through the links. It seems Trainium also needs pytorch_xla. We are planning to add this feature, but it may not happen soon because we have other things of higher priority. However, we always encourage community contributions on it :smile:

austinmw commented 1 year ago

Thanks for your responses! To be honest, I don’t have experience with XLA, but I can reserve an instance and take some time to see if I can get it working.