mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Explore uploading models to Hugging Face #804

Open eu9ene opened 3 weeks ago

eu9ene commented 3 weeks ago

It will likely require converting them to the Hugging Face (PyTorch) format, similar to how it's done for the OPUS-MT and HPLT models:

https://huggingface.co/Helsinki-NLP/opus-mt-zh-en
https://huggingface.co/HPLT/translate-sw-en-v1.0-hplt_opus/tree/main

I also found this converter: https://github.com/huggingface/transformers/blob/main/src/transformers/models/marian/convert_marian_to_pytorch.py

There might be useful code here as well: https://github.com/hplt-project/HPLT-MT-Models/tree/main/v1.0/raw_scripts
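For context on what a converted model buys us, here is a minimal sketch of how a Marian checkpoint that is already in the HF format can be used from Python. It assumes the `transformers` package (with PyTorch and SentencePiece) is installed and uses the OPUS-MT model linked above. The `translate` helper and `DEFAULT_MODEL` constant are illustrative names, not part of this repository.

```python
from typing import List

# Model ID taken from the OPUS-MT link above; any Marian checkpoint that has
# already been converted to the HF format should load the same way.
DEFAULT_MODEL = "Helsinki-NLP/opus-mt-zh-en"


def translate(texts: List[str], model_name: str = DEFAULT_MODEL) -> List[str]:
    """Translate a batch of sentences with a converted Marian model."""
    if not texts:
        return []  # nothing to do, so skip downloading the model

    # Imported lazily so this module can be imported without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Tokenize the batch into padded PyTorch tensors and run generation.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["你好，世界"])` downloads the checkpoint on first use and returns a list with one English string. Our student models would only work this way after a successful conversion.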

gregtatum commented 3 weeks ago

Most of the scripts I looked at don't support the Transformer-RNN structure that we use. Plus we'd have to support the int8shiftAlphaAll mode, which is only in forked Marian. I have some details in: Findings for the Marian to ONNX Investigation

eu9ene commented 3 weeks ago

The HPLT models on HF are about 300 MB in size, so they look more like teacher models with a transformer-base architecture. We can explore how to upload the student models without conversion and whether they would still be usable later in some way. Having proper Python bindings might help integrate them with HF pipelines.
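Uploading without conversion could look roughly like the sketch below, which assumes the `huggingface_hub` package and a token configured in the environment. The file patterns in `ALLOW_PATTERNS`, the helper names, and any repo ID passed in are hypothetical, since the actual student artifact names may differ.

```python
from pathlib import Path
from typing import List

# Hypothetical patterns for a student model export (weights, vocab, config);
# the real artifact names produced by the pipeline may differ.
ALLOW_PATTERNS = ["*.bin", "*.spm", "*.yml", "README.md"]


def files_to_upload(model_dir: str, patterns: List[str] = ALLOW_PATTERNS) -> List[str]:
    """List the top-level files in model_dir that match the upload patterns."""
    root = Path(model_dir)
    matched = {
        path.name
        for pattern in patterns
        for path in root.glob(pattern)
        if path.is_file()
    }
    return sorted(matched)


def upload_student_model(model_dir: str, repo_id: str) -> None:
    """Push the unconverted Marian files to a model repo on the Hub."""
    # Imported lazily so the dry-run helper above works without the package.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(
        folder_path=model_dir,
        repo_id=repo_id,
        repo_type="model",
        allow_patterns=ALLOW_PATTERNS,
    )
```

`files_to_upload` works as a dry run before pushing anything. Files uploaded this way are only hosted on the Hub. They would not load through `transformers` without a conversion step or separate Marian bindings.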

eu9ene commented 3 weeks ago

> Most of the scripts I looked at don't support the Transformer-RNN structure that we use. Plus we'd have to support the int8shiftAlphaAll mode, which is only in forked Marian. I have some details in: Findings for the Marian to ONNX Investigation

Yes, but it's ONNX. I'm sure everything that's in Marian can be implemented in PyTorch. That doesn't mean we have to go this way, as it's definitely some work, but it would be interesting to explore.

marco-c commented 3 weeks ago

We could start by uploading teacher models.

gregtatum commented 3 weeks ago

> Yes, but it's ONNX

The investigation wasn't just for ONNX, as there are links out to other converters. But yes, I agree that if we can convert to other formats, it gives us many more options for how we can run these things.