Support 100 translation languages with m2m-100

jncraton / languagemodels

Explore large language models in 512MB of RAM

https://jncraton.github.io/languagemodels/

MIT License

Support 100 translation languages with m2m-100 #5

Open Bachstelze opened 1 year ago

Bachstelze commented 1 year ago

We could support more translation directions by running m2m-100 with CTranslate2, or by using Easy-Translate.

Rohith04MVK commented 9 months ago

Is this something currently being worked on? If not, I would love to contribute.

Bachstelze commented 9 months ago

In the long term, I am looking into better translation support from LLMs such as Unbabel's Tower models, though it will take additional steps until general-purpose models include this enhancement.

jncraton commented 9 months ago

@Rohith04MVK This is not actively being worked on, but if folks want this I'm happy for it to be added. I haven't thought about this deeply, but I would imagine this could be implemented as something like:

def translate(text, src_lang, dst_lang):
    """Translate `text` from `src_lang` to `dst_lang`"""
    ...

It should be a lot like the code function.
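
A minimal sketch of how that stub could be filled in for M2M-100 running on CTranslate2 follows. It is not the project's implementation. The extra `tokenizer` and `translator` keyword arguments, the helper `m2m100_lang_token`, and the model path in the usage comment are assumptions made for illustration; the `__xx__` language-token format is how M2M-100 marks the target language.

```python
def m2m100_lang_token(lang):
    """Return the M2M-100 language token for a language code, e.g. "de" -> "__de__"."""
    return f"__{lang}__"


def translate(text, src_lang, dst_lang, *, tokenizer, translator):
    """Translate `text` from `src_lang` to `dst_lang`.

    `tokenizer` is expected to behave like a Hugging Face M2M-100 tokenizer and
    `translator` like a `ctranslate2.Translator`. Passing them in is an
    assumption of this sketch, not part of the interface proposed above.
    """
    # Tell the tokenizer which source-language token to prepend.
    tokenizer.src_lang = src_lang
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

    # Force the decoder to start with the target-language token.
    target_prefix = [m2m100_lang_token(dst_lang)]
    results = translator.translate_batch([source], target_prefix=[target_prefix])

    # The best hypothesis starts with the language token; drop it before decoding.
    target = results[0].hypotheses[0][1:]
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))


# Usage with the real libraries (needs ctranslate2, transformers, and a
# converted model directory; the directory name here is hypothetical):
#
#   import ctranslate2, transformers
#   translator = ctranslate2.Translator("m2m100_418m-ct2-int8")
#   tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/m2m100_418M")
#   print(translate("Hello world!", "en", "de",
#                   tokenizer=tokenizer, translator=translator))
```

Passing the tokenizer and translator in keeps the sketch independent of any particular model, which matches the goal of a simple interface that does not expose model details to callers.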

Rohith04MVK commented 9 months ago

I'd love to help! While I think the M2M-100 418M model with CTranslate2 (>512 MB) has potential, are there any other models or approaches we should consider before moving forward?

jncraton commented 9 months ago

My approach has been to try to define the simplest possible interface without worrying too much about specific models. New and improved models are created regularly, and one of my goals for this project is to provide easy access to the current state-of-the-art model for its size without users of the package needing to keep track of the latest and greatest models.

There's a priority list of available models that is used to determine which model to use. The package searches through this list in order until a model is found that matches the current inference requirements (max RAM, license, tuning, etc). I would hope that we would be able to do the same for translation models.
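
The priority-list lookup described above can be sketched roughly as follows. This is an illustration only: the `ModelInfo` fields, the `select_model` function, and the model entries (names, sizes, licenses) are all hypothetical and are not taken from the package.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class ModelInfo:
    name: str
    ram_gb: float   # approximate memory needed for inference
    license: str
    tuning: str     # what the model was tuned for, e.g. "translation"


# Hypothetical entries, ordered from most to least preferred.
MODELS = [
    ModelInfo("example-large", 1.2, "cc-by-nc-4.0", "translation"),
    ModelInfo("example-medium", 0.6, "mit", "translation"),
    ModelInfo("example-small", 0.3, "apache-2.0", "translation"),
]


def select_model(
    models: Iterable[ModelInfo],
    max_ram_gb: float,
    allowed_licenses: Optional[Set[str]] = None,
    tuning: Optional[str] = None,
) -> Optional[ModelInfo]:
    """Return the first model in priority order that meets every requirement."""
    for model in models:
        if model.ram_gb > max_ram_gb:
            continue  # too large for the configured RAM budget
        if allowed_licenses is not None and model.license not in allowed_licenses:
            continue  # license not acceptable to the caller
        if tuning is not None and model.tuning != tuning:
            continue  # tuned for a different task
        return model
    return None  # nothing in the list satisfies the constraints


best = select_model(MODELS, max_ram_gb=1.0, allowed_licenses={"mit", "apache-2.0"})
print(best.name if best else "no matching model")
```

Because the list is ordered by preference, adding a newer or better model only means inserting it at the right position; callers keep using the same constraints.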

m2m100 looks like a reasonable place to start from my point of view. I just uploaded the ct2 int8 quantized models.

Bachstelze commented 9 months ago

NLLB models are also supported by CTranslate2. They support up to 200 languages but are an order of magnitude bigger.

Rohith04MVK commented 8 months ago

Was the sentencepiece.bpe.model intentionally omitted from the repo?

jncraton commented 8 months ago

That's an oversight on my part. I have a notebook that I use to quickly convert these models. I didn't see that this file needed to be added to the files copied by ct2-transformers-converter. I've added those files now.
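
For reference, a conversion that also copies the tokenizer file might be scripted roughly like this. It is a sketch, not the notebook mentioned above: the helper `build_convert_command` and the output directory name are hypothetical, and the command is only built here, not executed. The flags shown (`--model`, `--output_dir`, `--quantization`, `--copy_files`) are options of `ct2-transformers-converter`.

```python
import shlex


def build_convert_command(
    model,
    output_dir,
    extra_files=("sentencepiece.bpe.model",),
    quantization="int8",
):
    """Build the argument list for ct2-transformers-converter.

    `extra_files` are files from the source model that should be copied
    alongside the converted weights (here, the SentencePiece model).
    """
    cmd = [
        "ct2-transformers-converter",
        "--model", model,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]
    if extra_files:
        cmd += ["--copy_files", *extra_files]
    return cmd


# The output directory name is an assumption chosen for this example.
cmd = build_convert_command("facebook/m2m100_418M", "m2m100_418m-ct2-int8")
print(shlex.join(cmd))

# To actually run the conversion (requires ctranslate2 and transformers):
#   import subprocess
#   subprocess.run(cmd, check=True)
```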