Open Bachstelze opened 1 year ago
Is this something currently being worked on? If not, I would love to contribute.
In the long term, I am looking into better translation support from LLMs such as Unbabel's Tower, though it will take additional steps before general-purpose models include this enhancement.
@Rohith04MVK This is not actively being worked on, but if folks want this I'm happy for it to be added. I haven't thought about this deeply, but I would imagine this could be implemented as something like:
def translate(text, src_lang, dst_lang):
    """Translate `text` from `src_lang` to `dst_lang`"""
    ...
It should be a lot like the code function.
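To make the proposed interface a bit more concrete, here is a minimal sketch, assuming a converted M2M-100 model directory and the Hugging Face tokenizer for `facebook/m2m100_418M`. The `backend` parameter, the `Backend` type, and `m2m100_backend` are hypothetical names used only for illustration and are not part of the package. The tokenizer attribute `lang_code_to_token` is assumed to be available in the installed `transformers` version.

```python
from typing import Callable

# Hypothetical backend type: (text, src_lang, dst_lang) -> translated text.
Backend = Callable[[str, str, str], str]


def m2m100_backend(model_dir: str, hf_name: str = "facebook/m2m100_418M") -> Backend:
    """Build a backend around a CTranslate2-converted M2M-100 model.

    Heavy dependencies are imported lazily so the interface itself
    can be imported without them.
    """
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(model_dir)
    tokenizer = transformers.AutoTokenizer.from_pretrained(hf_name)

    def run(text: str, src_lang: str, dst_lang: str) -> str:
        # M2M-100 expects the source language on the tokenizer and the
        # target language token as a decoding prefix.
        tokenizer.src_lang = src_lang
        source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
        prefix = [tokenizer.lang_code_to_token[dst_lang]]
        results = translator.translate_batch([source], target_prefix=[prefix])
        target = results[0].hypotheses[0][1:]  # drop the language token
        return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))

    return run


def translate(text: str, src_lang: str, dst_lang: str, backend: Backend) -> str:
    """Translate `text` from `src_lang` to `dst_lang` using `backend`."""
    if src_lang == dst_lang or not text.strip():
        return text  # nothing to translate
    return backend(text, src_lang, dst_lang)
```

Keeping the model behind a callable means the public signature stays the same if a different model is swapped in later.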
I'd love to help! While I think the M2M-100 418M model with CTranslate2 (>512 MB) has potential, are there any other models or approaches we should consider before moving forward?
My approach has been to try to define the simplest possible interface without worrying too much about specific models. New and improved models are created regularly, and one of my goals for this project is to provide easy access to the current state-of-the-art model for its size without users of the package needing to keep track of the latest and greatest models.
There's a priority list of available models that is used to determine which model to use. The package searches through this list in order until a model is found that matches the current inference requirements (max RAM, license, tuning, etc.). I would hope that we would be able to do the same for translation models.
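The priority-list search described above could look roughly like the following. This is only a sketch: `ModelInfo`, `PRIORITY_LIST`, `select_model`, and every entry in the list are made-up placeholders, not the package's actual data or API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class ModelInfo:
    name: str      # hypothetical identifier
    task: str      # e.g. "translation"
    ram_gb: float  # approximate memory needed for inference
    license: str   # e.g. "mit"


# Ordered from most to least preferred. All entries are placeholders.
PRIORITY_LIST = [
    ModelInfo("example-translate-large", "translation", 2.5, "cc-by-nc-4.0"),
    ModelInfo("example-translate-small", "translation", 0.5, "mit"),
    ModelInfo("example-chat-small", "chat", 0.4, "apache-2.0"),
]


def select_model(
    task: str,
    max_ram_gb: float,
    allowed_licenses: Optional[Iterable[str]] = None,
    models: Sequence[ModelInfo] = PRIORITY_LIST,
) -> Optional[ModelInfo]:
    """Return the first model in priority order that meets every requirement."""
    allowed = set(allowed_licenses) if allowed_licenses is not None else None
    for model in models:
        if model.task != task:
            continue
        if model.ram_gb > max_ram_gb:
            continue
        if allowed is not None and model.license not in allowed:
            continue
        return model
    return None  # no model satisfies the constraints
```

For example, `select_model("translation", max_ram_gb=1.0)` skips the larger placeholder entry and returns the smaller one, so callers never need to name a specific model.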
m2m100 looks like a reasonable place to start from my point of view. I just uploaded the ct2 int8 quantized models.
NLLB models are also supported by CTranslate2. They support up to 200 languages but are an order of magnitude larger.
Was the sentencepiece.bpe.model intentionally omitted from the repo?
That's an oversight on my part. I have a notebook that I use to quickly convert these models. I didn't see that this file needed to be added to the files copied by ct2-transformers-converter. I've added those files now.
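For anyone repeating the conversion, a rough sketch of how the tokenizer file can be copied along with the weights, plus a quick check that it actually landed in the output directory. This is not the notebook mentioned above. `convert_m2m100`, `missing_files`, and `TOKENIZER_FILES` are hypothetical helper names, and the file list contains only the file named in this thread; other tokenizer files may also be needed.

```python
from pathlib import Path
from typing import List, Sequence

# Only the file mentioned in this thread; extend as needed (assumption).
TOKENIZER_FILES = ["sentencepiece.bpe.model"]


def convert_m2m100(
    output_dir: str,
    model_name: str = "facebook/m2m100_418M",
    copy_files: Sequence[str] = TOKENIZER_FILES,
) -> None:
    """Convert a Hugging Face M2M-100 checkpoint to int8 CTranslate2 format."""
    # Imported lazily so the check below works without ctranslate2 installed.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(model_name, copy_files=list(copy_files))
    converter.convert(output_dir, quantization="int8", force=True)


def missing_files(model_dir: str, required: Sequence[str] = TOKENIZER_FILES) -> List[str]:
    """Return the required files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

Running `missing_files(output_dir)` after conversion and before uploading would have flagged the omitted `sentencepiece.bpe.model`.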
We could support more translation directions with M2M-100 in CTranslate2 or use easy translate.