Closed — PJ-Finlay closed this issue 3 years ago
Hey! Cool!

Do you plan on binding to the Python library for CTranslate2, or will you just run a binary with std's `Command`? And if you turn this into a crate, how would you make the build reproducible? Looking at CTranslate2, it says the models are over 100 MB, which is WAY too big if you ever want to publish it as a crate on crates.io. Also, if you're going to pull in the Python library for ctranslate2 anyway, why not use the argostranslate Python library instead? It has sentence boundary detection and is much easier to use with packaged language models.
Thanks for making me aware of this. I can see it's trying to do something similar to libretranslate-rs, but we very clearly have different goals.

libretranslate-rs was designed as an HTTP REST client instead of a binding to argostranslate or ctranslate2, because of the awkward abstractions between Python libraries and Rust, on top of the fact that the models have file sizes far too large for crates.io.

Though if you find a way to solve those issues I'd LOVE to see how you figure it out, and where you go with it.
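For reference, the HTTP approach looks roughly like this — a minimal sketch using reqwest (with the "blocking" and "json" features) and serde_json against the public LibreTranslate `/translate` endpoint, not the actual libretranslate-rs internals:

```rust
// Minimal sketch of calling a LibreTranslate server over HTTP.
// Assumes reqwest (features "blocking", "json") and serde_json are in Cargo.toml;
// this is illustrative, not the real libretranslate-rs implementation.
use serde_json::{json, Value};

fn translate(text: &str, source: &str, target: &str) -> Result<String, Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let resp: Value = client
        .post("https://libretranslate.com/translate")
        .json(&json!({
            "q": text,
            "source": source,
            "target": target,
            "format": "text",
        }))
        .send()?
        .json()?;
    // A successful response contains {"translatedText": "..."}.
    Ok(resp["translatedText"].as_str().unwrap_or_default().to_string())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("{}", translate("Hello, world!", "en", "es")?);
    Ok(())
}
```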
This is different from libretranslate-rs. I'm assuming that for libretranslate-rs you want to call the API?
This definitely isn't production ready, and I have no immediate plans to make it a crate.
If you want to do full translations with sentence boundary detection and tokenization, then binding to the Argos Translate Python bindings is probably a much better strategy. This is just an experiment with calling CTranslate2 directly.
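If you went that route, a rough sketch of driving the Argos Translate Python package from Rust via PyO3 might look like the following. The `argostranslate.translate.translate(text, from_code, to_code)` call is an assumption about the Python API here, so check it against the installed version:

```rust
// Rough sketch: calling the Argos Translate Python package from Rust with PyO3.
// Assumes pyo3 with the "auto-initialize" feature and argostranslate installed in
// the Python environment; the translate() signature is an assumption.
use pyo3::prelude::*;

fn translate(text: &str, from_code: &str, to_code: &str) -> PyResult<String> {
    Python::with_gil(|py| {
        let module = py.import("argostranslate.translate")?;
        let result = module
            .getattr("translate")?
            .call1((text, from_code, to_code))?;
        result.extract::<String>()
    })
}

fn main() -> PyResult<()> {
    println!("{}", translate("Hello, world!", "en", "es")?);
    Ok(())
}
```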
Okay okay sounds great.
If you wanted to do translation without a Python interpreter, you would want to bind directly to CTranslate2 and SentencePiece.
There's functionality for using a CTranslate2 seq2seq model to do sentence boundary detection without Stanza that you would probably need to recreate.
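Since CTranslate2's public API is C++, a direct binding would probably go through a small C shim (or cxx/bindgen) around `ctranslate2::Translator`, linked from a build script. The shim function names below are made up purely for illustration and are not part of any real library:

```rust
// Hypothetical sketch of what a direct binding could look like. In practice you
// would write a thin C shim around ctranslate2::Translator::translate_batch (and
// similarly around SentencePiece), compile and link it from build.rs, and expose
// it to Rust like this. shim_translate/shim_free are invented names.
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

extern "C" {
    fn shim_translate(model_dir: *const c_char, tokens: *const c_char) -> *mut c_char;
    fn shim_free(s: *mut c_char);
}

fn translate_tokens(model_dir: &str, tokens: &str) -> String {
    let model_dir = CString::new(model_dir).unwrap();
    let tokens = CString::new(tokens).unwrap();
    unsafe {
        let out = shim_translate(model_dir.as_ptr(), tokens.as_ptr());
        let result = CStr::from_ptr(out).to_string_lossy().into_owned();
        shim_free(out);
        result
    }
}
```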
Thanks! I'll look into this.
I think I'll try to do this soon, but in a different repository under a different name.
Please post once you start! I don't have much Rust experience but would be curious.
My repo was for learning Rust and isn't intended to be production ready; I bind to CTranslate2 through the CLI. I think to do this correctly you would want to connect to the SentencePiece/CTranslate2 code through the C interface and be able to build both of them using Rust's package system.
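The CLI approach is basically just `std::process::Command`. A minimal sketch, where the binary path, flags, and file names are placeholders for whatever the locally built CTranslate2 command-line translator actually expects:

```rust
// Minimal sketch of shelling out to a CTranslate2 CLI translator.
// The binary path and arguments are placeholders, not documented flags.
use std::process::Command;

fn main() -> std::io::Result<()> {
    let output = Command::new("./translator")   // placeholder binary path
        .arg("--model")
        .arg("ende_ctranslate2")                // placeholder model directory
        .arg("--src")
        .arg("input.tok.txt")                   // pre-tokenized source text
        .output()?;

    println!("{}", String::from_utf8_lossy(&output.stdout));
    Ok(())
}
```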
If you're still planning to do this, it looks like you don't even need a tokenizer:
Let me know if you need a CTranslate model trained without tokenization.
It can currently run inference on a CTranslate2 model.