johnml1135 opened 11 months ago
Here is a design:
A good place to survey model-serving frameworks; we may be able to use something completely standard for serving models: https://github.com/tensorchord/Awesome-LLMOps#large-model-serving
We could use our existing system to do this easily. If there is no training data in the corpus but there is pretranslation data, then we simply skip the fine-tuning step in the build job and just perform inferencing.
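The build-job decision described above could be sketched as follows. This is a minimal illustration, assuming a corpus record with boolean flags; the function and field names are hypothetical and not the actual `machine.py` API.

```python
# Hypothetical sketch of the build-job branching described above.
# The function and field names are illustrative, not the real API.

def plan_build(corpus):
    """Decide which build steps to run for a corpus.

    Fine-tuning is skipped when the corpus has no parallel training
    data; inferencing runs whenever there is pretranslation data.
    """
    steps = []
    if corpus.get("has_training_data"):
        steps.append("fine_tune")
    if corpus.get("has_pretranslation_data"):
        steps.append("inference")
    return steps

# Inference-only corpus: no training pairs, only segments to pretranslate.
print(plan_build({"has_training_data": False,
                  "has_pretranslation_data": True}))  # → ['inference']
```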
This is specifically for individual inferencing, not batch. SIL Converters is wondering if we could have a Google Translate-like service that would work for the 200 languages. I see this as a thing in itself (workflow: make an engine, run a "build" that wouldn't actually build anything (see #45), and then use the existing "translate" endpoints to get a live translation, just as it works for SMT today). The work would be to have the GPU host the model to do the live inferencing. This could then be extended to hosting fine-tuned models later.
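The workflow above — an engine whose "translate" endpoint serves live translations — might look like this from the client side. The URL path and JSON fields here are assumptions for illustration, not the actual endpoint shape; the transport is passed in as a callable so a real HTTP client (or a test double) can be plugged in.

```python
# Illustrative client for the live-translation workflow sketched above.
# The endpoint path and payload fields are assumed, not the real API.

def translate_segment(post, engine_id, text):
    """POST one segment to an engine's translate endpoint.

    `post` is any callable(url, payload) -> response dict, so the same
    code works against a real HTTP client or a test double.
    """
    url = f"/translation/engines/{engine_id}/translate"
    return post(url, {"segment": text})

# Test double standing in for the HTTP layer.
def fake_post(url, payload):
    return {"url": url, "translation": payload["segment"].upper()}

result = translate_segment(fake_post, "nllb-200", "hello world")
print(result["translation"])  # → HELLO WORLD
```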
I see. When designing this, we should take into account how we will support live inferencing for fine-tuned engines and use a similar approach.
@johnml1135, this hasn't been solved by @mshannon-sil's recent edits to machine.py dealing with config passing for inference-only jobs, has it?
This would be for SIL Converters.