sillsdev / serval

A REST API for natural language processing services

Live inference off of NLLB-200 with GPU #49

Open johnml1135 opened 11 months ago

johnml1135 commented 11 months ago

This would be for SIL Converters.

johnml1135 commented 11 months ago

Here is a design:

johnml1135 commented 11 months ago

Here is a place to check out the best serving apps; we may be able to use something completely standard for serving models: https://github.com/tensorchord/Awesome-LLMOps#large-model-serving

ddaspit commented 11 months ago

We could use our existing system to do this easily. If there is no training data in the corpus but there is pretranslation data, then we simply skip the fine-tuning step in the build job and just perform inferencing.
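Roughly, that branch in the build job would look something like this (a toy sketch only; none of these names come from serval or machine.py):

```python
# Sketch of the build-job branch described above. All names here
# (BuildInput, fine_tune, translate_batch) are made up for illustration.
from dataclasses import dataclass, field


@dataclass
class BuildInput:
    training_pairs: list = field(default_factory=list)           # parallel segments for fine-tuning
    pretranslation_segments: list = field(default_factory=list)  # source-only segments to translate


def fine_tune(model_name: str, pairs: list) -> str:
    # Stand-in for the real fine-tuning step; returns a (pretend) checkpoint id.
    return f"{model_name}-finetuned"


def translate_batch(model_name: str, segments: list) -> list:
    # Stand-in for NLLB-200 inference.
    return [f"<{model_name}> {s}" for s in segments]


def run_build(job: BuildInput, base_model: str = "facebook/nllb-200-distilled-600M") -> list:
    model = base_model
    if job.training_pairs:
        # Normal build: fine-tune first.
        model = fine_tune(base_model, job.training_pairs)
    # No training data but pretranslation data: skip straight to inferencing.
    return translate_batch(model, job.pretranslation_segments)


print(run_build(BuildInput(pretranslation_segments=["Hello, world."])))
```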

johnml1135 commented 11 months ago

This is specifically for individual inferencing, not batch. SIL Converters is wondering if we could have a Google Translate-like service that would work for the 200 languages. I see this as a thing in itself (workflow: create an engine, run a "build" that wouldn't actually build (see #45), and then use the existing "translate" endpoints to get a live translation, just as it works for SMT today). The work would be to have the GPU host the model to do the live inferencing. This could then be extended to hosting fine-tuned models later.
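For the GPU-hosting part, one completely standard option is the Hugging Face transformers translation pipeline with an NLLB-200 checkpoint behind a small HTTP endpoint. A rough sketch (the FastAPI wrapper, route, and request shape here are illustrative only, not the existing serval translate API):

```python
# Minimal live-inference sketch: NLLB-200 behind a tiny HTTP endpoint.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

MODEL_NAME = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
DEVICE = 0 if torch.cuda.is_available() else -1  # use the GPU when present

app = FastAPI()


class TranslateRequest(BaseModel):
    text: str
    src_lang: str = "eng_Latn"  # NLLB-200 uses FLORES-200 language codes
    tgt_lang: str = "fra_Latn"


@app.post("/translate")
def translate(req: TranslateRequest):
    # Building the pipeline per request keeps the example short; a real
    # service would cache one pipeline per language pair.
    translator = pipeline(
        "translation",
        model=model,
        tokenizer=tokenizer,
        src_lang=req.src_lang,
        tgt_lang=req.tgt_lang,
        device=DEVICE,
        max_length=400,
    )
    return {"translation": translator(req.text)[0]["translation_text"]}
```

Something like this could run under uvicorn on the GPU host, with the existing serval "translate" endpoints proxying to it.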

ddaspit commented 11 months ago

I see. When designing this, we should take into account how we will support live inferencing for fine-tuned engines and use a similar approach.
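On the serving side it could literally be the same code path, since transformers loads a local fine-tuned checkpoint directory through the same call as a hub model id (the helper and directory path below are made up, not part of serval or machine.py):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def load_engine(path_or_name: str):
    """Load the base NLLB-200 model or a fine-tuned engine checkpoint;
    transformers treats a hub id and a local directory the same way."""
    return (
        AutoTokenizer.from_pretrained(path_or_name),
        AutoModelForSeq2SeqLM.from_pretrained(path_or_name),
    )


# Generic NLLB-200 service:
tokenizer, model = load_engine("facebook/nllb-200-distilled-600M")
# Fine-tuned engine (hypothetical local checkpoint directory):
# tokenizer, model = load_engine("/models/engines/my-engine/checkpoint")
```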

Enkidu93 commented 8 months ago

@johnml1135, this hasn't been solved by @mshannon-sil's recent edits to machine.py having to do with config passing for inference-only jobs, has it?