johnml1135 opened 11 months ago
Here is a design:
A good place to survey model-serving frameworks; we may be able to use something completely standard for serving models: https://github.com/tensorchord/Awesome-LLMOps#large-model-serving
We could use our existing system to do this easily. If there is no training data in the corpus but there is pretranslation data, then we simply skip the fine-tuning step in the build job and just perform inferencing.
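The build-job decision described above could be sketched as follows. This is a minimal illustration, assuming a corpus record with boolean flags; the function and field names are hypothetical and not the actual `machine.py` API.

```python
# Hypothetical sketch of the build-job branching described above.
# The function and field names are illustrative, not the real API.

def plan_build(corpus):
    """Decide which build steps to run for a corpus.

    Fine-tuning is skipped when the corpus has no parallel training
    data; inferencing runs whenever there is pretranslation data.
    """
    steps = []
    if corpus.get("has_training_data"):
        steps.append("fine_tune")
    if corpus.get("has_pretranslation_data"):
        steps.append("inference")
    return steps

# Inference-only corpus: no training pairs, only segments to pretranslate.
print(plan_build({"has_training_data": False,
                  "has_pretranslation_data": True}))  # → ['inference']
```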
This is specifically for individual inferencing, not batch. SIL Converters is wondering if we could have a Google Translate-like service that would work for the 200 languages. I see this as a thing in itself (workflow: make an engine, run a "build" that wouldn't actually build anything (see #45), and then use the existing "translate" endpoints to get a live translation, just as it works for SMT today). The work would be to have the GPU host the model to do the live inferencing. This could then be extended to hosting fine-tuned models later.
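The workflow above — an engine whose "translate" endpoint serves live translations — might look like this from the client side. The URL path and JSON fields here are assumptions for illustration, not the actual endpoint shape; the transport is passed in as a callable so a real HTTP client (or a test double) can be plugged in.

```python
# Illustrative client for the live-translation workflow sketched above.
# The endpoint path and payload fields are assumed, not the real API.

def translate_segment(post, engine_id, text):
    """POST one segment to an engine's translate endpoint.

    `post` is any callable(url, payload) -> response dict, so the same
    code works against a real HTTP client or a test double.
    """
    url = f"/translation/engines/{engine_id}/translate"
    return post(url, {"segment": text})

# Test double standing in for the HTTP layer.
def fake_post(url, payload):
    return {"url": url, "translation": payload["segment"].upper()}

result = translate_segment(fake_post, "nllb-200", "hello world")
print(result["translation"])  # → HELLO WORLD
```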
I see. When designing this, we should take into account how we will support live inferencing for fine-tuned engines and use a similar approach.
@johnml1135, this hasn't been solved by @mshannon-sil's recent edits to machine.py dealing with config passing for inference-only jobs, has it?
This would be for SIL Converters.