Use clearml-serving (which is essentially the NVIDIA Triton Inference Server with ClearML orchestration around it). This will be deployed on a single GPU, such as my RTX 3090.
We take that deployment (a docker-compose stack) and add one more container to it: a service that registers itself via webhooks with all ClearML deployments. Note that this service could also be configured to act as a simple echo server for testing. This would reside in Serval.
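A rough sketch of what that extra container might look like, assuming a small FastAPI service; the `/notify` route, `SERVAL_WEBHOOK_URL`, and `ECHO_MODE` flag are made-up names for illustration, not an existing ClearML or Serval API:

```python
# Hypothetical sidecar service for the clearml-serving docker-compose stack.
# It forwards payloads from the serving stack to Serval via a webhook, or
# simply echoes them back when ECHO_MODE is set (useful for testing).
import os

import httpx
from fastapi import FastAPI, Request

app = FastAPI()

# Assumed configuration; names are illustrative only.
SERVAL_WEBHOOK_URL = os.environ.get("SERVAL_WEBHOOK_URL", "")
ECHO_MODE = os.environ.get("ECHO_MODE", "false").lower() == "true"


@app.post("/notify")
async def notify(request: Request) -> dict:
    """Receive a payload from the serving stack and pass it on to Serval."""
    payload = await request.json()
    if ECHO_MODE or not SERVAL_WEBHOOK_URL:
        # Echo-server behaviour for development: just return what we got.
        return {"echo": payload}
    async with httpx.AsyncClient() as client:
        response = await client.post(SERVAL_WEBHOOK_URL, json=payload)
    return {"forwarded": True, "status": response.status_code}
```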
We create an endpoint to register the serving webhook.
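A sketch of what that registration endpoint could look like, again as an illustrative FastAPI example; the `/webhooks` route, payload fields, and in-memory store are assumptions, and the real endpoint would live in Serval's existing API:

```python
# Hypothetical webhook-registration endpoint; field names are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# In a real deployment these would live in a database, not in memory.
registered_webhooks: dict[str, str] = {}


class WebhookRegistration(BaseModel):
    deployment_id: str  # which ClearML serving deployment is registering
    callback_url: str   # where inference results should be sent


@app.post("/webhooks")
def register_webhook(registration: WebhookRegistration) -> dict:
    """Register (or replace) the serving webhook for a deployment."""
    registered_webhooks[registration.deployment_id] = registration.callback_url
    return {"deployment_id": registration.deployment_id, "registered": True}
```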
We create a new endpoint, not connected to engines, that can run inference against NLLB and, in the future, against a specific engineID.
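A sketch of how that endpoint could forward requests to the serving stack, under the assumption that clearml-serving exposes an HTTP inference route; the `SERVING_URL`, route name, and payload shape below are illustrative only:

```python
# Hypothetical inference endpoint that is not tied to a translation engine.
# It forwards the request to the clearml-serving stack; the serving URL,
# endpoint name, and payload shape are all assumptions for illustration.
import os

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

SERVING_URL = os.environ.get("SERVING_URL", "http://localhost:8080/serve/nllb")


class TranslationRequest(BaseModel):
    text: str
    source_lang: str
    target_lang: str
    engine_id: str | None = None  # later: route to a fine-tuned engine


@app.post("/translation/infer")
async def infer(request: TranslationRequest) -> dict:
    """Run live inference against the NLLB deployment (no engine required)."""
    async with httpx.AsyncClient() as client:
        response = await client.post(SERVING_URL, json=request.model_dump())
    response.raise_for_status()
    return response.json()
```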
We have enough confidence to implement the first stage of this: get clearml-serving working, proxy through webhooks to Serval, and test in development.
The security model is important: we should really be proxying everything through Scripture Forge (that is, using a Paratext login). This should be acceptable for SIL Converters as well, since they should be able to authenticate with Paratext and go through a Scripture Forge proxy.
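To make the idea concrete, a rough sketch of the proxy check, where the route, header handling, serving URL, and validation helper are all assumptions:

```python
# Hypothetical sketch of the proxy check: every request must carry a token
# obtained via a Paratext login to Scripture Forge before it is forwarded
# to the serving endpoint. Names and URLs are illustrative only.
import os

import httpx
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

SERVING_URL = os.environ.get("SERVING_URL", "http://localhost:8080/serve/nllb")


async def is_valid_scripture_forge_token(token: str) -> bool:
    # Placeholder: in practice this would validate the token against
    # Scripture Forge / Paratext authentication.
    return bool(token)


@app.post("/proxy/infer")
async def proxied_infer(payload: dict, authorization: str = Header(default="")) -> dict:
    """Only forward inference requests that carry a valid login token."""
    token = authorization.removeprefix("Bearer ").strip()
    if not await is_valid_scripture_forge_token(token):
        raise HTTPException(status_code=401, detail="Paratext login required")
    async with httpx.AsyncClient() as client:
        response = await client.post(SERVING_URL, json=payload)
    return response.json()
```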
This security model also allows for live inference against fine-tuned models.
This is to support the integration with SIL Converters, and any changes needed to the API. It is dependent on #49.