substratusai / images

Official Substratus Container Images

Create model transformation and serving images for efficient serving #18

Open brandonjbjelland opened 1 year ago


Description

CTranslate2, llama.cpp, llama2.c, and similar projects are incredibly useful for inference on lower-end GPUs, or in many cases even CPUs. In particular, we could use these derivative models:

  1. For local use cases where only an older GPU, or no GPU at all, is available.
  2. As a speed and cost optimization when running models on a cloud provider.

For the models that support these transformations, we get both benefits from a single conversion step.
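To make that concrete, here is a minimal sketch of CPU inference against a model already converted to CTranslate2 format. The model directory and tokenizer name are placeholders for illustration, not anything substratus ships today:

```python
# Minimal sketch: CPU inference with a CTranslate2-converted model.
# "llama2-ct2/" and the tokenizer name are placeholders for illustration.
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
generator = ctranslate2.Generator("llama2-ct2", device="cpu", compute_type="int8")

# CTranslate2 takes string tokens, not ids, as generation input.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("The capital of France is"))
results = generator.generate_batch([tokens], max_length=32)
print(tokenizer.decode(results[0].sequences_ids[0]))
```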

Proposal: Dedicated Translation Model Images

Objective

Create a dedicated model image designed to run against models that are already established in substratus. The image will support a known list of models or model architectures and will enable efficient execution on more modest hardware.

Implementation Steps

  1. Model Selection: Identify and select the translation techniques and supported models that are most relevant and commonly used within the community.
  2. Conversion Process: Develop a conversion process (similar to a model loader image) that takes an existing model (either a base model loaded from HF or a fine-tuned model) and produces a new, optimized model suitable for running on lower-end GPUs or CPUs (see the sketch after this list).
  3. Make these derivative models easy to identify: We need an easy way to identify models that have been translated, since downstream usage of them within servers will differ.
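As a rough sketch of step 2 (with the tagging from step 3 folded in), the conversion could use CTranslate2's Transformers converter and drop a marker file next to the output. The `substratus-transform.json` convention is hypothetical; substratus defines no such file today:

```python
# Sketch of the conversion step, assuming CTranslate2 as the backend.
# The "substratus-transform.json" marker is a hypothetical convention for
# flagging derivative models (step 3); substratus defines no such file yet.
import json
from pathlib import Path

from ctranslate2.converters import TransformersConverter


def convert_model(model_dir: str, output_dir: str, quantization: str = "int8") -> None:
    """Convert an HF-format model to CTranslate2 and record its provenance."""
    TransformersConverter(model_dir).convert(output_dir, quantization=quantization, force=True)

    # Tag the output so servers can distinguish it from the base model.
    marker = {
        "source_model": model_dir,
        "format": "ctranslate2",
        "quantization": quantization,
    }
    Path(output_dir, "substratus-transform.json").write_text(json.dumps(marker, indent=2))
```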

A potential path: we could roll all of these tools (llama.cpp, llama2.c, CTranslate2, et al.) into a single model-transformer image that can automatically detect, for a given base model, how to translate it, or whether translation is possible at all.
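That detection could key off the `architectures` field in the model's HF config. A sketch of the dispatch, where the architecture-to-backend map is illustrative and far from exhaustive:

```python
# Hypothetical backend dispatch for a combined model-transformer image.
# The architecture-to-backend map is illustrative, not exhaustive.
import json
from pathlib import Path
from typing import Optional

BACKENDS = {
    "LlamaForCausalLM": "llama.cpp",
    "FalconForCausalLM": "ctranslate2",
    "GPTNeoXForCausalLM": "ctranslate2",
}


def pick_backend(model_dir: str) -> Optional[str]:
    """Return a translation backend for the model, or None if unsupported."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    for arch in config.get("architectures", []):
        if arch in BACKENDS:
            return BACKENDS[arch]
    return None  # translation is not possible with the known toolset
```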

Pairing with Model Server Images

Create model server images that can run models that have been converted through the aforementioned processes. This ensures that the models can be deployed and run in various environments, such as local machines or cloud providers.

Implementation Steps

  1. Server Selection: Some of these converted models may be servable through basaran, or possibly none of them. Alternatively, we could build a server on fastapi that emits server-sent events - potentially this already exists, unsure. I think it'd be good if we continued our path of supporting the OpenAI API here (e.g., v1/completions); a sketch follows this list. The UI is a nice-to-have. A potential project that can help with llama.cpp models: https://github.com/abetlen/llama-cpp-python
  2. Make the work available: Publish the server images on Docker Hub.
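For step 1, here is a minimal sketch of an OpenAI-style v1/completions endpoint that streams server-sent events via llama-cpp-python. The model path is an assumption about how the image would be laid out, and a real server would need to implement the full OpenAI request schema:

```python
# Sketch of an OpenAI-style completions endpoint that streams server-sent
# events via llama-cpp-python. The model path is an assumption about image
# layout; a real server would implement the full OpenAI request schema.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_cpp import Llama
from pydantic import BaseModel

app = FastAPI()
llm = Llama(model_path="/models/model.bin")  # baked into the image (assumption)


class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


@app.post("/v1/completions")
def completions(req: CompletionRequest):
    def event_stream():
        # With stream=True, llama-cpp-python yields OpenAI-shaped chunks.
        for chunk in llm(req.prompt, max_tokens=req.max_tokens, stream=True):
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

Worth noting: llama-cpp-python also ships an OpenAI-compatible server (`python -m llama_cpp.server`), so a custom FastAPI layer may not be necessary for llama.cpp-backed models.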

Closing thoughts

The proposal aims to leverage tools like CTranslate2, llama.cpp, and llama2.c to make large language models more accessible and efficient to serve. By developing dedicated transformation images and pairing them with server images, we can provide robust, flexible solutions that cater to a wide range of use cases and hardware specifications. This approach aligns with our goals of promoting inclusivity, efficiency, and innovation in the AI community.

Replaces: https://github.com/substratusai/model-falcon-40b/issues/2, https://github.com/substratusai/model-falcon-40b/issues/1