substratusai / images

Official Substratus Container Images

Create model transformation and serving images for efficient serving #18

Open brandonjbjelland opened 1 year ago


Description

CTranslate2, llama.cpp, llama2.c, and similar projects are incredibly useful for inference on lower-end GPUs, or in many cases even CPUs. In particular, we could use these derivative models:

  1. For local use cases where only an older GPU, or no GPU at all, is available.
  2. As a speed and cost optimization when running models on a cloud provider.

For the models that support these transformations, we get both benefits from a single conversion step.
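To make that concrete, here is a minimal sketch of CPU inference against a model already converted to CTranslate2 format. The model directory and tokenizer name are placeholders for illustration, not anything substratus ships today:

```python
# Minimal sketch: CPU inference with a CTranslate2-converted model.
# "llama2-ct2/" and the tokenizer name are placeholders for illustration.
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
generator = ctranslate2.Generator("llama2-ct2", device="cpu", compute_type="int8")

# CTranslate2 takes string tokens, not ids, as generation input.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("The capital of France is"))
results = generator.generate_batch([tokens], max_length=32)
print(tokenizer.decode(results[0].sequences_ids[0]))
```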

Proposal: Dedicated Translation Model Images

Objective

Create a dedicated model image designed to run against models that are already established in substratus. The image will support a known list of models or model architectures and will enable efficient execution on more modest hardware.

Implementation Steps

  1. Model Selection: Identify and select the translation techniques and supported models that are most relevant and commonly used within the community.
  2. Conversion Process: Develop a conversion process (similar to a model loader image) that takes an existing model (either a base model loaded from HF or a fine-tuned model) and produces a new, optimized model suitable for running on lower-end GPUs or CPUs (see the sketch after this list).
  3. Make these derivative models easy to identify: We need an easy way to identify models that have been translated, since downstream usage of them within servers will differ.
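As a rough sketch of step 2 (with the tagging from step 3 folded in), the conversion could use CTranslate2's Transformers converter and drop a marker file next to the output. The `substratus-transform.json` convention is hypothetical; substratus defines no such file today:

```python
# Sketch of the conversion step, assuming CTranslate2 as the backend.
# The "substratus-transform.json" marker is a hypothetical convention for
# flagging derivative models (step 3); substratus defines no such file yet.
import json
from pathlib import Path

from ctranslate2.converters import TransformersConverter


def convert_model(model_dir: str, output_dir: str, quantization: str = "int8") -> None:
    """Convert an HF-format model to CTranslate2 and record its provenance."""
    TransformersConverter(model_dir).convert(output_dir, quantization=quantization, force=True)

    # Tag the output so servers can distinguish it from the base model.
    marker = {
        "source_model": model_dir,
        "format": "ctranslate2",
        "quantization": quantization,
    }
    Path(output_dir, "substratus-transform.json").write_text(json.dumps(marker, indent=2))
```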

A potential path: we could roll all of these tools (llama.cpp, llama2.c, CTranslate2, et al.) into a single model-transformer image that can automatically detect, for a given base model, how to translate it, or whether translation is possible at all.
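That detection could key off the `architectures` field in the model's HF config. A sketch of the dispatch, where the architecture-to-backend map is illustrative and far from exhaustive:

```python
# Hypothetical backend dispatch for a combined model-transformer image.
# The architecture-to-backend map is illustrative, not exhaustive.
import json
from pathlib import Path
from typing import Optional

BACKENDS = {
    "LlamaForCausalLM": "llama.cpp",
    "FalconForCausalLM": "ctranslate2",
    "GPTNeoXForCausalLM": "ctranslate2",
}


def pick_backend(model_dir: str) -> Optional[str]:
    """Return a translation backend for the model, or None if unsupported."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    for arch in config.get("architectures", []):
        if arch in BACKENDS:
            return BACKENDS[arch]
    return None  # translation is not possible with the known toolset
```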

Pairing with Model Server Images

Create model server images that can run models that have been converted through the aforementioned processes. This ensures that the models can be deployed and run in various environments, such as local machines or cloud providers.

Implementation Steps

  1. Server Selection: Some of these converted models may be servable through basaran, or possibly none of them. Alternatively, we could build a server on fastapi that emits server-sent events - potentially this already exists, unsure. I think it'd be good if we continued our path of supporting the OpenAI API here (e.g., v1/completions); a sketch follows this list. The UI is a nice-to-have. A potential project that can help with llama.cpp models: https://github.com/abetlen/llama-cpp-python
  2. Make the work available: Publish the server images on Docker Hub.
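For step 1, here is a minimal sketch of an OpenAI-style v1/completions endpoint that streams server-sent events via llama-cpp-python. The model path is an assumption about how the image would be laid out, and a real server would need to implement the full OpenAI request schema:

```python
# Sketch of an OpenAI-style completions endpoint that streams server-sent
# events via llama-cpp-python. The model path is an assumption about image
# layout; a real server would implement the full OpenAI request schema.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_cpp import Llama
from pydantic import BaseModel

app = FastAPI()
llm = Llama(model_path="/models/model.bin")  # baked into the image (assumption)


class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


@app.post("/v1/completions")
def completions(req: CompletionRequest):
    def event_stream():
        # With stream=True, llama-cpp-python yields OpenAI-shaped chunks.
        for chunk in llm(req.prompt, max_tokens=req.max_tokens, stream=True):
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

Worth noting: llama-cpp-python also ships an OpenAI-compatible server (`python -m llama_cpp.server`), so a custom FastAPI layer may not be necessary for llama.cpp-backed models.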

Closing thoughts

The proposal aims to leverage tools like CTranslate2, llama.cpp, and llama2.c to make large language models more accessible and efficient to serve. By developing dedicated transformation images and pairing them with server images, we can provide robust, flexible solutions that cater to a wide range of use cases and hardware specifications. This approach aligns with our goals of promoting inclusivity, efficiency, and innovation in the AI community.

Replaces: https://github.com/substratusai/model-falcon-40b/issues/2, https://github.com/substratusai/model-falcon-40b/issues/1