hyperonym / basaran

Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.
MIT License

Support for `v1/embeddings` endpoint #179

Closed: josephrocca closed this issue 1 year ago

josephrocca commented 1 year ago

I'm not sure how feasible or within-scope this is, but it'd be very useful if the Basaran project were able to implement the v1/embeddings endpoint (using Hugging Face repos, like with the v1/completions endpoint).

Text embeddings are very often used alongside the completion endpoints, and we have this particular requirement for OpenCharacters so we can save and search over the character's "memories".
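That "memories" use case boils down to nearest-neighbor search over embedding vectors. A minimal sketch with NumPy, assuming some embedding model has already produced the vectors (the function and variable names here are hypothetical, not OpenCharacters code):

```python
import numpy as np

def cosine_scores(query_vec, memory_vecs):
    # Cosine similarity between one query vector and each row of a matrix.
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    return m @ q

def search_memories(query_vec, memory_vecs, memories, top_k=3):
    # Return the top_k memories whose embeddings are closest to the query.
    scores = cosine_scores(query_vec, memory_vecs)
    order = np.argsort(scores)[::-1][:top_k]
    return [(memories[i], float(scores[i])) for i in order]
```

This is the part an `/v1/embeddings` endpoint would feed: the server only has to turn text into vectors; ranking stays client-side.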

(And very soon we'll have the same requirement for text-to-image. If it were possible for Basaran to aim to be the OpenAI-compatible, open-source API server, that would be awesome.)

peakji commented 1 year ago

Basaran's current goal is to ensure compatibility with both the text completion and chat completion APIs, which actually share the same model. To support embeddings, we would need another (or a set of) model(s), which is actually how OpenAI API does it.

From an architectural perspective, a GPT-like decoder-only model is not the best choice for obtaining embeddings. I suggest options like SentenceTransformers or the Universal Sentence Encoder. Either way, users will need to deploy additional services, so perhaps the embeddings feature is indeed beyond the scope of the Basaran project?

Electrofried commented 1 year ago

Here is the thing: from an end user's perspective, for this to be considered a "functional" drop-in replacement for the OpenAI API, it needs to replicate all of the API's main features. I understand that the goal is to replicate only the completion part of the API, but for the vast majority of implementations that will simply make the project useless and force the end user to implement additional measures anyway.

What many (most?) people want is to be able to load up this Docker image, point any application that currently uses the OpenAI API at it, and have it "just work". I understand this may be outside the scope of the project, but I honestly hope you will consider expanding the scope to cover the entire API. Without that, the use cases are extremely limited.

Please do not take this input the wrong way, I certainly appreciate the hard work that people have been doing on this project and hope it continues to thrive.

josephrocca commented 1 year ago

> To support embeddings, we would need another (or a set of) model(s), which is actually how OpenAI API does it.

Yep, of course.

> What many (most?) people want is to be able to load up this docker image, and then redirect any application that currently uses OpenAI api and have it 'just work'.

Yes, this is exactly the use case in OpenCharacters. We need a "one-stop-shop" because the UX for directing users away from the closed ecosystem would otherwise be a 10 step process instead of just "install docker and run this command". It needs to be that simple, and Basaran has been awesome for this so far. It seems like Basaran has enough momentum that it could be the open-source ML server.

Also, I just want to emphasise that the Hugging Face repo based approach is excellent and should be maintained for any other APIs (embedding, text-to-image, etc.)

peakji commented 1 year ago

@Electrofried @josephrocca I completely understand and agree with your point of view! The current difficulty actually comes from the architecture: supporting embeddings requires deploying additional models, which are much smaller than LLMs but still require significant resources. As a result, Basaran may no longer be a simple Docker image, but rather multiple services behind a router, with increased deployment complexity. For example:

[user]  -->  [basaran-nginx]  --  /v1/completions  -->  [basaran-llm]
                              --  /v1/chat         -->  [basaran-llm]
                              --  /v1/embeddings   -->  [basaran-embedding]
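The dispatch in that diagram could be sketched as a simple prefix table (a hedged sketch only; the route map and backend names are placeholders, not a real Basaran component):

```python
# Hypothetical path-based dispatch table for the router sketched above.
ROUTES = {
    "/v1/completions": "basaran-llm",
    "/v1/chat": "basaran-llm",
    "/v1/embeddings": "basaran-embedding",
}

def route(path):
    # Pick the upstream service by the longest matching path prefix,
    # so "/v1/chat/completions" still lands on the chat backend.
    matches = [p for p in ROUTES if path.startswith(p)]
    if not matches:
        raise ValueError(f"no upstream for {path}")
    return ROUTES[max(matches, key=len)]
```

In practice this role would likely be filled by nginx or a similar reverse proxy rather than application code.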
peakji commented 1 year ago

We plan to focus on achieving compatibility with the chat API in the short term. In the long term, we may start a router project that provides a complete replacement for the OpenAI API, where Basaran is one of the backends. This will also enable model selection using the model parameter.

This could be the beginning of a whole new ecosystem!

josephrocca commented 1 year ago

This sounds perfect! I was actually going to open a separate issue about allowing multiple LLM models behind the same API (i.e. with a single Docker command, but with multiple user/repos specified), and the client uses the model parameter to choose which one. IIUC this would open up the possibility for something like that.

(Maybe even an option for lazily-loaded models so you don't specify MODEL in the docker command - you just make a request, and it loads whatever model you requested on-the-fly. Useful for testing, at the very least.)
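That lazy-loading idea can be sketched as an on-demand model cache keyed by the `model` parameter. This is a hypothetical illustration only: `load_model` stands in for something like transformers' `from_pretrained`, and the cache size is an arbitrary choice:

```python
from functools import lru_cache

def load_model(repo_id):
    # Placeholder for an expensive load, e.g.
    # AutoModelForCausalLM.from_pretrained(repo_id).
    return {"repo": repo_id}

@lru_cache(maxsize=4)  # bound memory by evicting least-recently-used models
def get_model(repo_id):
    # First request for a repo loads it; later requests reuse the instance.
    return load_model(repo_id)
```

The first request for each Hugging Face user/repo pays the load cost; subsequent requests hit the cache, which is what makes ad-hoc testing of many models tolerable.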

Either way, I'm very excited for this - it would be awesome if the open source ML ecosystem had a full, drop-in replacement for OpenAI APIs.

josephrocca commented 1 year ago

> Maybe even an option for lazily-loaded models so you don't specify MODEL in the docker command - you just make a request, and it loads whatever model you requested on-the-fly. Useful for testing, at the very least.

@peakji Actually, this might be a really important feature. I can imagine a cloud service (perhaps run by the Basaran project itself as an open-core startup) where I can just tell my OpenCharacters users to swap the api.openai.com URL in their settings for api.basaran.com and then everything works exactly the same, and you just specify the huggingface user/repo as the model parameter. This would be really awesome, because a lot of my users don't have the hardware or the know-how to run Basaran locally, but they still want to play with various open source models.

JacoBezuidenhout commented 1 year ago

+1 for /embeddings :)

josephrocca commented 1 year ago

Related project: https://github.com/closedai-project/closedai. They're also intending to add embeddings, image generation, etc., but currently they only support completion and chat completion. There might be room for some collaboration here.

hewr1993 commented 1 year ago

How about adding the following lines here?

def get_embeddings(self, input_ids):
    # Accept a single sequence by promoting it to a batch of one.
    if input_ids.ndim == 1:
        input_ids = input_ids[None, :]
    # Run only the base transformer (no LM head) to get hidden states.
    outs = self.model.base_model.forward(input_ids, return_dict=False)
    features = outs[0].float()  # last hidden state: (batch, seq, hidden)
    # Mean-pool over the sequence dimension; assumes batch size 1.
    return features.mean(dim=1)[0]

Reference: https://github.com/UKPLab/sentence-transformers/blob/214498f/sentence_transformers/SentenceTransformer.py#L809
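One caveat with the snippet above: it averages over every position, including padding, whereas the referenced sentence-transformers pooling masks padding out first. A NumPy sketch of that masked variant (shapes and names are illustrative, not Basaran internals):

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    # hidden_states: (batch, seq, hidden); attention_mask: (batch, seq).
    # Zero padding positions before averaging so pad tokens don't
    # dilute the sentence embedding.
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid divide-by-zero
    return summed / counts
```

With batch size 1 and no padding, this reduces to the plain `mean(dim=1)` in the snippet above; the difference only shows up for batched, padded inputs.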

peakji commented 1 year ago

@hewr1993 It may not be a good idea to use a text generation model to obtain embeddings. I will explain in detail below.

peakji commented 1 year ago

We have decided not to add support for embeddings in Basaran:

Currently, Basaran is designed to serve only one model per process. Considering the capabilities of commodity hardware and the size of the latest models, we believe this is a reasonable design. The primary goal of Basaran is to provide text completion capabilities (and soon-to-be-added chat completion), which generally require a decoder-only or encoder-decoder architecture.

However, the best practice for embedding models is to use a Transformer encoder. Therefore, it is challenging to reuse the same model to support both completion and embedding. In fact, the OpenAI embedding model, sometimes referred to as GPT-3 embedding, is actually a separate encoder model initialized with the weights of GPT or Codex[1].

In addition to the limitations imposed by the model structure, we currently cannot achieve full compatibility with OpenAI's embedding API using open source models: OpenAI's latest model, text-embedding-ada-002, has successfully replaced the four previous models used for different scenarios, thereby providing a simple unified embedding API. However, the models in the open-source community are currently not as versatile. They either require different models for symmetric or asymmetric tasks or need specific instructions to adapt to different domains[2].
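The "specific instructions" pattern from [2] means the same text gets a different embedding depending on a task instruction prepended to it. A tiny illustrative sketch, where `embed` stands in for a real encoder and the instruction strings are only examples:

```python
def embed_with_instruction(embed, instruction, text):
    # INSTRUCTOR-style models consume an (instruction, text) pair;
    # here it is approximated as simple string concatenation.
    return embed(f"{instruction} {text}")

# Asymmetric retrieval uses different instructions on each side:
QUERY_INST = "Represent the question for retrieving supporting documents:"
DOC_INST = "Represent the document for retrieval:"
```

This is exactly the versatility gap: a unified endpoint like text-embedding-ada-002 needs no such per-task instruction, so an open-source drop-in cannot simply hide one model behind the same API shape.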

Therefore, considering the engineering and research limitations, we have decided not to add support for embeddings in Basaran. In the future, we may initiate a new project that acts as a router to fully support all OpenAI APIs. Multiple Basaran instances (or other inference services) can be mounted to achieve load balancing and model selection at that time.

References:

  1. Neelakantan, Arvind, et al. "Text and code embeddings by contrastive pre-training." arXiv:2201.10005 (2022).
  2. Su, Hongjin, et al. "One Embedder, Any Task: Instruction-Finetuned Text Embeddings." arXiv:2212.09741 (2022).