refuel-ai / autolabel

Label, clean and enrich text datasets with LLMs.
https://docs.refuel.ai/
MIT License

Use embeddings from other LLMs besides OpenAI for fewshot learning #370

Closed · alexisraykhel closed this issue 1 year ago

alexisraykhel commented 1 year ago

https://github.com/refuel-ai/autolabel/blob/3eb3e8139028e454c7f51fcdfeba5579b7925228/src/autolabel/few_shot/__init__.py#L48

Currently, for few-shot learning, if your config specifies any provider other than OpenAI, it will error out because the library always tries to get embeddings from OpenAI.

Either add the ability to use embeddings from other sources or surface a more informative error message if someone tries to use this feature without OpenAI.
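
For reference, a minimal sketch (as a Python dict) of the kind of config that hits this path; the model and prompt values below are illustrative assumptions, not a verified config:

    # Any non-OpenAI LLM provider combined with semantic_similarity example
    # selection currently routes the embedding call through OpenAI.
    config = {
        "task_name": "ToxicCommentClassification",
        "task_type": "classification",
        "model": {
            "provider": "anthropic",   # non-OpenAI LLM provider
            "name": "claude-v1",       # placeholder model name
        },
        "prompt": {
            "few_shot_selection": "semantic_similarity",  # triggers embedding lookup
            "few_shot_num": 5,
        },
    }
    # With no OpenAI credentials configured, building the example selector for
    # this config fails with a missing OpenAI API key error.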

nihit commented 1 year ago

Thanks for opening this @alexisraykhel. Yep, definitely on our roadmap - we just included this in our weekly sprint for the current week.

As a first step, we are planning to support local embedding generation using sentence-transformers (https://www.sbert.net/).
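
For reference, generating embeddings locally with sentence-transformers looks roughly like this (a sketch of the sentence-transformers library itself, not of autolabel's integration; all-mpnet-base-v2 is just an example model):

    from sentence_transformers import SentenceTransformer

    # Download (on first use) and load a local embedding model.
    model = SentenceTransformer("all-mpnet-base-v2")

    examples = [
        "This comment is perfectly friendly.",
        "This comment is clearly toxic.",
    ]
    # encode() returns one embedding vector per input string.
    embeddings = model.encode(examples)
    print(embeddings.shape)  # (2, 768) for all-mpnet-base-v2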

nihit commented 1 year ago

Config addition:

  1. Embedding model related specification will be part of a new "embedding" top-level key.
  2. For now, this key can have two keys underneath: provider and model.

    {
        "task_name": "ToxicCommentClassification",
        "task_type": "classification",
        "dataset": {...},
        "model": {...},
        "embedding": {
            "provider": "<string>",
            "model": "<string>"
        },
        "prompt": {...}
    }

Expected behavior:

  1. The "embedding" key is optional. If it is explicitly specified, we will use the embedding model specified there.
  2. If it is not specified, the embedding model we use should by default be the one from the LLM provider.
  3. If the LLM provider does not offer an embedding model (e.g. Anthropic, Refuel LLM), we can use OpenAI as the default embedding provider (see the sketch below).
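
A minimal sketch of this fallback order in Python (function and constant names here are illustrative, not autolabel's actual implementation):

    # Providers assumed to ship their own embedding models (illustrative set).
    PROVIDERS_WITH_EMBEDDINGS = {"openai", "google"}
    DEFAULT_EMBEDDING_PROVIDER = "openai"

    def resolve_embedding_provider(config: dict) -> str:
        embedding_cfg = config.get("embedding")
        if embedding_cfg and embedding_cfg.get("provider"):
            # 1. An explicit "embedding" key wins.
            return embedding_cfg["provider"]
        llm_provider = config.get("model", {}).get("provider")
        if llm_provider in PROVIDERS_WITH_EMBEDDINGS:
            # 2. Otherwise, use the LLM provider's own embedding model.
            return llm_provider
        # 3. Providers without embedding models (e.g. Anthropic, Refuel LLM)
        #    fall back to OpenAI.
        return DEFAULT_EMBEDDING_PROVIDER
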
bawinogr commented 1 year ago

Hi, I'm also interested in this feature. It looks like this is an active issue, but based on the comment above I tried using PaLM as the embedding provider and it doesn't work yet; it still looks for the openai_api_key.

nihit commented 1 year ago

hey @bawinogr, thanks for your comment and interest in this library! Yes, we are working on adding support for other embedding providers - PR #404

If you haven't already, also consider joining our Discord: https://discord.gg/fweVnRx6CU

nihit commented 1 year ago

@alexisraykhel @bawinogr we just merged https://github.com/refuel-ai/autolabel/commit/f33271366c78dea83afd19392cc410144793461f -- this adds support for multiple embedding providers:

  1. OpenAI (text-embedding-ada-002)
  2. Google (textembedding-gecko@001)
  3. Huggingface (internally this uses the sentence-transformers library: https://github.com/UKPLab/sentence-transformers, with all-mpnet-base-v2 as the default embedding model)

We're updating our docs to share a more comprehensive overview, but if you'd like to get started, you can do the following:

  1. git clone and pull the latest main from https://github.com/refuel-ai/autolabel
  2. pip install ".[all]" (this will download and install the library with all the extras)
  3. If you're using semantic_similarity as the few-shot selection technique, you can specify the relevant embedding model parameters (very similar to the "model" key in the config):

    "embedding": {
        "provider": "huggingface_pipeline",
        "model": "sentence-transformers/all-mpnet-base-v2"
    }

or

    "embedding": {
        "provider": "google",
        "model": "textembedding-gecko@001"
    }
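
For a fuller picture, here is a minimal end-to-end sketch assuming autolabel's LabelingAgent entry point (the dataset file, labels, and other non-embedding config values are placeholders, some required config sections are omitted for brevity, and method names may differ slightly between versions):

    from autolabel import LabelingAgent

    config = {
        "task_name": "ToxicCommentClassification",
        "task_type": "classification",
        "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
        # New top-level key from this change: local embeddings via sentence-transformers.
        "embedding": {
            "provider": "huggingface_pipeline",
            "model": "sentence-transformers/all-mpnet-base-v2",
        },
        "prompt": {
            "task_guidelines": "Classify each comment as toxic or not toxic.",
            "labels": ["toxic", "not toxic"],
            "few_shot_selection": "semantic_similarity",
            "few_shot_num": 5,
            # "few_shot_examples", "example_template", the "dataset" section,
            # etc. are omitted here for brevity.
        },
    }

    agent = LabelingAgent(config)
    agent.plan("seed.csv")  # dry run: estimated cost and sample prompts
    agent.run("seed.csv")   # labels the data using semantic-similarity few-shot examples
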
nihit commented 1 year ago

@alexisraykhel @bawinogr take a look at https://docs.refuel.ai/guide/llms/embeddings/ for details on embedding model support - thank you for the suggestions and feedback here!