[Closed] alexisraykhel closed this issue 1 year ago
Thanks for opening this @alexisraykhel. Yep, definitely on our roadmap - we just included this in our weekly sprint for the current week.
As a first step, we are planning to support local embedding generation using sentence-transformers (https://www.sbert.net/).
Config addition:

```json
{
  "task_name": "ToxicCommentClassification",
  "task_type": "classification",
  "dataset": {...},
  "model": {...},
  "embedding": {
    "provider": "<string>",
    "model": "<string>"
  },
  "prompt": {...}
}
```
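For concreteness, a fleshed-out version of the config sketched above might look like the following. This is a hypothetical example: the dataset, model, and prompt values are illustrative placeholders, not values from this thread, so check the autolabel docs for the exact keys your task requires.

```python
# Hypothetical, fleshed-out version of the proposed config.
# Dataset/model/prompt values below are illustrative assumptions.
config = {
    "task_name": "ToxicCommentClassification",
    "task_type": "classification",
    "dataset": {
        "label_column": "label",   # assumed column name
        "delimiter": ",",
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo",
    },
    # The new "embedding" section proposed in this issue:
    "embedding": {
        "provider": "huggingface_pipeline",
        "model": "sentence-transformers/all-mpnet-base-v2",
    },
    "prompt": {
        "task_guidelines": "Classify the comment as toxic or not toxic.",
        "few_shot_selection": "semantic_similarity",
        "few_shot_num": 4,
    },
}
```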
Expected behavior:
Hi, I'm also interested in this feature. This looks like an active issue, but based on the comment above I tried using PaLM as the embedding provider and it doesn't work yet: it still looks for the openai_api_key.
hey @bawinogr, thanks for your comment and interest in this library! Yes, we are working on adding support for other embedding providers - PR #404
If you haven't already, also consider joining our Discord: https://discord.gg/fweVnRx6CU
@alexisraykhel @bawinogr we just merged https://github.com/refuel-ai/autolabel/commit/f33271366c78dea83afd19392cc410144793461f -- this adds support for multiple embedding providers:
- sentence-transformers (library: https://github.com/UKPLab/sentence-transformers), with all-mpnet-base-v2 as the default embedding model

We're updating our docs to share a more comprehensive overview, but if you'd like to get started, you can do the following: with semantic_similarity as the few-shot selection technique, you can specify the relevant embedding model parameters (very similar to the "model" key in the config):

```json
"embedding": {
  "provider": "huggingface_pipeline",
  "model": "sentence-transformers/all-mpnet-base-v2"
}
```

or

```json
"embedding": {
  "provider": "google",
  "model": "textembedding-gecko@001"
}
```
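To illustrate what semantic_similarity few-shot selection does with these embeddings, here is a minimal sketch of the underlying idea: rank the seed examples by cosine similarity to the query embedding and pick the top k. This is a conceptual illustration, not autolabel's actual implementation.

```python
import numpy as np

def select_few_shot(query_emb, example_embs, k=2):
    """Return indices of the k examples most similar to the query,
    ranked by cosine similarity (the idea behind semantic_similarity
    few-shot selection; not autolabel's real code)."""
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarity per example
    return np.argsort(-sims)[:k]     # indices, most similar first

# Toy 2-d "embeddings" standing in for real model output:
query = np.array([1.0, 0.0])
examples = np.array([[0.9, 0.1],
                     [0.0, 1.0],
                     [0.7, 0.7]])
print(select_few_shot(query, examples, k=2))  # → [0 2]
```

In practice the embeddings would come from the configured provider (e.g. all-mpnet-base-v2 locally, or textembedding-gecko@001 via Google), but the selection step is the same ranking.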
@alexisraykhel @bawinogr take a look at https://docs.refuel.ai/guide/llms/embeddings/ for details on embedding model support - thank you for the suggestions and feedback here!
https://github.com/refuel-ai/autolabel/blob/3eb3e8139028e454c7f51fcdfeba5579b7925228/src/autolabel/few_shot/__init__.py#L48
Currently, for few-shot learning, if your config specifies any provider other than OpenAI, it will error out because it always tries to get embeddings from OpenAI.
Either add the ability to use embeddings from other providers, or surface a more informative error message when someone tries to use this feature without OpenAI.
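The "more informative error" half of that suggestion could be sketched as a small dispatch step that validates the configured provider up front. The provider names below mirror the config examples earlier in this thread; the function itself is a hypothetical illustration, not autolabel's real API.

```python
# Sketch of the suggested fix: validate the embedding provider from
# the config and fail with an informative error, instead of silently
# falling back to OpenAI and complaining about a missing API key.
# The provider names mirror this thread's config examples; the
# function is illustrative, not autolabel's actual code.
SUPPORTED_EMBEDDING_PROVIDERS = {"openai", "huggingface_pipeline", "google"}

def get_embedding_provider(config: dict) -> str:
    provider = config.get("embedding", {}).get("provider", "openai")
    if provider not in SUPPORTED_EMBEDDING_PROVIDERS:
        raise ValueError(
            f"Unsupported embedding provider '{provider}'. "
            f"Supported providers: {sorted(SUPPORTED_EMBEDDING_PROVIDERS)}. "
            "Few-shot selection with semantic_similarity requires one of "
            "these in the 'embedding' section of your config."
        )
    return provider
```

With this in place, a config pointing at an unsupported provider fails immediately with an actionable message, rather than an opaque openai_api_key error.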