Open Pringled opened 1 week ago
I would like to benchmark this myself first, but I agree the idea mentioned here might be nice for a livestream on the probabl channel. I could explore it there, and if it works out I can always choose to add it here.
Sounds good! Happy to answer any questions about the library.
@Pringled I guess this is the simplest integration path?
@koaning Either option (via Sentence Transformers or directly with Model2Vec) should be easy to integrate. I think using Model2Vec directly is slightly more flexible, since you can call `encode` to get a mean-pooled output and `encode_as_sequence` to get a sequence output (if you want to support multiple aggregation methods like the other supported embedders do), and it requires a few fewer lines of code, e.g.:
```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```
vs
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_multilingual_output")
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```
If you want to use any other functionality from Sentence Transformers, though, then that's definitely the way to go.
I will explore both during the probabl livestream next week and decide afterwards which approach is best. I am also annotating some datasets now so that I have a benchmark.
I will also make another comparison: can scikit-learn pipelines with these embeddings beat an LLM?
Cool! Very curious about the results. I'll try to tune in for the livestream.
I will add the livestream link after the current one, which will also be a fun one by the way.
Great, thanks for the link, I'll check it out!
Hi,
I think https://github.com/MinishLab/model2vec might be a good fit for Embetter. It's a static subword embedder that outperforms both GloVe (300d) and BPEmb (50k, 300d) while being much smaller and faster (results are in the repo).
It can be used like this: