koaning / embetter

just a bunch of useful embeddings
https://koaning.github.io/embetter/
MIT License

Add Model2Vec support #110

Open Pringled opened 1 week ago

Pringled commented 1 week ago

Hi,

I think https://github.com/MinishLab/model2vec might be a good fit for Embetter. It's a static subword embedder that outperforms both GloVe (300d) and BPEmb (50k, 300d) while being much smaller and faster (results are in the repo).

It can be used like this:

from model2vec import StaticModel

# Load a model from the HuggingFace hub (in this case the M2V_base_output model)
model_name = "minishlab/M2V_base_output"
model = StaticModel.from_pretrained(model_name)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
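
To sketch how this might slot into Embetter, here is a hypothetical Model2VecEncoder following the scikit-learn transformer API that Embetter's other text encoders use. The class name and structure are my assumption, not an existing Embetter component; it only assumes the StaticModel.from_pretrained and encode calls shown above.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class Model2VecEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical Embetter-style encoder wrapping a Model2Vec StaticModel.

    The model2vec import is deferred to instantiation time so the class
    definition itself does not require the dependency.
    """

    def __init__(self, name="minishlab/M2V_base_output"):
        self.name = name
        from model2vec import StaticModel  # assumes model2vec is installed
        self.model = StaticModel.from_pretrained(name)

    def fit(self, X, y=None):
        # Static embeddings: nothing to learn, so fit is a no-op.
        return self

    def transform(self, X, y=None):
        # X is a list of strings; encode returns a 2D array of embeddings.
        return self.model.encode(X)
```

Usage would mirror Embetter's other encoders, e.g. dropping it into a scikit-learn pipeline as the featurization step.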
koaning commented 1 week ago

I would like to benchmark this myself first, but I agree that the idea mentioned here might be a nice fit for a livestream on the probabl channel. I can explore it there, and if it works out I can always choose to add it here.

Pringled commented 1 week ago

Sounds good! Happy to answer any questions about the library.

koaning commented 16 hours ago

@Pringled I guess this is the simplest integration path?

https://www.linkedin.com/posts/minish-lab_big-news-model2vec-is-now-officially-integrated-activity-7250399345320103936-JrqY?utm_source=share&utm_medium=member_desktop

Pringled commented 16 hours ago

@koaning Either option (via Sentence Transformers or directly with Model2Vec) should be easy to integrate. I think using Model2Vec directly is slightly more flexible: you can call encode to get a mean-pooled output and encode_as_sequence to get a per-token sequence output (useful if you want to support multiple aggregation methods like the other supported embedders do), and it requires a few fewer lines of code, e.g.:

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

vs

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_multilingual_output")
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

If you want to use any functionality from Sentence Transformers though then that's definitely the way to go.
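
To illustrate the encode vs encode_as_sequence distinction with output shapes, here is a small sketch with mocked per-token embeddings standing in for what encode_as_sequence would return for one sentence; the plain mean used for pooling is my assumption about the aggregation.

```python
import numpy as np

# Mocked per-token embeddings for a single sentence, shaped like an
# encode_as_sequence result: one row per token, one column per dimension.
sequence_output = np.array([
    [0.1, 0.2, 0.3],   # token 1
    [0.3, 0.0, 0.1],   # token 2
    [0.2, 0.4, 0.5],   # token 3
])

# Mean-pooling over the token axis collapses the sequence into a single
# sentence vector, the kind of output encode gives directly.
mean_output = sequence_output.mean(axis=0)
print(mean_output)  # [0.2 0.2 0.3]
```

Keeping the sequence output around is what makes alternative aggregations (max, first token, etc.) possible on the Embetter side.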

koaning commented 16 hours ago

I will explore both during the probabl livestream next week and decide afterwards which approach is best. I am also annotating some datasets now so that I have a benchmark.

I will also make another comparison: can scikit-learn pipelines with these embeddings beat an LLM?
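
That comparison could look something like the sketch below. To keep it runnable without model downloads, CountVectorizer stands in for the embedding step; in the actual experiment the first pipeline stage would be an embedding transformer backed by Model2Vec, and the dataset would be the annotated benchmark data, not this toy example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny toy dataset purely for illustration.
texts = ["great movie", "terrible movie", "great plot", "terrible plot"]
labels = [1, 0, 1, 0]

# The CountVectorizer step is a stand-in; swap in a Model2Vec-based
# embedding transformer for the real comparison against an LLM.
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

print(pipe.predict(["great acting"]))  # expected: [1]
```

The appeal of this setup is that the whole pipeline trains and predicts in milliseconds on CPU, which is exactly the trade-off being benchmarked against an LLM.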

Pringled commented 15 hours ago

Cool! Very curious about the results. I'll try to tune in for the livestream.

koaning commented 15 hours ago

I will add the livestream link after the current one, which will also be a fun one by the way.

Pringled commented 14 hours ago

Great, thanks for the link, I'll check it out!