huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

Using SetFit Embeddings for Semantic Search? #120

Open Raidus opened 1 year ago

Raidus commented 1 year ago

Hi,

I was wondering: would semantic search improve if one trained a multi-label classification model and used the resulting embeddings?

After training a binary classification model, I noticed that the embeddings of similar topics are much closer in all-MiniLM-L12-v2-setfit (the fitted model) than in the base all-MiniLM-L12-v2, which makes sense to me.

# Cosine similarity between two embedding vectors
from scipy import spatial
from sentence_transformers import SentenceTransformer

# `model` is the fine-tuned SetFit model; `model_sbert` is the base checkpoint
model_sbert = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

def get_cosine_similarity(vector1, vector2):
    sim = 1 - spatial.distance.cosine(vector1, vector2)
    return sim

word_1 = "acne"
word_2 = "red skin"

# Encode single strings so 1-D vectors are returned (scipy's cosine expects 1-D input)
emb_fit_1 = model.model_body.encode(word_1)
emb_fit_2 = model.model_body.encode(word_2)

emb_base_1 = model_sbert.encode(word_1)
emb_base_2 = model_sbert.encode(word_2)

print(f"{word_1} vs {word_2} (base)", get_cosine_similarity(emb_base_1, emb_base_2))
print(f"{word_1} vs {word_2} (fit)", get_cosine_similarity(emb_fit_1, emb_fit_2))
Output for several word pairs:

acne vs pimple (base) 0.5959747433662415
acne vs pimple (fit) 0.9996786117553711

acne vs red skin (base) 0.36421263217926025
acne vs red skin (fit) 0.9994498491287231

acne vs red car (base) 0.17558744549751282
acne vs red car (fit) 0.0051751588471233845

I would assume that if the model is trained on a multi-label classification task, the embeddings would somehow be clustered according to the labels provided during training. Would that improve semantic search if enough labels are provided during training?

Of course I could train a model and test it, but maybe you have done similar tests and already know whether it works :-)
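
For reference, this is roughly the experiment I have in mind, using the SetFitTrainer API with a multi-label head and then reusing only model.model_body for embeddings. Just a sketch: the texts, label layout, and hyperparameters below are made-up placeholders, not my real data.

# Minimal sketch: train SetFit with a multi-label head, then reuse only the
# fine-tuned Sentence Transformer body for embeddings. Toy data for illustration.
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Tiny multi-label toy set: each label vector is [skincare, haircare]
train_ds = Dataset.from_dict({
    "text": [
        "how to get rid of acne fast",
        "best shampoo for dry hair",
        "pimples and red skin after sun exposure",
        "hair loss remedies that work",
    ],
    "label": [[1, 0], [0, 1], [1, 0], [0, 1]],
})

model = SetFitModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L12-v2",
    multi_target_strategy="one-vs-rest",  # multi-label classification head
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,  # number of contrastive pairs generated per sample
    batch_size=16,
)
trainer.train()

# Only the fine-tuned Sentence Transformer body is needed for semantic search
embeddings = model.model_body.encode(["acne", "red skin", "red car"])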

Thanks!

hanshupe commented 1 year ago

I am very interested in this topic too - I am planning to use only the embedding fine-tuning part and then use those embeddings for semantic search. Any thoughts?
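
Concretely, I am thinking of something like the sketch below: reuse only model.model_body together with sentence_transformers.util for retrieval. The corpus and query here are just placeholders.

# Sketch: semantic search with the fine-tuned SetFit body and
# sentence_transformers.util (corpus and query are placeholders).
from sentence_transformers import util

corpus = [
    "cream that helps against acne",
    "moisturizer for red, irritated skin",
    "engine oil for a red sports car",
]
corpus_emb = model.model_body.encode(corpus, convert_to_tensor=True)

query_emb = model.model_body.encode("treatment for pimples", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])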

Raidus commented 1 year ago

I reduced the dimensions with UMAP and visualized the embeddings of the training set for all-MiniLM-L12-v2 vs. all-MiniLM-L12-v2-setfit (the fitted model). Then I highlighted every text that includes "acne" or "pimple"; the green points are the texts that include neither. The actual task was binary classification of whether a text is related to skincare or not.

It looks like the model "learned" that "acne" and "pimple" are very close: their embeddings are closer on average after fitting the model on the training data. I did not calculate the average distance between those embeddings, but visually they appear closer together.

That tells me that even after binary classification the embeddings could be used to improve semantic search. I'll run another test with multi-label classification, but creating the training set needs some data wrangling. When I find some time to run the test, I'll post the results here.

[Image: setfit_vs_raw — UMAP projection of the training-set embeddings, base model vs. SetFit-fitted model]
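
A comparison plot like this can be produced roughly as follows (only a sketch: assumes umap-learn and matplotlib are installed, and the toy texts stand in for the real training set; model and model_sbert are the models from above).

# Sketch of a side-by-side UMAP projection, base vs. SetFit-fitted embeddings
import umap
import matplotlib.pyplot as plt

texts = [
    "how to treat acne",
    "pimple cream reviews",
    "soothing gel for red skin",
    "best running shoes",
    "cheap flights to Rome",
    "engine oil for a red car",
]  # placeholder for the real training set
keywords = ("acne", "pimple")
colors = ["red" if any(k in t for k in keywords) else "green" for t in texts]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, encoder) in zip(
    axes, [("base", model_sbert.encode), ("setfit", model.model_body.encode)]
):
    # Reduce the embeddings to 2-D and color by keyword membership
    emb_2d = umap.UMAP(n_components=2, n_neighbors=5, random_state=42).fit_transform(encoder(texts))
    ax.scatter(emb_2d[:, 0], emb_2d[:, 1], c=colors, s=15)
    ax.set_title(name)
plt.show()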

pleonova commented 1 year ago

This is super neat! Thanks for sharing the UMAP comparison @Raidus!

Tangential question: are you uploading your model to the HF Hub, or are you storing the fine-tuned model locally and then loading it to get the embeddings?
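
For context, I believe either option boils down to something like this (the repo name and local path below are placeholders):

# Option 1: push the fine-tuned model to the Hugging Face Hub (requires login)
trainer.push_to_hub("my-username/setfit-skincare-minilm")
model = SetFitModel.from_pretrained("my-username/setfit-skincare-minilm")

# Option 2: save it locally and reload from disk
model.save_pretrained("./setfit-skincare-minilm")
model = SetFitModel.from_pretrained("./setfit-skincare-minilm")

# Either way, embeddings come from the fine-tuned Sentence Transformer body
embeddings = model.model_body.encode(["acne", "red skin"])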

tomaarsen commented 1 year ago

Very interesting experimental results. Out of curiosity, the model_sbert/all-MiniLM-L12-v2 SentenceTransformer is not finetuned on the data, right?

karndeepsingh commented 1 year ago

Hi, how can I train a SetFit model for semantic search if I don't have labeled data (let's say I only have product descriptions)? How can I use the SetFit trainer to create positive and negative samples? As per the Hugging Face blog it needs a few labels to train, right? (Correct me if I am wrong.) Please help me understand how I can use just the product descriptions to train a SetFit model and then use it on my queries for semantic search.

Thanks
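
From my reading of the blog, the labels are only used to build contrastive pairs, roughly like the illustration below (my own sketch of the idea, not the library's internal code), which is why I am unsure how to proceed without any labels.

# Illustration of the pairing idea behind SetFit's contrastive step: labels
# decide which texts form positive pairs (same label) and which form negative
# pairs (different labels). Toy data for illustration only.
from itertools import combinations

labeled = [
    ("acne cream for sensitive skin", "skincare"),
    ("soothing gel for red skin", "skincare"),
    ("alloy wheels for a red car", "automotive"),
]

pairs = []
for (text_a, label_a), (text_b, label_b) in combinations(labeled, 2):
    similarity = 1.0 if label_a == label_b else 0.0  # target for CosineSimilarityLoss
    pairs.append((text_a, text_b, similarity))

for p in pairs:
    print(p)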

mrzaizai2k commented 1 month ago

(Quoting @karndeepsingh's question above.)

I have the same question. How can I fine-tune the embedding model for my RAG setup? I need a fast way to do this on my custom dataset.