osanseviero / hackerllama

My personal site
Apache License 2.0

hackerllama/blog/posts/sentence_embeddings/ #4

Open utterances-bot opened 7 months ago

utterances-bot commented 7 months ago

hackerllama - Sentence Embeddings

Everything you wanted to know about sentence embeddings (and maybe a bit more)

https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/

songxujay commented 7 months ago

Hello, author, many thanks for your explanation. You said "We’ll start using all-MiniLM-L6-v2. It’s not the best open-source embedding model". Which model is the best, and how can I find a list of the best models? I am new to this, so sorry for the basic question. Thank you very much!

0ENZO commented 7 months ago

Hi @songxujay, the author covers this in the "Selecting and evaluating models" section, so have a look at it. One of the main sources is still the MTEB Leaderboard: https://huggingface.co/spaces/mteb/leaderboard

arnoldlayne0 commented 7 months ago

Hi! Thank you for the great article. To better understand the differences between word2vec- and Transformer-based embeddings, could you elaborate on how BERT's masked language modelling objective differs from the CBOW objective in word2vec (which, as I understand it, is also about "filling in a blank")? Is it that the objectives are similar but the neural network architectures differ, allowing BERT to add contextual information?

osanseviero commented 7 months ago

Hey @arnoldlayne0! Overall you're right, BERT and CBOW objectives have some similarities. Here are some differences

9opsec commented 4 months ago

I think something has changed with the Quora dataset used in the Colab example. I'm getting this error:

from datasets import load_dataset
dataset = load_dataset("quora")["train"]

TypeError: http_get() got an unexpected keyword argument 'displayed_filename'
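
An "unexpected keyword argument" error like this can come from two installed packages being out of step with each other, though the traceback above does not confirm that. As a first diagnostic step, here is a minimal sketch that prints which versions are installed. It assumes `datasets` and `huggingface_hub` are the relevant packages (my assumption, not something stated in the error), and `installed_versions` is a hypothetical helper name:

```python
from importlib import metadata


def installed_versions(packages=("datasets", "huggingface_hub")):
    """Return {package: version string, or None if not installed}.

    The default package names are an assumption about which libraries
    are involved in the error; pass other names to check them instead.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Including this output in a bug report makes it easier for others to reproduce the problem or to spot a version mismatch.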

optimisa96 commented 2 weeks ago

Just what I needed entering the world of LLMs, thank you a lot!