osanseviero / hackerllama

My personal site

Apache License 2.0

hackerllama/blog/posts/sentence_embeddings/ #4

Open utterances-bot opened 7 months ago

utterances-bot commented 7 months ago

hackerllama - Sentence Embeddings

Everything you wanted to know about sentence embeddings (and maybe a bit more)

https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/

songxujay commented 7 months ago

Hello, author, many thanks for your explanation. You said "We’ll start using all-MiniLM-L6-v2. It’s not the best open-source embedding model". Which model is the best, and where can I find a list of the best models? I am new to this, so sorry for the basic question. Thank you very much!

0ENZO commented 7 months ago

Hi @songxujay, the author covers this in the "Selecting and evaluating models" part of the post. Have a look at it. One of the main sources is still the MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard

arnoldlayne0 commented 7 months ago

Hi! Thank you for the great article. To better understand the differences between word2vec- and Transformer-based embeddings, could you elaborate on how the masked language modelling objective of BERT differs from the CBOW objective in word2vec (which, as I understand it, is also about "filling in a blank")? Is it that the objectives are similar but the neural net architectures differ between the two approaches, allowing BERT to add contextual info?

osanseviero commented 7 months ago

Hey @arnoldlayne0! Overall you're right, the BERT and CBOW objectives have some similarities. Here are some differences:

CBOW's context window is fixed, so it doesn't capture the broader context outside the window
CBOW treats all context words the same way, while BERT uses attention mechanisms to weigh each token embedding differently
Because it treats all context words the same way, CBOW has no sense of word order or directionality
BERT can actually mask multiple tokens at the same time
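
The CBOW side of the first three points can be sketched with a toy example. This is only an illustration under stated assumptions: the 2-d vectors are made up, `cbow_context` is a hypothetical helper, and real word2vec learns its vectors rather than hard-coding them. It does not model BERT at all; it only shows that an averaged, fixed-window context ignores both word order and anything outside the window.

```python
# Toy illustration only: the vectors and the helper name are made up.
TOY_VECTORS = {
    "the": [0.1, 0.0],
    "dog": [0.9, 0.2],
    "bit": [0.3, 0.8],
    "man": [0.7, 0.4],
}


def cbow_context(tokens, target_index, window=1):
    """Average the vectors of the tokens within `window` positions of the target."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    context = [tokens[i] for i in range(lo, hi) if i != target_index]
    dim = len(TOY_VECTORS["the"])
    return [sum(TOY_VECTORS[t][d] for t in context) / len(context) for d in range(dim)]


# Same blank in the middle, but the surrounding words swap places:
# the averaged context is identical, so word order is lost.
dog_first = cbow_context(["dog", "bit", "man"], target_index=1)
man_first = cbow_context(["man", "bit", "dog"], target_index=1)
print(dog_first == man_first)

# With window=1, changing a word two positions away changes nothing.
near_a = cbow_context(["the", "dog", "bit", "the", "man"], target_index=2)
near_b = cbow_context(["the", "dog", "bit", "the", "dog"], target_index=2)
print(near_a == near_b)
```

Both comparisons print `True`. A Transformer encoder such as BERT avoids both limitations by adding positional information and letting attention weigh every token in the input.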

9opsec commented 4 months ago

I think something has changed about the quora dataset used in the colab example. I'm getting this error:

from datasets import load_dataset
dataset = load_dataset("quora")["train"]

TypeError: http_get() got an unexpected keyword argument 'displayed_filename'
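
An unexpected-keyword TypeError like this often comes from two installed packages being out of sync rather than from the dataset itself. As a first diagnostic, and assuming (this is not confirmed in the thread) that `datasets` and `huggingface_hub` are the relevant pair, a small standard-library sketch can print what is installed. `installed_versions` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumption: these two packages are the ones involved in the error above.
for name, ver in installed_versions(["datasets", "huggingface_hub"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If the two versions look far apart in age, upgrading both packages together is a reasonable thing to try before digging further.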

optimisa96 commented 2 weeks ago

Just what I needed entering the world of LLMs, thank you a lot!