run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: Different numbers of embeddings with the same local embedding model #9413

Closed: stephanedebove closed this issue 6 months ago

stephanedebove commented 10 months ago

Question

I’m following up on this question which was closed before I had the time to add my code: https://github.com/run-llama/llama_index/issues/9272

I get very different numbers of embeddings depending on whether I use an explicit text_splitter or not.

Using the attached graham.txt file and the basic BAAI/bge-small-en embedding model,

Running this

from llama_index import ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en", max_length=512)
service_context = ServiceContext.from_defaults(
    llm=llm,  # llm defined elsewhere
    embed_model=embed_model,
)

gives me 18 embeddings

Parsing nodes: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  7.09it/s]
Generating embeddings: 100%|███████████████████████████████████████████████████████████| 18/18 [00:06<00:00,  2.86it/s]
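
(For context: the counts above come from an index build step that isn't shown in the snippet. Below is a minimal sketch of what that step typically looks like, assuming graham.txt is loaded from disk with SimpleDirectoryReader and using the same pre-0.10 llama_index API as the rest of this thread; llm=None and the file path are placeholders, not values from the issue.)

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en", max_length=512)
# llm=None is a placeholder; the issue uses an llm defined elsewhere
service_context = ServiceContext.from_defaults(llm=None, embed_model=embed_model)

# Load graham.txt and build the index; show_progress=True is what produces the
# "Parsing nodes" / "Generating embeddings" bars shown above.
documents = SimpleDirectoryReader(input_files=["graham.txt"]).load_data()
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context, show_progress=True
)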

Running this

from llama_index import ServiceContext
from llama_index.text_splitter import TokenTextSplitter
from transformers import AutoTokenizer

text_splitter = TokenTextSplitter(
    chunk_size=512,
    tokenizer=AutoTokenizer.from_pretrained("BAAI/bge-small-en").encode,
)

service_context = ServiceContext.from_defaults(
    llm=llm,  # llm defined elsewhere
    embed_model="local:BAAI/bge-small-en",
    text_splitter=text_splitter,
)

gives me 93 embeddings (and note the warning message)

Parsing nodes:   0%|                                                                             | 0/1 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (16918 > 512). Running this sequence through the model will result in indexing errors
Parsing nodes: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.95s/it]
Generating embeddings: 100%|███████████████████████████████████████████████████████████| 93/93 [00:07<00:00, 11.94it/s]

I am basically just trying to use a local embedding model other than OpenAI. I was originally experimenting with e5-multilingual-large, but I get the same problem with bge-small-en. So what is the correct way to set up a local embedding model?

dosubot[bot] commented 10 months ago

🤖

Hello @stephanedebove,

I'm Dosu, a friendly bot here to assist you while we're waiting for a human maintainer. Whether it's bug troubleshooting, answering questions, or helping you to become a contributor, I'm at your service. Let's tackle this issue together!

The discrepancy in the number of embeddings you're seeing is likely due to the different ways the text is being split in each scenario. When you don't provide a text_splitter, LlamaIndex uses a default SentenceSplitter which splits the text into sentences. This might result in fewer chunks (and thus fewer embeddings) compared to the TokenTextSplitter which splits the text into tokens.

The error message you're seeing is because the sequence length of your text exceeds the maximum sequence length of the model (512 tokens for "BAAI/bge-small-en"). The TokenTextSplitter tries to split your text into chunks of 512 tokens, but if a single sentence in your text exceeds this limit, it will cause an error. This is because the TokenTextSplitter doesn't split sentences across chunks.

To resolve this, you could try using a SentenceSplitter with a smaller chunk_size, or pre-process your text to ensure no single sentence exceeds the model's token limit.
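
A minimal sketch of the first suggestion, assuming the same pre-0.10 llama_index API used elsewhere in this thread; the chunk_size of 256 and chunk_overlap of 20 are illustrative values, not taken from the issue:

from llama_index import ServiceContext
from llama_index.text_splitter import SentenceSplitter

# Sentence-aware splitting with a chunk size well below the model's 512-token limit
text_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20)

service_context = ServiceContext.from_defaults(
    llm=llm,  # llm defined elsewhere
    embed_model="local:BAAI/bge-small-en",
    text_splitter=text_splitter,
)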

As for setting up a local embedding, the process is similar to what you've done. You just need to ensure that the model_name you provide to HuggingFaceEmbedding matches the name of your local model. If your local model is stored in a directory, you should provide the path to that directory as the model_name.

Here's an example:

from llama_index import ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="/path/to/your/model", max_length=512)

service_context = ServiceContext.from_defaults(
    llm=llm,  # llm defined elsewhere
    embed_model=embed_model,
)

In this example, replace "/path/to/your/model" with the actual path to your local model directory.

I hope this helps! Let me know if you have any other questions.


logan-markewich commented 10 months ago

Dug into this more, as the difference is much more egregious with BGE.

A few notes on initial behaviour btw

Now, the real bug here is that the HuggingFace tokenizer always appends the [CLS] and [SEP] tokens to the start and end of each tokenization call.

In the TokenTextSplitter, we are adding up individual words until their token count is over 512. However, it's now counting each word as 3 tokens instead of 1, underfilling chunks to an average size of ~270 tokens and creating many more embeddings.

So, I need to go and figure out how to (nicely) handle the token counting for HuggingFace tokenizers.
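
To make the over-counting concrete, here is a small illustration (an assumption-laden sketch, not the eventual fix in llama_index): the plain encode call appends the special tokens on every invocation, while add_special_tokens=False suppresses them, so a wrapper along these lines could be passed to TokenTextSplitter as the tokenizer if you need an interim fix.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en")

# A single word tokenizes to 3 ids because [CLS] and [SEP] are added...
print(len(tokenizer.encode("hello")))                            # 3
# ...but to 1 id when special tokens are suppressed.
print(len(tokenizer.encode("hello", add_special_tokens=False)))  # 1

# Possible interim workaround: count tokens without the special tokens
def encode_without_special_tokens(text):
    return tokenizer.encode(text, add_special_tokens=False)

Note that this only changes how tokens are counted for chunking; the embedding model itself still adds [CLS]/[SEP] at embedding time.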

logan-markewich commented 10 months ago

as a temp workaround though, really you can just do

from llama_index import ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en", max_length=512)
service_context = ServiceContext.from_defaults(
    llm=llm,  # llm defined elsewhere
    embed_model=embed_model,
    chunk_size=512,
)

Which uses chunk_size=512, the SentenceSplitter, and the gpt-3.5 tokenizer, which is totally fine to do. The actual difference in token counts between BGE and gpt-3.5 is minuscule, and the embedding model will truncate but the LLM will not, so it's usually better to make the LLM happy :)
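
If you want to sanity-check the "minuscule difference" claim on your own document, here is a quick comparison sketch (assumes tiktoken and transformers are installed; the file path is a placeholder):

import tiktoken
from transformers import AutoTokenizer

text = open("graham.txt").read()

# gpt-3.5 token count via tiktoken
gpt35_count = len(tiktoken.encoding_for_model("gpt-3.5-turbo").encode(text))

# BGE token count, excluding [CLS]/[SEP]
# (the long-sequence warning printed here is harmless for counting)
bge_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en")
bge_count = len(bge_tokenizer.encode(text, add_special_tokens=False))

print(gpt35_count, bge_count)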

stephanedebove commented 10 months ago

@logan-markewich is the gpt-3.5 tokenizer under an MIT or Apache 2.0 licence and 100% free to use on my local computer? Basically, the reason I'm doing all this instead of using the default models is that I need models that meet three conditions: 1/ free to use, 2/ licensed for commercial projects, 3/ performance not too bad with non-English languages.

logan-markewich commented 10 months ago

@stephanedebove tiktoken is MIT licensed, free to use as far as I know

https://github.com/openai/tiktoken/blob/main/LICENSE

dosubot[bot] commented 6 months ago

Hi, @stephanedebove,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you encountered a discrepancy in the number of embeddings when using an explicit text splitter versus not using one with a local embedding model. Dosubot provided an explanation for the discrepancy and suggested potential solutions, while logan-markewich identified a bug with the HuggingFace tokenizer and proposed a temporary workaround.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!