Closed stephanedebove closed 6 months ago
🤖
Hello @stephanedebove,
I'm Dosu, a friendly bot here to assist you while we're waiting for a human maintainer. Whether it's bug troubleshooting, answering questions, or helping you to become a contributor, I'm at your service. Let's tackle this issue together!
The discrepancy in the number of embeddings you're seeing is likely due to the different ways the text is being split in each scenario. When you don't provide a `text_splitter`, LlamaIndex uses a default `SentenceSplitter`, which splits the text into sentences. This might result in fewer chunks (and thus fewer embeddings) compared to the `TokenTextSplitter`, which splits the text into tokens.
The error message you're seeing is because the sequence length of your text exceeds the maximum sequence length of the model (512 tokens for `BAAI/bge-small-en`). The `TokenTextSplitter` tries to split your text into chunks of 512 tokens, but if a single sentence in your text exceeds this limit, it will cause an error. This is because the `TokenTextSplitter` doesn't split sentences across chunks.
To resolve this, you could try using a `SentenceSplitter` with a smaller `chunk_size`, or pre-process your text to ensure no single sentence exceeds the model's token limit.
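To illustrate the pre-processing idea, here is a minimal, hypothetical sketch (plain Python, using whitespace word counts as a crude stand-in for a real tokenizer's token counts) that breaks any overlong sentence into pieces below the limit:

```python
# Hypothetical pre-processing sketch: break any sentence whose approximate
# token count exceeds the model limit into smaller pieces. Word count is a
# crude stand-in for a real tokenizer here.
def split_long_sentences(sentences, max_tokens=512):
    out = []
    for sentence in sentences:
        words = sentence.split()
        # emit the sentence in max_tokens-word slices
        for i in range(0, len(words), max_tokens):
            out.append(" ".join(words[i:i + max_tokens]))
    return out
```

In practice you would swap the word count for the embedding model's own tokenizer so the limit is measured in real tokens.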
As for setting up a local embedding, the process is similar to what you've done. You just need to ensure that the `model_name` you provide to `HuggingFaceEmbedding` matches the name of your local model. If your local model is stored in a directory, you should provide the path to that directory as the `model_name`.
Here's an example:
```python
from llama_index import ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="/path/to/your/model", max_length=512)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
)
```
In this example, replace `"/path/to/your/model"` with the actual path to your local model directory.
I hope this helps! Let me know if you have any other questions.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Dug into this more, as the difference is much more egregious with BGE.
A few notes on the initial behaviour, by the way:
The `(16918 > 512)` message is just a standard warning from huggingface. We counted tokens for a sequence that was longer than 512, which is fine; we are only counting tokens in order to split them.
Now, the real bug here is that the huggingface tokenizer always appends the `[CLS]` and `[SEP]` tokens to the start and end of each tokenization call.
In the `TokenTextSplitter`, we are adding up individual words until their token count is over 512. However, it's now counting each word as 3 tokens instead of 1, underfilling to an average chunk size of ~270 and creating many more embeddings.
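To make the arithmetic concrete, here's a toy sketch (not the real llama_index code) of how a BERT-style tokenizer that wraps every call in `[CLS] ... [SEP]` inflates the count when words are tokenized one call at a time:

```python
# Toy model of the bug: a BERT-style tokenizer adds [CLS] and [SEP] to
# *every* call, so counting words with one call each adds 2 phantom
# tokens per word.
def hf_style_token_count(text):
    return len(text.split()) + 2  # +2 for [CLS] and [SEP]

words = ["the", "quick", "brown", "fox"]

# What the splitter was doing: one tokenizer call per word.
per_word = sum(hf_style_token_count(w) for w in words)  # 4 * (1 + 2) = 12

# What it should count: one call over the whole span.
whole_span = hf_style_token_count(" ".join(words))      # 4 + 2 = 6
```

With 3x the apparent tokens per word, a 512-token budget fills after roughly a third of the real capacity, which matches the observed ~270-token average chunks.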
So, I need to go and figure out how to (nicely) handle token counting for huggingface tokenizers.
As a temporary workaround though, you can really just do:
```python
from llama_index import ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en", max_length=512)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=512,
)
```
This uses `chunk_size=512`, the `SentenceSplitter`, and the gpt-3.5 tokenizer, which is totally fine to do. The actual difference in token counts between BGE and gpt-3.5 is minuscule, and the embedding model will truncate but the LLM will not, so it's usually better to make the LLM happy :)
@logan-markewich is the gpt-3.5 tokenizer under an MIT or Apache-2.0 licence and 100% free to use on my local computer? Basically, the reason I'm doing all this instead of using the default models is that I need models meeting three conditions: 1/ free to use, 2/ licensed for commercial projects, 3/ performance not too bad with non-English languages.
@stephanedebove tiktoken is MIT licensed, free to use as far as I know
Hi, @stephanedebove,
I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you encountered a discrepancy in the number of embeddings when using an explicit text splitter versus not using one with a local embedding model. Dosubot provided an explanation for the discrepancy and suggested potential solutions, while logan-markewich identified a bug with the HuggingFace tokenizer and proposed a temporary workaround.
Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!
Question Validation
Question
I’m following up on this question which was closed before I had the time to add my code: https://github.com/run-llama/llama_index/issues/9272
I get very different numbers of embeddings depending on whether I use an explicit text_splitter or not.
Using the attached graham.txt file and the basic BAAI/bge-small-en embedding model:
Running this gives me 18 embeddings.
Running this gives me 93 embeddings (and notice the error message).
I am basically just trying to use a local embedding model other than OpenAI. I was originally experimenting with e5-multilingual-large, but I get the same problem with bge-small-en. So what is the correct way to set up a local embedding?