Closed. @coreation closed this issue 6 months ago.
@coreation from the top of my mind, two immediate thoughts:

1. A `chunk_size` of 4000 is actually quite large. Chunk size is usually set in the area of 256-512 tokens. It would be very hard for an embedding model to represent such a long text in a single semantic representation. This might explain the poor retrieval results.
2. How did you change these parameters? By setting them in a config file? If so, when you tested manually with Python code, did you use the same config file and/or the same parameter values? Otherwise it probably used the built-in defaults.

@igiloh-pinecone thanks for the quick thoughts,
I found it a bit large as well, but that's the default I found in this file. The `chunk_size` there is set to 4000... Am I reading this wrong?
I'm not using the embedding via Canopy directly, as I need to embed some things coming from a database. So I used the splitter (hence the 4000 chunk size) and then did the embedding and storing in Pinecone myself. I'll try 1024, as I've read a couple of articles, amongst which this one, that point out there's no real "good" size, but 1024 seems to be a good sweet spot. I'll re-embed everything and see if things get better. If we could get some feedback on the 4000 `chunk_size` in the langchain_text_splitter, I'm happy to make a small PR changing it to 1024 or 512.
@igiloh-pinecone, it looks like the `chunk_size` in the splitter actually refers to the number of characters. I'm using the OpenAI tokenizer online to see how many tokens my pieces of text have, and they're all around 700 tokens instead of 4000. So I mistook "chunk size" for "token size".
So the code to split up my text uses the defaults described in the LangChain text splitter:

```python
RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
```
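Note that `chunk_size` here counts characters, not tokens, which is the misunderstanding discussed above. A minimal pure-Python sketch of what a character-based splitter with overlap does (this is an illustration, not LangChain's actual implementation):

```python
def split_by_characters(text, chunk_size=4000, chunk_overlap=200):
    """Toy character-based splitter: chunk_size counts CHARACTERS,
    so a 4000-"character" chunk is only roughly 700-1000 English tokens."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "lorem ipsum dolor sit amet " * 500   # 13,500 characters
chunks = split_by_characters(text)

print(len(chunks))                           # 4
print(max(len(c) for c in chunks))           # 4000 (characters, not tokens)
print(chunks[1][:200] == chunks[0][-200:])   # True: consecutive chunks overlap
```

At roughly 4-6 characters per English token, a 4000-character chunk lands in the ~700-1000 token range, which matches the ~700 tokens observed with the online tokenizer.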
@igiloh-pinecone I've received a GitHub comment notification but I don't see your update here on the thread. But to answer the question you wrote: I'm indeed using the LangChain recursive splitter, which I believe takes the parameter in characters, not tokens. If I take random samples of my vectors, the text is around 700 tokens.
Is your suggestion to lower that amount and use the Canopy Chunker?
I noticed your previous message where you stated that you use langchain directly, so I deleted mine as it was irrelevant.
My main point wasn't actually about the chunk size itself (I guess ~700 tokens is workable), but rather a question of how you configure your `canopy server` versus how you configured the direct Python `ChatEngine`. Are you sure you've used the same config / params?
One more suggestion: can you please try repeating the same question 2-3 times in each scenario (server API vs direct Python class)? Could it be that the underlying LLM is simply a bit "noisy", answering the same question differently every time?
hey @igiloh-pinecone, what you see in the code example is the only configuration I use; it's the same as the variables I export before I start the canopy server. So that's simply the Pinecone API key, the OpenAI API key, and the index/namespace.
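For reference, a sketch of the environment described above, using the variable names from Canopy's README (the values are placeholders, not real keys):

```shell
# Placeholders only: substitute your own keys and index name.
export PINECONE_API_KEY="<your-pinecone-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
export INDEX_NAME="<your-index-name>"

canopy start
```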
I've tried repeating the questions, but it seems to be the same thing. See the images at the bottom of this comment: one contains sources, the other does not.
Do you guys offer any paid support by chance? I'm sort of knowledgeable on a high level about a couple of RAG frameworks, and this is the first that does away with a lot of fluff because it's more tailored towards a use case. I can try again using LangChain, but the state of LangChain, for someone who isn't on top of it day-in day-out, is just too much to keep up with.
Responses using canopy server
Responses using the code example from library.md
@igiloh-pinecone perhaps not unimportant: the "text" property in Pinecone is often displayed as type "[]" instead of text. Is this because of the new lines from the chunking?
@igiloh-pinecone ... I found the issue after letting the code ponder in the back of my head :) The default encoder has changed to the latest OpenAI embedding (small) model, while my embeddings were still on ada embeddings. I see that in a previous Canopy install, in all likelihood the one running canopy server, the default encoder still points to ada-002... So that explains the trash results my knowledge base gave me, while at the same time the built-in canopy server returns decent RAG-based results.
I'll see if I can find the time to make a documentation PR so that the encoder is explicitly passed in the advanced example in library.md.
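For anyone hitting the same mismatch, here is a sketch of pinning the encoder explicitly when building the `ChatEngine` in code. It is based on the library.md example plus an assumed `record_encoder` parameter; class and parameter names may differ between Canopy versions, so verify against your installed version:

```python
from canopy.tokenizer import Tokenizer
from canopy.knowledge_base import KnowledgeBase
from canopy.knowledge_base.record_encoder import OpenAIRecordEncoder
from canopy.context_engine import ContextEngine
from canopy.chat_engine import ChatEngine

Tokenizer.initialize()

# Pin the embedding model explicitly instead of relying on the default,
# which changed from ada-002 to text-embedding-3-small in Canopy 0.7.0.
encoder = OpenAIRecordEncoder(model_name="text-embedding-ada-002")

kb = KnowledgeBase(index_name="my-index", record_encoder=encoder)
kb.connect()

chat_engine = ChatEngine(ContextEngine(kb))
```

This requires `PINECONE_API_KEY` and `OPENAI_API_KEY` to be set in the environment, just like the server.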
> The default encoder has changed to the latest openai embedding (small) model, while my embeddings were still on ada embeddings.
Thanks for the detailed response @coreation !!
That's definitely an oversight by us. We shouldn't have changed the default like that without at least highlighting it as a breaking change. I will change the issue's name to make it more discoverable by other users encountering the same problem.
Gist for other people encountering this problem:

Before version 0.7.0, Canopy's default `RecordEncoder` was `OpenAI(model_name='text-embedding-ada-002')`. In version 0.7.0, the default was changed to use OpenAI's new embedding model (`text-embedding-3-small`).

If you have inserted your documents in the past using an older Canopy version, then upgraded Canopy and tried using the `query()` or `chat()` functions, your newly loaded instance would be using a different embedding model than the one used for inserting the documents.
To fix this problem:

1. Run the `canopy create-config <path>` command to generate Canopy's default config templates in your desired `<path>`.
2. Edit the `default.yaml` file, changing the embedding `model_name` to `text-embedding-ada-002`.
3. Run `canopy start --config <path>/default.yaml`.
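The edit in the second step amounts to changing the embedding model name in the generated `default.yaml`. The exact nesting below is an assumption based on Canopy's config layout, so check your generated file, but it is roughly:

```yaml
# In <path>/default.yaml (generated by `canopy create-config`):
chat_engine:
  context_engine:
    knowledge_base:
      record_encoder:
        type: OpenAIRecordEncoder
        params:
          model_name: text-embedding-ada-002   # pin the pre-0.7.0 default
```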
Thanks for the swift response @igiloh-pinecone!
Hi there!
First of all, as we're dealing with open source, big kudos to the maintainers and to @igiloh-pinecone and @izellevy who have been responsive in the past on some questions!
My use case is that I really need to capture the different sources / vectors that were used in producing the final response, which is currently only possible by making the ChatEngine call in code instead of using the built-in REST API of canopy-server.
The issue I'm facing is that based on the same Pinecone index and namespace, I'm getting very different responses using the built-in REST API and the proposed code from the library.md docs.
Configuration
For simplicity's sake, I'll stick to one example:
After the accumulator is loaded, the power supply is interrupted and a switch made of likewise superconducting material is actuated. This switch is responsible for disconnecting the coil from the inverter. The circuit is then reconnected to the inverter to discharge the stored energy. In this way, alternating current is generated from the direct current.
The efficiency of this type of energy storage system for generating direct current is around 97 percent. However, considerable cooling requirements need to be taken into account, which often stand in the way of the technology’s economic industrial use.
Scenario output
Scenario 1: Using the built-in Canopy REST API endpoint
Scenario 2: Using the Python code mentioned above - varies between no response and a response made up solely by the LLM
Scenario 3: Using the python code mentioned above - but with only the 1 relevant document in the pinecone index
Findings
Difference in retrieval between the built-in canopy REST API and the example
By logging the debug info, it looks like scenario 2 fetches all kinds of unrelated vectors having nothing to do with the initial question; the scores for those documents are around 0.06. Checking the scores for scenario 3, it did find the content, as it was the only document in the Pinecone index for that scenario, and the score of the document was roughly -0.03.
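As an aside, scores hovering around zero are what you'd expect when query and document vectors effectively come from unrelated embedding spaces, which is consistent with the ada-002 vs text-embedding-3-small mismatch identified earlier in the thread. In high dimensions, unrelated vectors are nearly orthogonal, as a quick pure-Python sketch shows:

```python
import math
import random

random.seed(0)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two unrelated 1536-dimensional vectors (1536 is the output dimension of
# both text-embedding-ada-002 and text-embedding-3-small).
u = [random.gauss(0, 1) for _ in range(1536)]
v = [random.gauss(0, 1) for _ in range(1536)]

# The expected magnitude is about 1/sqrt(1536) ~= 0.026, so scores in the
# 0.06 / -0.03 range are consistent with "no real match".
print(round(cosine(u, v), 3))
```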
So it seems it gets the retrieval part "wrong", but what is odd is that the built-in canopy server, which afaik uses default values just like the code example, does get the retrieval part right. So my assumption is very likely wrong, and I'm missing some sort of additional initialisation or configuration.
This is where I'm kind of hard stuck and would greatly appreciate any pointers and I'll be more than happy to make a PR making the library.md more complete!