raghavan / PdfGptIndexer

RAG-based tool for indexing and searching PDF text data using the OpenAI API and a FAISS (Facebook AI Similarity Search) index, designed for rapid information retrieval and superior search accuracy.

token limits #2

Closed manfromdownunder closed 1 year ago

manfromdownunder commented 1 year ago

I have run this test project on a small number of PDFs and it works great! I want to expand the number of PDFs to cover a larger amount of information; however, I am hitting limits of some kind and the script falls over.

How do you handle a large number of tokens? I have tried truncation and different max_length values but can't work out how to scale this to a larger data set.

raghavan commented 1 year ago

Hi @manfromdownunder, are you experiencing limitations due to OpenAI's rate limiting, or some other issue? Can you please share the error message so I can better understand the problem and assist you accordingly?

manfromdownunder commented 1 year ago

I believe it may be due to OpenAI rate limiting. I ran the script again today and am not seeing any rate-limit or script errors.

Having said that, I looked into options for adding pauses or delays to the embedding calls to help with the API limits, but had no luck implementing this. Do you have any guidance on this?

A follow-up question: I do not see the vector database being created anywhere; I think it is just a temporary database. Is this how it should be running? I suspect this is why I hit limits, as each re-run of the script re-creates the database.

Thank you for your guidance and for sharing this awesome project sample!

Spoke too soon. The error is:

Traceback (most recent call last):
  File "/home/user/git/PdfGptIndexer/pdf_gpt_indexer.py", line 68, in <module>
    db_temp = FAISS.from_documents(chunk, embeddings)
  File "/home/user/.local/lib/python3.10/site-packages/langchain/vectorstores/base.py", line 336, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/langchain/vectorstores/faiss.py", line 550, in from_texts
    return cls.__from(
  File "/home/user/.local/lib/python3.10/site-packages/langchain/vectorstores/faiss.py", line 501, in __from
    index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: list index out of range

I added truncation=True to this code and things move further; however, the script still eventually errors out again:

return len(tokenizer.encode(text, max_length=32, truncation=True))
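
For context, a minimal sketch of how such a token-counting length function might be wired into a LangChain text splitter (the splitter and its settings here are assumptions, not the project's actual code):

from transformers import GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def token_length(text):
    # truncation=True avoids complaints about over-long inputs, but it also
    # caps the reported count at max_length, which can hide oversized chunks.
    return len(tokenizer.encode(text, max_length=32, truncation=True))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=token_length,
)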

raghavan commented 1 year ago

There are challenges in scaling this tooling: you hit rate limits if you use a hosted LLM provider like OpenAI to create embeddings, and this code does not store the generated vector index. So each time you run the tool, the embeddings are recreated, which is inefficient on a large dataset. Here are some ideas to overcome these limitations.

Token Rate Limit

Here are a few options if you want to work around OpenAI's rate limits:

1. Multi-key strategy

Instead of using a single OpenAI API key, use a collection of keys and switch to the next one each time you encounter a rate-limiting error from OpenAI.
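
A minimal sketch of this idea, cycling through a pool of keys with itertools (the key values and the next_embeddings helper are hypothetical, not part of the project):

import itertools
from langchain.embeddings import OpenAIEmbeddings

# Hypothetical key pool; the values are placeholders, not real credentials.
api_keys = itertools.cycle(["sk-key-1", "sk-key-2", "sk-key-3"])

def next_embeddings():
    # Build a fresh embeddings client bound to the next key in the pool.
    return OpenAIEmbeddings(openai_api_key=next(api_keys))

On a rate-limit error you would call embeddings = next_embeddings() and retry the failed chunk with the new client.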

2. Manual sleep

Introduce a sleep before creating the embeddings for each chunk. This is a sub-optimal solution, but it will help you get past rate-limit issues.

import time

for chunk in all_chunks[1:]:
    db_temp = FAISS.from_documents(chunk, embeddings)  # one batch of embedding calls per chunk
    db.merge_from(db_temp)
    time.sleep(1)  # example: pause between chunks to stay under the rate limit
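
A refinement on the fixed sleep (my suggestion, not in the project): retry each chunk with exponential backoff, sleeping only when the API actually rejects a call. This sketch assumes the pre-1.0 openai package, where rate-limit failures raise openai.error.RateLimitError:

import time

import openai
from langchain.vectorstores import FAISS

def from_documents_with_backoff(chunk, embeddings, max_retries=5):
    # Retry the embedding call, doubling the wait after each rate-limit error.
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return FAISS.from_documents(chunk, embeddings)
        except openai.error.RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after max_retries attempts
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, ...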

Storage

LangChain provides useful documentation on how to store and load vector indices. Using these methods avoids re-creating the vectors and lets you reuse them once they're generated.

Store existing index

db.save_local("faiss_index")

Load from stored index

new_db = FAISS.load_local("faiss_index", embeddings)
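
Putting the two together, a minimal load-or-build sketch (the os.path.exists check is an assumption about how you might wire this in; all_chunks follows the loop above):

import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
index_path = "faiss_index"

if os.path.exists(index_path):
    # Reuse the stored vectors; no new embedding calls are made.
    db = FAISS.load_local(index_path, embeddings)
else:
    # Build once from the document chunks, then persist for future runs.
    db = FAISS.from_documents(all_chunks[0], embeddings)
    for chunk in all_chunks[1:]:
        db.merge_from(FAISS.from_documents(chunk, embeddings))
    db.save_local(index_path)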