Closed: pauldeden closed this issue 1 year ago
Hi, this should be using text-embedding-ada-002 right? (the batch size should be 8k tokens)
In any case, try setting `chunk_size_limit` to a smaller value when you build the index: `index = GPTSimpleVectorIndex(docs, ..., chunk_size_limit=512)`
Thank you, @jerryjliu.
I made that change: `index = GPTSimpleVectorIndex(documents, chunk_size_limit=512)`
and got the following error.
```
PS C:\Users\paul.eden\Code\llm-emails> python .\emailchat.py
Traceback (most recent call last):
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\tenacity\__init__.py", line 409, in __call__
    result = fn(*args, **kwargs)
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_index\embeddings\openai.py", line 123, in get_embeddings
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
AssertionError: The batch size should not be larger than 2048.
```
@jerryjliu @pauldeden I have tried this on my document as well and it gives the same error (`assert len(list_of_text) <= 2048`). I used the text-davinci-001 model, since the comment above suggested the error might be caused by text-embedding-ada-002. When I reduced the document size, deleting 70% of the rows and keeping only the remaining 30% (about 11k rows), I was able to build the index. I am using the following code. Could you please take a look so that we can train on a large corpus of data?

```python
# imports assumed from the usual llama_index / langchain setup of this version
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader, LLMPredictor, PromptHelper
from langchain.llms import OpenAI

def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 256
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap,
                                 chunk_size_limit=chunk_size_limit)
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-001",
                                            max_tokens=num_outputs))

    # load documents from the given directory
    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor,
                                 prompt_helper=prompt_helper, chunk_size_limit=512)
    index.save_to_disk('index_davinci.json')
    return index
```
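One possible workaround for indexing a large corpus, sketched below, is to build the index incrementally rather than passing every document in one call, so that no single embedding request approaches the 2048-item assertion. This is only a rough sketch against the older `GPTSimpleVectorIndex` API used in this thread; it assumes that version allows constructing an empty index and exposes an `insert()` method, and the paths are illustrative.

```python
# Rough sketch (not from the original comment): build the index one document
# at a time so each embedding call sends a small batch.
# Assumes an older llama_index where GPTSimpleVectorIndex accepts an empty
# document list and provides insert().
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data/input").load_data()  # illustrative path

index = GPTSimpleVectorIndex([], chunk_size_limit=512)  # start with no documents
for doc in documents:
    index.insert(doc)  # each document's chunks are embedded in their own batch

index.save_to_disk("index_incremental.json")
```

Splitting the source files into smaller pieces before indexing (as described further down in this thread) achieves a similar effect.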
Having the same problem here, any solutions yet?
+1 on the problem.
Hi @pauldeden @maccarini @satpalsr, thanks for raising this. Going to look into it a bit more today!
I'm getting a similar error when I'm inserting large CSV files. Is there a theoretical limit to the size of a single file?
I encountered the same issue when dealing with a large .txt file.
I got around the above issue by breaking the files down to approx. 4 MB or smaller. I had several hundred megabytes of CSV to feed the model (historical email exports) that I had to break down into very small chunks.
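For anyone who wants to script that workaround, below is a rough sketch that splits a large CSV export into pieces of roughly 4 MB before feeding them to SimpleDirectoryReader. The paths, file names, and exact size threshold are illustrative assumptions, not taken from the comment above.

```python
# Sketch of the splitting workaround: write part_001.csv, part_002.csv, ...
# of about 4 MB each, repeating the header row in every part.
import csv
import os

MAX_BYTES = 4 * 1024 * 1024  # target size per output file (~4 MB)

def split_csv(src_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        part, out, writer = 0, None, None
        for row in reader:
            # start a new part file when the current one reaches the size limit
            if out is None or out.tell() >= MAX_BYTES:
                if out is not None:
                    out.close()
                part += 1
                out = open(os.path.join(out_dir, f"part_{part:03d}.csv"),
                           "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
        if out is not None:
            out.close()

# Example usage (hypothetical paths):
# split_csv("emails_export.csv", "data/input")
```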
> I got around the above issue by breaking the files down to approx. 4 MB or smaller. I had several hundred megabytes of CSV to feed the model (historical email exports) that I had to break down into very small chunks.
Thanks. This worked for me. My original file is about 4 MB; I had to split it into 1 MB files to get it to work.
Closing this issue for now, as it should be fixed in newer versions of llama_index.
Using the following code to load emails exported from Outlook into a single CSV file, I get the error below.
```python
import os
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

def read_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

os.environ["OPENAI_API_KEY"] = read_file('openaiapikey.txt')

if os.path.exists("data/emailindex.json"):
    # load from disk
    index = GPTSimpleVectorIndex.load_from_disk("data/emailindex.json")
else:
    documents = SimpleDirectoryReader('data/input').load_data()
    index = GPTSimpleVectorIndex(documents)
    # save to disk
    index.save_to_disk("data/emailindex.json")

while True:
    prompt = input("Prompt: ")
    response = index.query(prompt)
    print(response)
```
```
Traceback (most recent call last):
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\tenacity\__init__.py", line 409, in __call__
    result = fn(*args, **kwargs)
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_index\embeddings\openai.py", line 123, in get_embeddings
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
AssertionError: The batch size should not be larger than 2048.
```