run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

AssertionError: The batch size should not be larger than 2048. #517

Closed pauldeden closed 1 year ago

pauldeden commented 1 year ago

Using the following code to load emails exported from Outlook into a single CSV file, I get the error below.

```python
import os
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

def read_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

os.environ["OPENAI_API_KEY"] = read_file('openaiapikey.txt')

if os.path.exists("data/emailindex.json"):
    # load from disk
    index = GPTSimpleVectorIndex.load_from_disk('data/emailindex.json')
else:
    documents = SimpleDirectoryReader('data/input').load_data()
    index = GPTSimpleVectorIndex(documents)
    # save to disk
    index.save_to_disk('data/emailindex.json')

while True:
    prompt = input("Prompt: ")
    response = index.query(prompt)
    print(response)
```

```
Traceback (most recent call last):
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\tenacity\__init__.py", line 409, in __call__
    result = fn(*args, **kwargs)
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_index\embeddings\openai.py", line 123, in get_embeddings
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
AssertionError: The batch size should not be larger than 2048.
```

jerryjliu commented 1 year ago

Hi, this should be using text-embedding-ada-002, right? (The batch size should be 8k tokens.)

In any case, try setting `chunk_size_limit` to a smaller value when you build the index: `index = GPTSimpleVectorIndex(docs, ..., chunk_size_limit=512)`

pauldeden commented 1 year ago

Thank you, @jerryjliu.

I made that change (`index = GPTSimpleVectorIndex(documents, chunk_size_limit=512)`) and got the following error.

```
PS C:\Users\paul.eden\Code\llm-emails> python .\emailchat.py
Traceback (most recent call last):
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\tenacity\__init__.py", line 409, in __call__
    result = fn(*args, **kwargs)
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_index\embeddings\openai.py", line 123, in get_embeddings
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
AssertionError: The batch size should not be larger than 2048.
```
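For context, the traceback points at `get_embeddings` in `llama_index/embeddings/openai.py`, which appears to send all queued chunk texts to the OpenAI embeddings endpoint in a single request; that request is capped at 2048 inputs, which would also explain why lowering `chunk_size_limit` alone does not help (smaller chunks just mean more texts per request). A rough sketch of the batching idea, purely illustrative and not the library's actual code:

```python
# Illustrative only: split the chunk texts into groups of at most 2048
# before embedding, so no single request exceeds the asserted limit.
def batched(texts, batch_size=2048):
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# Hypothetical usage, where get_embeddings stands in for the per-request
# embedding call; it is not a function defined in this thread.
# all_embeddings = [emb for batch in batched(all_chunk_texts)
#                   for emb in get_embeddings(batch)]
```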

bpkapkar commented 1 year ago

@jerryjliu @pauldeden I ran into the same `assert len(list_of_text) <= 2048` error on my own document. I tried the text-davinci-001 model, since the comment above suggested the error might be related to text-embedding-ada-002. Only after deleting about 70% of the rows and keeping the remaining 30% (around 11k rows) was I able to build the index. I am using the code below; please take a look so that we can index a large corpus of data.

```python
# Imports are not in the original comment; these assume the 0.4.x-era
# llama_index / langchain APIs used elsewhere in this thread.
from langchain.llms import OpenAI
from llama_index import (GPTSimpleVectorIndex, LLMPredictor, PromptHelper,
                         SimpleDirectoryReader)

def construct_index(directory_path):
    # set maximum input size
    max_input_size = 4096
    # set number of output tokens
    num_outputs = 256
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap,
                                 chunk_size_limit=chunk_size_limit)

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-001",
                                            max_tokens=num_outputs))

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor,
                                 prompt_helper=prompt_helper, chunk_size_limit=512)
    index.save_to_disk('index_davinci.json')
    return index
```
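For reference, a minimal usage sketch of the helper above; the directory path and query string are placeholders, not part of the original comment:

```python
# Hypothetical usage of construct_index defined above.
index = construct_index("data/input")  # path is an assumption
response = index.query("Summarize these emails.")
print(response)
```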

maccarini commented 1 year ago

Having the same problem here, any solutions yet?

satpalsr commented 1 year ago

+1 on the problem.

jerryjliu commented 1 year ago

Hi @pauldeden @maccarini @satpalsr, thanks for raising this. Going to look into it a bit more today!

playztag commented 1 year ago

I'm getting a similar error when I'm inserting large CSV files. Is there a theoretical limit to the size of a single file?

Brightchu commented 1 year ago

I encountered the same issue when dealing with a large .txt file.

playztag commented 1 year ago

I got around the above issue by breaking the files down to approximately 4 MB or smaller. I had several hundred megabytes of CSV (historical email exports) to feed the model, which I had to break down into very small chunks.
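For anyone reproducing this workaround, here is a rough sketch of pre-splitting a large CSV into smaller files before pointing SimpleDirectoryReader at the directory; the 4 MB threshold, paths, and file naming are assumptions taken from this thread, not anything the library requires:

```python
import csv
import os

MAX_BYTES = 4 * 1024 * 1024  # rough per-file target taken from this thread

def split_csv(src_path, out_dir):
    """Split src_path into ~4 MB CSV pieces in out_dir, repeating the header."""
    os.makedirs(out_dir, exist_ok=True)
    with open(src_path, newline='', encoding='utf-8') as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer, written, part = None, None, 0, 0
        for row in reader:
            if out is None or written > MAX_BYTES:
                if out is not None:
                    out.close()
                part += 1
                out = open(os.path.join(out_dir, f"part_{part}.csv"),
                           "w", newline='', encoding='utf-8')
                writer = csv.writer(out)
                writer.writerow(header)
                written = 0
            writer.writerow(row)
            written += sum(len(cell) for cell in row) + len(row)
        if out is not None:
            out.close()

# Hypothetical paths; 'data/input' matches the directory used earlier in the thread.
split_csv("data/emails.csv", "data/input")
```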

81jpayne commented 1 year ago

> I got around the above issue by breaking the files down to approximately 4 MB or smaller. I had several hundred megabytes of CSV (historical email exports) to feed the model, which I had to break down into very small chunks.

Thanks. This worked for me. My original file is about 4 MB; I had to split it into 1 MB files to get it to work.

logan-markewich commented 1 year ago

Closing this issue for now, as it should be fixed in newer versions of LlamaIndex.