run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

ValueError: Got a larger chunk overlap (20) than chunk size (-130.0), should be smaller. #904

Closed drhedri1 closed 1 year ago

drhedri1 commented 1 year ago

I continue to get this issue with my code:

def construct_index(directory_path, api_key):
    max_input_size = 0.4
    num_output = 100
    max_chunk_overlap = 20
    chunk_size_limit = 600

    llm_predictor = LLMPredictor(
        llm=OpenAI(
            temperature=0.5,
            model_name="text-davinci-003",
            max_tokens=num_output,
            openai_api_key=api_key,
        )
    )
    prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(
        documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
    )
    index.save_to_disk("index.json")

    return index

Please help!!

logan-markewich commented 1 year ago

Max input size should be an integer, and it should be set to something at least 100 more than your chunk size limit and num_output.

For a model like davinci, the max input size is 4096 by default (and cannot be higher than this).
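For reference, settings consistent with that advice might look like the following (a sketch reusing the legacy gpt_index / PromptHelper API from the snippet above; the exact numbers are only illustrative):

max_input_size = 4096       # davinci's full context window; must be an integer
num_output = 100            # tokens reserved for the response
max_chunk_overlap = 20
chunk_size_limit = 600      # comfortably below max_input_size - num_output

prompt_helper = PromptHelper(
    max_input_size, num_output, max_chunk_overlap, chunk_size_limit=chunk_size_limit
)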

drhedri1 commented 1 year ago

Please help!

from gpt_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain import OpenAI, VectorDBQA
from langchain.document_loaders import DirectoryLoader
import os
from IPython.display import Markdown, display

def construct_index(directory_path, api_key):
    max_input_size = 4096
    num_output = 100
    max_chunk_overlap = 20
    chunk_size_limit = 600

    llm_predictor = LLMPredictor(
        llm=OpenAI(
            temperature=0.5,
            model_name="text-davinci-003",
            max_tokens=num_output,
            openai_api_key=api_key,
        )
    )
    prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(
        documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
    )
    index.save_to_disk("index.json")

    return index

def ask_ai():
    index = GPTSimpleVectorIndex.load_from_disk("index.json")
    while True:
        query = input("What's going on, how can I help you? ")
        response = index.query(query, response_mode="compact")
        display(Markdown(f'Response: {response.response}'))

os.environ["Open_api_key"] = input('paste your api key here and hit enter: ')

construct_index('/Users/domenicrhedrick/Desktop/GPT_bot/context_data', os.environ["Open_api_key"])

The output builds an index but doesn't let me ask questions. Output:

INFO:gpt_index.token_counter.token_counter:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [build_index_from_documents] Total embedding token usage: 943 tokens

logan-markewich commented 1 year ago

@drhedri1 You'll want to pass in the llm_predictor and prompt_helper again when loading from disk to keep your settings

index = GPTSimpleVectorIndex.load_from_disk(
    "index.json", llm_predictor=llm_predictor, prompt_helper=prompt_helper
)
hlwhl commented 1 year ago

I hit the same issue. The chunk size calculation is wrong: it becomes a negative number when the document is long enough.

hlwhl commented 1 year ago

It seems the index uses the whole document instead of the indexed text. @logan-markewich any ideas?

hlwhl commented 1 year ago

Got it fixed; check whether your query string is too long.

alexzhang2015 commented 1 year ago

BaseGPTIndex.__init__() got an unexpected keyword argument 'llm_predictor'

lxe commented 1 year ago

Running into this as well.

logan-markewich commented 1 year ago

@alexzhang2015 @lxe The latest versions of llama-index made some changes. The LLM predictor now goes into a new ServiceContext object:

https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html#customizing-llm-s
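A minimal sketch of the newer pattern, based on the usage guide linked above (the model choice and data path here are just examples):

from langchain.chat_models import ChatOpenAI
from llama_index import GPTVectorStoreIndex, LLMPredictor, ServiceContext, SimpleDirectoryReader

# Wrap the LLM in an LLMPredictor and hand it to a ServiceContext
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

# The service context is passed to the index instead of llm_predictor / prompt_helper
documents = SimpleDirectoryReader("./data").load_data()
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)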

lxkaka commented 1 year ago

same issue

ctemple commented 1 year ago

Same issue, are there any updates?

logan-markewich commented 1 year ago

@ctemple @lxkaka there are a few issues in this thread. What's the current issue you are facing?

chinesewebman commented 1 year ago

In site-packages\gpt_index\indices\prompt_helper.py, inside get_chunk_size_given_prompt(), there is:

result = (self.max_input_size - num_prompt_tokens - self.num_output) // num_chunks

I usually set num_output=3000. After the error [ValueError: Got a larger chunk overlap (20) than chunk size (-nnn), should be smaller.] was raised, I changed to num_output=1500 and the error went away. My chunk_size_limit is 1000; if I want longer responses, maybe I should reduce chunk_size_limit too, to reduce num_prompt_tokens during the query.
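To make that formula concrete (the prompt-token count below is a made-up example for a 4096-token model):

max_input_size = 4096
num_prompt_tokens = 1200    # hypothetical prompt size during a query
num_chunks = 1

# num_output = 3000 drives the available chunk size negative:
(max_input_size - num_prompt_tokens - 3000) // num_chunks   # -104 -> chunk overlap (20) > chunk size

# num_output = 1500 leaves a usable chunk size:
(max_input_size - num_prompt_tokens - 1500) // num_chunks   # 1396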

logan-markewich commented 1 year ago

Latest versions of llama-index (v0.6.20) have simplified this process quite a bit

Feel free to re-open this issue if it is still happening, but I'm going to close it for now.

ahwitz commented 1 year ago

Still happening reliably when using the MockLLMPredictor to measure output/develop faster, but I can't quite base-case what's happening.

Our basic workflow (skeleton code at the bottom of this comment) is:

1) Create a SimpleDirectoryReader on two folders, with GPTVectorStoreIndexes on top of them
2) Create a ComposableGraph over the indexes
3) graph.as_query_engine().query() something to return content for new files in those folders
4) Write the new content to those folders
5) Start over at step 1 and repeat a few times

We've added these two print lines in the code that instantiates TokenTextSplitter:

    def get_text_splitter_given_prompt(
        self, prompt: Prompt, num_chunks: int = 1, padding: int = DEFAULT_PADDING
    ) -> TokenTextSplitter:
        """Get text splitter configured to maximally pack available context window,
        taking into account of given prompt, and desired number of chunks.
        """
        chunk_size = self._get_available_chunk_size(prompt, num_chunks, padding=padding)
        if chunk_size == 0:
            raise ValueError("Got 0 as available chunk size.")
        chunk_overlap = int(self.chunk_overlap_ratio * chunk_size)
        print(len(get_empty_prompt_txt(prompt)), num_chunks, padding, chunk_size) # this
        print(self.chunk_overlap_ratio, chunk_overlap) # and this
        text_splitter = TokenTextSplitter(
            separator=self._separator,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            tokenizer=self._tokenizer,
        )
        return text_splitter

...and are getting the following output:

=== Iteration 1
6966 1 5 1624
0.1 162
6778 1 5 1667
0.1 166
6966 1 5 5719
0.1 571
6778 1 5 5762
0.1 576
=== Iteration 2
6721 1 5 1702
0.1 170
6533 1 5 1745
0.1 174
6721 1 5 1702
0.1 170
6533 1 5 1745
0.1 174
=== Iteration 3
10735 1 5 108
0.1 10
10547 1 5 151
0.1 15
11427 1 5 10
0.1 1
12022 1 5 -75
0.1 -7

We're not using any manual values for chunk size/count/input size/etc, just passing default prompts in. This also has not happened yet with a regular LLMPredictor. Any idea what could be going wrong? We'd like to be able to use the MockLLMPredictor to speed up dev work, but can just use the regular one if we need to.


The watered-down version of our code that I was using but wasn't able to base-case is:

import os

from dotenv import load_dotenv
load_dotenv()

# For creating the indexes
from langchain.chat_models import ChatOpenAI
from llama_index import (
    StorageContext, 
    load_index_from_storage, 
    GPTVectorStoreIndex,
    MockLLMPredictor,
    LLMPredictor,
    ServiceContext
)

# For creating the graph query engine
from llama_index.indices.composability import ComposableGraph

# For indexing documents
from llama_index.readers import Document
from llama_index import SimpleDirectoryReader

mock_llm_predictor = MockLLMPredictor()
service_context = ServiceContext.from_defaults(llm_predictor=mock_llm_predictor)

skip_extensions = [".pdf", ".docx", ".pptx", ".jpg", ".png", ".jpeg", ".mp3", ".mp4", ".csv", ".epub", ".md", ".mbox", ".ipynb", ".json"]
exclude = ["**/*" + ext for ext in skip_extensions]
exclude.append(".git/**/*")
subdirs = ['a', 'b']

# Logic more complicated than "range", but effectively...
for x in range(1, 3):
    indexes = {}
    top_dir = './repos/'
    for subdir in subdirs:
        docs = SimpleDirectoryReader(
            f"{top_dir}/{subdir}",
            exclude_hidden=False,
            exclude=exclude,
            recursive=True,
            file_metadata=lambda file : {"file_path": file}
        ).load_data()
        index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)
        indexes[subdir] = index

    # We don't persist this because it's tiny and doesn't take much time to generate
    graph = ComposableGraph.from_indices(
        GPTVectorStoreIndex,
        [indexes[subdir] for subdir in indexes],
        index_summaries=[f"Files in {subdir} dir" for subdir in indexes],
        service_context=service_context,
        root_id="root_id"
    )

    query_engine = graph.as_query_engine()
    result = query_engine.query('test?')
    for subdir in subdirs:
        with open(f"{top_dir}/{subdir}/new-file.txt", 'w') as f:
            f.write(result.response)
logan-markewich commented 1 year ago

@ahwitz I ran for 10 iterations and found no error with your code. Are you able to share the data you were reading in initially?

ahwitz commented 1 year ago

Definitely can't share the source data, very likely can't share the source code.

I spent a bit more time basecasing and got a very reliable means of triggering the error, in the Python below. Notes on this:

Name this example.py and run python example.py. Note the lorem_text import at the top, and paragraphs(20) towards the bottom.

import os

from lorem_text import lorem
from dotenv import load_dotenv
load_dotenv()

# For creating the indexes
from llama_index import (
    GPTVectorStoreIndex,
    MockLLMPredictor,
    LLMPredictor,
    ServiceContext
)

# For indexing documents
from llama_index import SimpleDirectoryReader

mock_llm_predictor = MockLLMPredictor()
service_context = ServiceContext.from_defaults(llm_predictor=mock_llm_predictor)

subdirs = ['jerryjliu/llama_index']

# Logic more complicated than "range", but effectively...
for x in range(1, 3):
    indexes = {}
    top_dir = './repos/'
    for subdir in subdirs:
        docs = SimpleDirectoryReader(
            f".",
            input_files=["example.py"],
            file_metadata=lambda file : {"file_path": file}
        ).load_data()
        index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)
        indexes[subdir] = index

        query_engine = indexes[subdir].as_query_engine()
        result = query_engine.query(lorem.paragraphs(20))
        print(result)
ahwitz commented 1 year ago

A very easy (and, seemingly, reliable) way to trigger this error is:

model = 'gpt-4'
model_max_tokens = BaseOpenAI.modelname_to_contextsize(model)
llm = ChatOpenAI(
    temperature=0,
    model_name=model,
    max_tokens=model_max_tokens
)
service_context = ServiceContext.from_defaults(
    chunk_size=model_max_tokens
)

...because, for that combination, in llama_index/indices/prompt_helper.py:


    def _get_available_context_size(self, prompt: Prompt) -> int:
        """Get available context size.

        This is calculated as:
            available context window = total context window
                - input (partially filled prompt)
                - output (room reserved for response)
        """
        empty_prompt_txt = get_empty_prompt_txt(prompt)
        prompt_tokens = self._tokenizer(empty_prompt_txt)
        num_prompt_tokens = len(prompt_tokens)

        print(self.context_window, num_prompt_tokens, self.num_output)
        return self.context_window - num_prompt_tokens - self.num_output

...self.context_window - self.num_output will always be 0 in that configuration, so any nonempty prompt pushes the result negative.
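Plugging in illustrative numbers (assuming an 8192-token context window for gpt-4; the prompt size is made up):

context_window = 8192       # assumed gpt-4 context size
num_output = 8192           # max_tokens set to the full window, as in the snippet above
num_prompt_tokens = 75      # any nonempty prompt

context_window - num_prompt_tokens - num_output   # -75 -> negative available chunk size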

logan-markewich commented 1 year ago

Yea, but that's more user error at that point rather than some error with our text chunking 🤔

logan-markewich commented 1 year ago

Although I guess the error could maybe be more descriptive if possible? Hmm
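For illustration, the check in get_text_splitter_given_prompt could raise something more actionable than the overlap error (just a sketch, not the library's actual code):

chunk_size = self._get_available_chunk_size(prompt, num_chunks, padding=padding)
if chunk_size <= 0:
    raise ValueError(
        f"Calculated an available chunk size of {chunk_size}. "
        "The prompt plus num_output does not fit in the context window; "
        "reduce num_output / max_tokens or the chunk size settings."
    )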

ahwitz commented 1 year ago

Yeah, the error message is definitely the problem here. "Check your chunk size settings" is a lot easier to debug than the sorta-generic ValueError, but I'm still not convinced that I've found the only way to trigger it, and I don't know if there's a reliable way to snip off this set of error cases and handle them upstream.

A lot of our problems right now are dealing with trying to maximize the context window, so I'm not surprised we keep butting into this now that I know what the situation is.

dosubot[bot] commented 1 year ago

Hi, @drhedri1! I'm Dosu, and I'm helping the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you encountered a ValueError when running the code, which indicated a larger chunk overlap than the specified chunk size. In the comments, there were suggestions to increase the max input size, pass in the llm_predictor and prompt_helper again when loading from disk, and check the query string length. It seems that you followed these suggestions and were able to resolve the issue.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LlamaIndex repository!