microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: Unable to query using latest main branch #813

Closed KylinMountain closed 3 months ago

KylinMountain commented 4 months ago

Do you need to file an issue?

Describe the bug

2024-08-03 15:41:05,031 - asyncio - ERROR - Task exception was never retrieved
future: <Task finished name='Task-6' coro=<handle_stream_response.<locals>.run_search() done, defined at /Users/evilkylin/Projects/graphrag/./webserver/main.py:226> exception=ValueError('No objects to concatenate')>
Traceback (most recent call last):
  File "/Users/evilkylin/Projects/graphrag/./webserver/main.py", line 227, in run_search
    result = await search.asearch(request.messages[-1].content, conversation_history)
  File "/Users/evilkylin/Projects/graphrag/./graphrag/query/structured_search/local_search/search.py", line 66, in asearch
    context_text, context_records = self.context_builder.build_context(
  File "/Users/evilkylin/Projects/graphrag/./graphrag/query/structured_search/local_search/mixed_context.py", line 176, in build_context
    community_context, community_context_data = self._build_community_context(
  File "/Users/evilkylin/Projects/graphrag/./graphrag/query/structured_search/local_search/mixed_context.py", line 262, in _build_community_context
    context_text, context_data = build_community_context(
  File "/Users/evilkylin/Projects/graphrag/./graphrag/query/context_builder/community_context.py", line 165, in build_community_context
    context_name.lower(): pd.concat(all_context_records, ignore_index=True)
  File "/Users/evilkylin/Projects/miniconda3/envs/graphrag/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 382, in concat
    op = _Concatenator(
  File "/Users/evilkylin/Projects/miniconda3/envs/graphrag/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 445, in __init__
    objs, keys = self._clean_keys_and_objs(objs, keys)
  File "/Users/evilkylin/Projects/miniconda3/envs/graphrag/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 507, in _clean_keys_and_objs
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
INFO:     127.0.0.1:59844 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Steps to reproduce

I have re-indexed the whole dataset and deleted the cache, and now every query raises this exception.

Expected Behavior

Queries run normally.

GraphRAG Config Used

# Paste your config here

Logs and screenshots

No response

Additional Information

9prodhi commented 4 months ago

I encountered similar issues with the latest main branch code. After investigating, I found that there have been significant changes to the prompts in the main branch.

To resolve the issue, I replaced the graph extraction prompts in my project with the prompts from some of my older local projects. Once I made these updates, everything worked as expected.

KylinMountain commented 4 months ago

I commented on this PR but got no response: https://github.com/microsoft/graphrag/pull/783.

I don't know what the problem with it is.

At first I was enthusiastic about submitting PRs and fixing bugs, but now that enthusiasm is gone.

natoverse commented 4 months ago

We are investigating this now and hope to have a fix shortly.

@KylinMountain thank you for your engagement with the community! I've just posted some comments to clarify our PR approach, which may help explain things. We very much value your contributions, and will do our best to incorporate them.

AlonsoGuevara commented 4 months ago

Hi @KylinMountain, I tried reproducing the issue but wasn't able to. To help debug, can you please check whether your community_report.parquet file was generated properly, i.e. is not empty?
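For example, a quick check with pandas along these lines (the path is illustrative; point it at the community reports parquet in your own output artifacts folder):

import pandas as pd

# Hypothetical path -- substitute the location from your own output/<run>/artifacts
df = pd.read_parquet("output/artifacts/create_final_community_reports.parquet")
print(len(df))    # 0 here would explain "No objects to concatenate"
print(df.head())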

KylinMountain commented 4 months ago

[screenshot: the community reports parquet file, which is populated]

It is not empty. I could run both local and global queries on the previous commit, but the latest commit fails even after I re-index; it always reports this exception.

INFO: Reading settings from settings.yaml

INFO: Vector Store Args: {}
creating llm client with {'api_key': 'REDACTED,len=35', 'type': "openai_chat", 'model': 'deepseek-chat', 'max_tokens': 4096, 'temperature': 0.0, 'top_p': 0.99, 'n': 1, 'request_timeout': 180.0, 'api_base': 'https://api.deepseek.com/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 500000, 'requests_per_minute': 100, 'max_retries': 3, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 100}
creating embedding llm client with {'api_key': 'REDACTED,len=35', 'type': "openai_embedding", 'model': 'text-embedding-ada-002', 'max_tokens': 4000, 'temperature': 0, 'top_p': 1, 'n': 1, 'request_timeout': 180.0, 'api_base': 'http://localhost:1234/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': None, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 1}
Traceback (most recent call last):
  File "/Users/xxx/Projects/miniconda3/envs/graphrag/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/xxx/Projects/miniconda3/envs/graphrag/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/xx/Projects/graphrag/graphrag/query/__main__.py", line 83, in <module>
    run_local_search(
  File "/Users/xx/Projects/graphrag/graphrag/query/cli.py", line 186, in run_local_search
    result = search_engine.search(query=query)
  File "/Users/xxx/Projects/graphrag/graphrag/query/structured_search/local_search/search.py", line 118, in search
    context_text, context_records = self.context_builder.build_context(
  File "/Users/xxx/Projects/graphrag/graphrag/query/structured_search/local_search/mixed_context.py", line 176, in build_context
    community_context, community_context_data = self._build_community_context(
  File "/Users/xxx/Projects/graphrag/graphrag/query/structured_search/local_search/mixed_context.py", line 262, in _build_community_context
    context_text, context_data = build_community_context(
  File "/Users/xxx/Projects/graphrag/graphrag/query/context_builder/community_context.py", line 165, in build_community_context
    context_name.lower(): pd.concat(all_context_records, ignore_index=True)
  File "/Users/xxx/Projects/miniconda3/envs/graphrag/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 382, in concat
    op = _Concatenator(
  File "/Users/xxx/Projects/miniconda3/envs/graphrag/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 445, in __init__
    objs, keys = self._clean_keys_and_objs(objs, keys)
  File "/Users/xxx/Projects/miniconda3/envs/graphrag/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 507, in _clean_keys_and_objs
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
KylinMountain commented 4 months ago

@AlonsoGuevara You can try setting your local search config like this:

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  max_tokens: 4096

With this config, max_tokens is multiplied by the default community_prop of 0.1, so the community context budget becomes 4096 * 0.1 ≈ 409 tokens, which is far too small. The exception is raised whenever the first batch is larger than 409 tokens.
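A minimal sketch of that arithmetic (variable names are illustrative):

# The effective community context budget in local search, as described above
max_tokens = 4096        # local_search.max_tokens from settings.yaml
community_prop = 0.1     # default community_prop (see the commented config above)
community_budget = int(max_tokens * community_prop)
print(community_budget)  # 409 -- any first batch above this triggers the bug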

Let's look at the code in build_community_context:

def build_community_context(....):
    ....
    def _init_batch() -> None:
        nonlocal batch_text, batch_tokens, batch_records
        batch_text = (
            f"-----{context_name}-----" + "\n" + column_delimiter.join(header) + "\n"
        )
        batch_tokens = num_tokens(batch_text, token_encoder)
        batch_records = []

    def _cut_batch() -> None:
        # convert the current context records to pandas dataframe and sort by weight and rank if exist
        record_df = _convert_report_context_to_df(
            context_records=batch_records,
            header=header,
            weight_column=community_weight_name
            if entities and include_community_weight
            else None,
            rank_column=community_rank_name if include_community_rank else None,
        )
        if len(record_df) == 0:
            # an empty batch appends nothing to all_context_records
            return
        current_context_text = record_df.to_csv(index=False, sep=column_delimiter)
        all_context_text.append(current_context_text)
        all_context_records.append(record_df)

    # initialize the first batch
    _init_batch()

    for report in selected_reports:
        new_context_text, new_context = _report_context_text(report, attributes)
        new_tokens = num_tokens(new_context_text, token_encoder)

        print(batch_tokens, new_tokens, max_tokens)
        if batch_tokens + new_tokens > max_tokens:
            # add the current batch to the context data and start a new batch if we are in multi-batch mode
            _cut_batch()
            if single_batch:
                break
            _init_batch()

        # add current report to the current batch
        batch_text += new_context_text
        batch_tokens += new_tokens
        batch_records.append(new_context)

    # add the last batch if it has not been added
    if batch_text not in all_context_text:
        _cut_batch()

    return all_context_text, {
        # pd.concat raises ValueError("No objects to concatenate")
        # when all_context_records is empty
        context_name.lower(): pd.concat(all_context_records, ignore_index=True)
    }

I added debug code that prints batch_tokens, new_tokens, and max_tokens:

        print(batch_tokens, new_tokens, max_tokens)

It prints something like:

9 1047 409

So batch_tokens + new_tokens is larger than max_tokens, and max_tokens itself is too small. Execution therefore goes into _cut_batch and then _convert_report_context_to_df.

def _convert_report_context_to_df(
    context_records: list[list[str]],
    header: list[str],
    weight_column: str | None = None,
    rank_column: str | None = None,
) -> pd.DataFrame:
    """Convert report context records to pandas dataframe and sort by weight and rank if exist."""
    print('len context records', len(context_records))
    if len(context_records) == 0:
        return pd.DataFrame()
   ....

The context_records passed in here from _cut_batch is batch_records, which was initialized to [] and never filled, so an empty DataFrame comes back, all_context_records stays empty, and the final pd.concat fails.
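A toy reproduction of that chain (the numbers are the ones printed by the debug line above: 9 header tokens, a 1047-token first report, a 409-token budget; this is a simplified sketch, not graphrag's actual code):

import pandas as pd

max_tokens = 409
batch_tokens = 9              # header only, right after _init_batch
new_tokens = 1047             # first selected report

batch_records = []            # _init_batch leaves this empty
all_context_records = []

if batch_tokens + new_tokens > max_tokens:
    # _cut_batch: batch_records is still empty, so the early return fires
    # and nothing is appended to all_context_records
    if len(batch_records) > 0:
        all_context_records.append(pd.DataFrame(batch_records))

# the final concat then has nothing to work with
pd.concat(all_context_records, ignore_index=True)
# ValueError: No objects to concatenate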

KylinMountain commented 4 months ago

I think max_tokens is the maximum output tokens, not the LLM's input context size; am I wrong?

I also don't understand why we use max_tokens - history tokens...

Can you clarify how to set max_tokens, and what these settings actually mean for local and global queries?

natoverse commented 3 months ago

I was able to reproduce the bug using local search with 0.2.1. The PR that @ha2trinh mentioned fixes this; it should be resolved in release 0.2.2.