microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: generate_text_embeddings workflow fails due to KeyError #1386

Closed JoedNgangmeni closed 1 week ago

JoedNgangmeni commented 1 week ago

Describe the bug

Every time I run the indexer, python -m graphrag index --root ./mydirectory --verbose, I get the same error.

I know this is a bug because I have tried it with different initializations and different data.

I have tracked it down to something to do with the way "name_description" is generated.

I am unsure whether it happens in graphrag > index > flows > generate_text_embeddings.py > generate_text_embeddings > entity_description_embedding or in graphrag > index > update > entities.py > _run_entity_description_embedding.

Both of these files create name_description by concatenating the name and the description of an entity. This causes problems later in the _text_embed_with_vector_store function (graphrag > index > operations > embed_text > embed_text.py).

My understanding is that this happens because instead of finding the column title (which should be something like "title", "name", "description", or even "name_description"), it is finding something like "name:description_paragraph", which does not exist as a standalone column.
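
To make that concrete, here is a minimal, hypothetical sketch (a toy DataFrame and a made-up separator, not the actual graphrag flow) of how a combined name_description column gets built and why looking up a combined string as if it were a column name raises a KeyError:

import pandas as pd

# Toy entity frame; the real frames come out of the indexing workflows.
entities = pd.DataFrame({
    "id": ["e1", "e2"],
    "title": ["ACME CORP", "JANE DOE"],
    "description": ["A fictional company.", "A fictional person."],
})

# The two files mentioned above build the combined text roughly like this
# (the exact separator is an assumption here):
entities["name_description"] = entities["title"] + ":" + entities["description"]

# If the embed/title column argument arrives as a combined string such as
# "name:description_paragraph" instead of a real column name, the lookup
# fails, because no column by that name exists.
embed_column = "name:description_paragraph"  # hypothetical bad value
texts = entities[embed_column]  # KeyError: 'name:description_paragraph'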

Other question: Is the generate_text_embeddings process necessary for optimal performance? I'm not actually sure what it even does or why it's needed.

Steps to reproduce

Follow instructions on the graphrag quick start page

Expected Behavior

All verbs and workflows run smoothly with no errors.

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: azure_openai_chat #or openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  audience: "https://cognitiveservices.azure.com/.default"
  max_tokens: 4000
  request_timeout: 180.0
  api_base:  __REDACTED___
  api_version: '2024-02-15-preview'
  # organization: <organization_id>
  deployment_name: gpt-4o-mini
  tokens_per_minute: 480_000 # set a leaky bucket throttle
  requests_per_minute: 1_400 # set a leaky bucket throttle
  max_retries: 15
  max_retry_wait: 40.0
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 25 # the number of parallel inflight requests that may be made
  temperature: 0 # temperature for sampling
  top_p: 1 # top-p sampling
  n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  # target: required # or all
  # batch_size: 16 # the number of documents to send in a single request
  # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    collection_name: default
    overwrite: true
  # vector_store: # configuration for AI Search
    # type: azure_ai_search
    # url: <ai_search_endpoint>
    # api_key: <api_key> # if not set, will attempt to use managed identity. Expects the `Search Index Data Contributor` RBAC role in this case.
    # audience: <optional> # if using managed identity, the audience to use for the token
    # overwrite: true # or false. Only applicable at index creation time
    # collection_name: <collection_name> # the name of the collection to use. Default: 'default'
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: azure_openai_embedding #or  openai_embedding 
    model: text-embedding-3-large
    api_base: __REDACTED___
    api_version: "2023-05-15"
    audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    deployment_name: text-embedding-3-large
    tokens_per_minute: 290_000 # set a leaky bucket throttle
    requests_per_minute: 1_700 # set a leaky bucket throttle
    max_retries: 15
    max_retry_wait: 40.0
    sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    concurrent_requests: 25 # the number of parallel inflight requests that may be made

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

update_index_storage: # Storage to save an updated index (for incremental indexing). Enabling this performs an incremental index run
  # type: file # or blob
  # base_dir: "update_output"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "logs"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## strategy: fully override the entity extraction strategy.
  ##   type: one of graph_intelligence, graph_intelligence_json and nltk
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

I've added log lines in the code (graphrag > index > operations > embed_text > embed_text.py > _text_embed_with_vector_store) to understand what is happening. I will provide the full logs as well, but those are a couple tens of thousands of lines long, so first I am providing the relevant excerpts.

Here is the code so you can understand the logs:

async def _text_embed_with_vector_store(
    input: pd.DataFrame,
    callbacks: VerbCallbacks,
    cache: PipelineCache,
    embed_column: str,
    strategy: dict[str, Any],
    vector_store: BaseVectorStore,
    vector_store_config: dict,
    id_column: str = "id",
    title_column: str | None = None,
):

    strategy_type = strategy["type"]
    strategy_exec = load_strategy(strategy_type)
    strategy_args = {**strategy}

    # Get vector-storage configuration
    insert_batch_size: int = (
        vector_store_config.get("batch_size") or DEFAULT_EMBEDDING_BATCH_SIZE
    )

    overwrite: bool = vector_store_config.get("overwrite", True)

    if embed_column not in input.columns:
        msg = f"Column {embed_column} not found in input dataframe with columns {input.columns}"
        raise ValueError(msg)
    title = title_column or embed_column
    if title not in input.columns:
        msg = (
            f"Column {title} not found in input dataframe with columns {input.columns}"
        )
        raise ValueError(msg)
    if id_column not in input.columns:
        msg = f"Column {id_column} not found in input dataframe with columns {input.columns}"
        raise ValueError(msg)

    log.info(f"-------------------- list of columns: {input.columns} \n embed_column: {embed_column} \n title_column: {title_column} \n input[embed_column].columns: {input[embed_column].head()} ------------------------------")

    total_rows = 0
    for row in input[embed_column]:
        log.info(f" ============================== this row: {row} ============================== ")
        if isinstance(row, list):
            total_rows += len(row)
        else:
            total_rows += 1

    i = 0
    starting_index = 0

    all_results = []

    while insert_batch_size * i < input.shape[0]:
        log.debug(f"+++++++++++ Current title {title} +++++++++")
        batch = input.iloc[insert_batch_size * i : insert_batch_size * (i + 1)]
        texts: list[str] = batch[embed_column].to_numpy().tolist()
        titles: list[str] = batch[title].to_numpy().tolist()
        ids: list[str] = batch[id_column].to_numpy().tolist()
        result = await strategy_exec(
            texts,
            callbacks,
            cache,
            strategy_args,
        )
        if result.embeddings:
            embeddings = [
                embedding for embedding in result.embeddings if embedding is not None
            ]
            all_results.extend(embeddings)

        vectors = result.embeddings or []
        documents: list[VectorStoreDocument] = []
        for id, text, title, vector in zip(ids, texts, titles, vectors, strict=True):
            if type(vector) is np.ndarray:
                vector = vector.tolist()
            document = VectorStoreDocument(
                id=id,
                text=text,
                vector=vector,
                attributes={"title": title},
            )
            documents.append(document)

        vector_store.load_documents(documents, overwrite and i == 0)
        starting_index += len(documents)
        i += 1

    return all_results
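
Separate from the name_description hypothesis, one detail in the excerpt above may be relevant: the inner for loop re-binds the name title (for id, text, title, vector in zip(...)), which starts out holding the title column name. On the second pass of the while loop, batch[title] then looks up the previous batch's last title text as if it were a column name, which would also surface as a KeyError. Here is a minimal sketch with a toy DataFrame (not graphrag code) that reproduces that pattern:

import pandas as pd

# Toy frame standing in for the entity embeddings input.
frame = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "description": ["a", "b", "c", "d"],
})

title = "title"          # column name, as in the function above
insert_batch_size = 2
i = 0
while insert_batch_size * i < frame.shape[0]:
    batch = frame.iloc[insert_batch_size * i : insert_batch_size * (i + 1)]
    # On the second iteration this raises KeyError: 'Beta', because the inner
    # loop below has re-bound `title` from the column name to a row value.
    titles = batch[title].to_numpy().tolist()
    for _id, title in zip(batch["id"], titles, strict=True):
        pass  # stand-in for building a VectorStoreDocument with attributes={"title": title}
    i += 1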

Attached the logs: indexing-engine.log

Additional Information

chenboju commented 1 week ago

Try setting request_timeout higher than 180.0, for example 210.0 or 1800.0.

JoedNgangmeni commented 1 week ago

> Try setting request_timeout higher than 180.0, for example 210.0 or 1800.0.

I will try it, but I am confused about how that would help with the KeyError; I don't understand the KeyError to depend on a timeout, but rather on a value pairing.

WeijieZhu93 commented 1 week ago

I've also encountered the same problem since I upgraded to version 0.4.0.

ShanLul commented 1 week ago

> I've also encountered the same problem since I upgraded to version 0.4.0.

If you are using a local model, you can use FastChat to encapsulate the embedding model as GPT-4, so that it can run normally.

DollKM commented 1 week ago

> I've also encountered the same problem since I upgraded to version 0.4.0.

> If you are using a local model, you can use FastChat to encapsulate the embedding model as GPT-4, so that it can run normally.

Can we only use GPT-4? This is really bad news. I am using glm-4-flash as the LLM and bge-m3 as the embedding model, both of which are local models, and I am getting the same error.

ShanLul commented 1 week ago

> I've also encountered the same problem since I upgraded to version 0.4.0.

> If you are using a local model, you can use FastChat to encapsulate the embedding model as GPT-4, so that it can run normally.

> Can we only use GPT-4? This is really bad news. I am using glm-4-flash as the LLM and bge-m3 as the embedding model, both of which are local models, and I am getting the same error.

"No, that's not what I mean. You can use bge-large-zh-v1.5 as your vector model for local deployment. At the same time, you need to write a script file to name your local vector model as gpt-4. In this way, it will be confused with OpenAI's interface in graphrag, so that graphrag can run normally." Below is my sh file configuration. MODEL_WORKER_CUDA_VISIBLE_DEVICES="0" MODEL_WORKER_MODEL_PATH="/root/autodl-tmp/bge-large-zh-v1.5" MODEL_WORKER_MODEL_NAMES="gpt-4" MODEL_WORKER_NUM_GPUS="1" MODEL_WORKER_CONTROLLER_ADDRESS="http://$CONTROLLER_HOST:$CONTROLLER_PORT" Then is my settings.yaml

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default # A prefix for the vector store to create embedding containers. Default: 'default'.
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: gpt-4
    api_base: http://127.0.0.1:8200/v1

I am currently using FastChat version 0.23.5

JoedNgangmeni commented 1 week ago

I updated to GraphRAG 0.4.1 and it seems this error is fixed.