microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License
16.6k stars 1.56k forks

"Columns must be same length as key" #518

Closed xiaobie-lhm closed 1 month ago

xiaobie-lhm commented 1 month ago

Describe the bug

[Bug]: the indexing pipeline fails with:

```
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
```

Steps to reproduce

*No response*

Expected Behavior

*No response*

GraphRAG Config Used

```yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ollama
  type: openai_chat # or azure_openai_chat
  model: gemma2
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: https://localhost:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: lm-studio
    type: openai_embedding # or azure_openai_embedding
    model: Publisher/Repository/nomic-embed-text-v1.5.Q5_K_M.gguf
    api_base: http://localhost:1234/v1
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
```

Logs and screenshots

*No response*

Additional Information

- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues:

s106916 commented 1 month ago

This is a temporary hacked solution for ollama: https://github.com/s106916/graphrag

xiaobie-lhm commented 1 month ago

I used the same configuration as yours but hit the same bug:

```shell
pip install graphrag
mkdir -p ./ragtest/input
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt
python -m graphrag.index --init --root ./ragtest
# change the .env and settings.yaml files (same content as yours)
python -m graphrag.index --root ./ragtest
```

```
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
```

s106916 commented 1 month ago

Sorry, can you give it another try? Please note that settings.yaml has been altered; use the updated settings.yaml.
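For context when debugging this: "Columns must be same length as key" is pandas' generic error when a multi-column assignment receives a different number of columns than the target key list; during indexing with local models it typically surfaces when the model's output cannot be split into the expected number of fields. A minimal reproduction of the pandas error itself (illustrative only, not graphrag code):

```python
import pandas as pd

# A split that yields as many columns as keys assigns cleanly.
df = pd.DataFrame({"raw": ["1,2", "3,4"]})
df[["a", "b"]] = df["raw"].str.split(",", expand=True)

# But when the data yields a different number of columns than the
# key list, pandas raises the error seen in this issue.
df2 = pd.DataFrame({"raw": ["1", "2"]})
try:
    # split produces only 1 column, but 2 keys are given
    df2[["a", "b"]] = df2["raw"].str.split(",", expand=True)
except ValueError as e:
    print(e)  # Columns must be same length as key
```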
linghan16 commented 1 month ago

This can be solved by adjusting the overlap attribute of chunks in settings.yaml.

Vivi8n24 commented 1 month ago

> This can be solved by adjusting the overlap attribute of chunks in settings.yaml.

How do I adjust it?

sadimoodi commented 1 month ago

Same problem here.

GW00287440 commented 1 month ago

Using Xinference for the models will solve this problem.

sadimoodi commented 1 month ago

> Using Xinference for the models will solve this problem.

What is Xinference and how do I use it?

GW00287440 commented 1 month ago

Refer to https://inference.readthedocs.io/en/latest/. Xinference is a large-language-model inference framework that supports both LLM and embedding models and exposes an OpenAI-compatible interface.

AlonsoGuevara commented 1 month ago

Hi! We are consolidating alternate-model issues here: https://github.com/microsoft/graphrag/issues/657

kevinYu12138 commented 1 month ago

> This can be solved by adjusting the overlap attribute of chunks in settings.yaml.

I have tried, setting overlap to 0, 10, 50, ...:

```yaml
chunks:
  size: 300 # 300
  overlap: 0 # 100
  group_by_columns: [id]
```

and it did not work.
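Since adjusting the chunk overlap does not reliably help, a more direct workaround is to make the column assignment tolerant of malformed model output. The sketch below is a hypothetical helper (`safe_assign_columns` is not part of graphrag) showing the general defensive pattern: pad or truncate the value columns to match the key list before assigning, instead of letting pandas raise:

```python
import pandas as pd

def safe_assign_columns(df, keys, values):
    """Assign `values` to df[keys], padding missing columns with None and
    truncating extras, so the shapes always match instead of raising
    'Columns must be same length as key'."""
    values = values.copy()
    # Pad: fewer value columns than keys -> fill the rest with None.
    for i in range(values.shape[1], len(keys)):
        values[i] = None
    # Truncate: more value columns than keys -> drop the extras.
    values = values.iloc[:, : len(keys)]
    df[keys] = values.to_numpy()
    return df

# The failing shape from this issue: the split yields one column,
# but two target keys are expected.
df = pd.DataFrame({"raw": ["1", "2"]})
parts = df["raw"].str.split(",", expand=True)  # only 1 column
df = safe_assign_columns(df, ["a", "b"], parts)
print(df)  # column "b" is padded with None
```

This is the same shape-mismatch guard that the community forks above apply in various forms; whether padding with None is acceptable depends on how the downstream workflow consumes those columns.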