Closed: KannamSridharKumar closed this issue 3 months ago
Not sure if it's the same error, but I'm also getting this:
FileNotFoundError: [Errno 2] No such file or directory: .../GraphRAG-Ollama-UI/ragtest/output/20240715-072352/artifacts/create_final_nodes.parquet'
Hey! Thanks for creating this issue. I'm not 100% sure why the _final.parquet files aren't showing up on your end. I just re-ran the indexing on some new text and got the full output, including the _final.parquet files. The GraphRAG workflow is exactly the same as the Microsoft repo; under the hood I'm really just running the same command line script they provide as an example, executed by Gradio. I'll keep looking into this and see if I can recreate the issue and/or write some code to guarantee the full output of all the files. A major update will also land later today, so I'll do my best to get your issue squared away as well so you can get this going properly.
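(For context, the command being wrapped is essentially the stock GraphRAG indexing entry point:)

```bash
python -m graphrag.index --root ./ragtest
```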
@bw-Deejee yes, I'm getting the same error.
Still working on debugging and figuring this out. I think it has something to do with switching out the original embedder and how the LLM is routed to keep everything local. I am doing a major refactor to address a lot of issues, so this should be solved as part of that. Hoping to have the changes up by end of day today or early tomorrow.
@KannamSridharKumar @ffdfo @severian42
Check the indexing-engine.log in the output folder; it will tell you where the run got lost.
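For example, a quick way to pull up the tail of the newest run's log (a minimal sketch, assuming the default ragtest layout used in this repo):

```python
# Sketch: print the tail of the most recent indexing-engine.log.
from pathlib import Path

logs = sorted(Path("ragtest/output").glob("*/reports/indexing-engine.log"))
if logs:
    latest = logs[-1]  # timestamped run folders sort chronologically
    print(f"--- {latest} ---")
    print("\n".join(latest.read_text(encoding="utf-8").splitlines()[-20:]))
```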
I think I know the root cause. This happened to me when I used the local embedding "nomic-embed-text-v1.5" instead of "nomic-embed-text".
In addition to changing the "settings.yaml" file, where you select the model names for the embedding model and the llm, you need to set the embedding name in the openai_embeddings_llm.py file.
Path: /graphrag/llm/openai/openai_embeddings_llm.py
At the end of the file, replace the model in this line:
embedding = ollama.embeddings(model="nomic-ai/nomic-embed-text-v1.5-GGUF", prompt=inp)
with the desired embedding model name.
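For reference, here is a minimal sketch of what the patched code can look like. The helper function and loop structure are illustrative assumptions; the only point that matters is that the model= argument must exactly match a tag from ollama list.

```python
# graphrag/llm/openai/openai_embeddings_llm.py (end of file, sketch only)
import ollama

def embed_all(inputs):  # hypothetical helper mirroring the patched loop
    embedding_list = []
    for inp in inputs:
        # Use the tag shown by `ollama list`, e.g. "nomic-embed-text",
        # not "nomic-ai/nomic-embed-text-v1.5-GGUF".
        embedding = ollama.embeddings(model="nomic-embed-text", prompt=inp)
        embedding_list.append(embedding["embedding"])
    return embedding_list
```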
Thanks for the fix @YVMVN! This worked for me, as the model names didn't match when using the repo version. If it still doesn't work, we can keep troubleshooting. The new version with the updates will work a bit differently than the current logic, to allow providers other than Ollama.
It doesn't work? It's the same error:

Error: Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/chanchen/Desktop/ollama_ui_graphy/GraphRAG-Ollama-UI/graphrag/query/__main__.py", line 84, in <module>
    run_global_search(
  File "/Users/chanchen/Desktop/ollama_ui_graphy/GraphRAG-Ollama-UI/graphrag/query/cli.py", line 67, in run_global_search
    final_nodes: pd.DataFrame = pd.read_parquet(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parquet.py", line 267, in read
    path_or_handle, handles, filesystem = _get_path_or_handle(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/parquet.py", line 140, in _get_path_or_handle
    handles = get_handle(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/io/common.py", line 882, in get_handle
    handle = open(handle, ioargs.mode)
NotADirectoryError: [Errno 20] Not a directory: '/Users/chanchen/Desktop/ollama_ui_graphy/GraphRAG-Ollama-UI/ragtest/output/.DS_Store/artifacts/create_final_nodes.parquet'
The indexing log:

00:23:47,427 graphrag.config.read_dotenv INFO No .env file found at ./ragtest
00:23:47,430 graphrag.index.cli INFO using default configuration: {
"llm": { "api_key": "REDACTED, length 19", "type": "openai_chat", "model": "mistral:7b", "max_tokens": 4000, "temperature": 0.0, "top_p": 1.0, "request_timeout": 180.0, "api_base": "http://localhost:11434/v1", "api_version": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 10 },
"parallelization": { "stagger": 0.3, "num_threads": 50 },
"async_mode": "threaded",
"root_dir": "./ragtest",
"reporting": { "type": "file", "base_dir": "output/${timestamp}/reports", "storage_account_blob_url": null },
"storage": { "type": "file", "base_dir": "output/${timestamp}/artifacts", "storage_account_blob_url": null },
"cache": { "type": "file", "base_dir": "cache", "storage_account_blob_url": null },
"input": { "type": "file", "file_type": "text", "base_dir": "input", "storage_account_blob_url": null, "encoding": "utf-8", "file_pattern": ".*\\.txt$", "file_filter": null, "source_column": null, "timestamp_column": null, "timestamp_format": null, "text_column": "text", "title_column": null, "document_attribute_columns": [] },
"embed_graph": { "enabled": false, "num_walks": 10, "walk_length": 40, "window_size": 2, "iterations": 3, "random_seed": 597832, "strategy": null },
"embeddings": { "llm": { "api_key": "REDACTED, length 19", "type": "openai_embedding", "model": "nomic_embed_text", "max_tokens": 4000, "temperature": 0, "top_p": 1, "request_timeout": 180.0, "api_base": "http://localhost:11434/v1", "api_version": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": null, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 10 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "batch_size": 16, "batch_max_tokens": 8191, "target": "required", "skip": [], "vector_store": null, "strategy": null },
"chunks": { "size": 512, "overlap": 64, "group_by_columns": [ "id" ], "strategy": null },
"snapshots": { "graphml": true, "raw_entities": true, "top_level_nodes": true },
"entity_extraction": { "llm": { "api_key": "REDACTED, length 19", "type": "openai_chat", "model": "mistral:7b", "max_tokens": 4000, "temperature": 0.0, "top_p": 1.0, "request_timeout": 180.0, "api_base": "http://localhost:11434/v1", "api_version": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 10 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": "prompts/entity_extraction.txt", "entity_types": [ "organization", "person", "geo", "event" ], "max_gleanings": 0, "strategy": null },
"summarize_descriptions": { "llm": { "api_key": "REDACTED, length 19", "type": "openai_chat", "model": "mistral:7b", "max_tokens": 4000, "temperature": 0.0, "top_p": 1.0, "request_timeout": 180.0, "api_base": "http://localhost:11434/v1", "api_version": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 10 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": "prompts/summarize_descriptions.txt", "max_length": 500, "strategy": null },
"community_reports": { "llm": { "api_key": "REDACTED, length 19", "type": "openai_chat", "model": "mistral:7b", "max_tokens": 4000, "temperature": 0.0, "top_p": 1.0, "request_timeout": 180.0, "api_base": "http://localhost:11434/v1", "api_version": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 10 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": null, "max_length": 2000, "max_input_length": 8000, "strategy": null },
"claim_extraction": { "llm": { "api_key": "REDACTED, length 19", "type": "openai_chat", "model": "mistral:7b", "max_tokens": 4000, "temperature": 0.0, "top_p": 1.0, "request_timeout": 180.0, "api_base": "http://localhost:11434/v1", "api_version": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 10 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "enabled": false, "prompt": "prompts/claim_extraction.txt", "description": "Any claims or facts that could be relevant to information discovery.", "max_gleanings": 0, "strategy": null },
"cluster_graph": { "max_cluster_size": 10, "strategy": null },
"umap": { "enabled": false },
"local_search": { "text_unit_prop": 0.5, "community_prop": 0.1, "conversation_history_max_turns": 5, "top_k_entities": 10, "top_k_relationships": 10, "max_tokens": 12000, "llm_max_tokens": 2000 },
"global_search": { "temperature": 0.0, "top_p": 1.0, "max_tokens": 12000, "data_max_tokens": 12000, "map_max_tokens": 1000, "reduce_max_tokens": 2000, "concurrency": 32 },
"encoding_model": "cl100k_base",
"skip_workflows": []
}
00:23:47,431 graphrag.index.create_pipeline_config INFO skipping workflows
00:23:47,438 graphrag.index.run INFO Running pipeline
00:23:47,438 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at ragtest/output/20240716-002347/artifacts
00:23:47,438 graphrag.index.input.load_input INFO loading input from root_dir=input
00:23:47,438 graphrag.index.input.load_input INFO using file storage for input
00:23:47,439 graphrag.index.storage.file_pipeline_storage INFO search ragtest/input for files matching .*\.txt$
00:23:47,439 graphrag.index.input.text INFO found text files from input, found [('article233.txt', {}), ('article23.txt', {}), ('九州捭阖录屠龙之主.txt', {}), ('article.txt', {}), ('Design and Biomimicry_ A Review of Interco - Alice Araujo Marques de Sa.txt', {}), ('article2.txt', {}), ('xiaoshuo.txt', {})]
I tried your suggested fix, but it didn't work, which makes me curious.
You will want to make sure that the embedding model is named identically in settings.yaml and in OpenAIEmbeddingsLLM, if you add it there as well. @CarolVim, I saw in your logging that the Nomic Embed model is named "nomic_embed_text". I had to change this to "nomic-embed-text" to match the model name in my settings.yaml, and it worked. See if that works by chance, and I'll see if I can find other potential causes.
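Concretely, the relevant settings.yaml section should look something like this sketch (key names inferred from the logged configuration above; your values may differ):

```yaml
embeddings:
  llm:
    type: openai_embedding
    model: nomic-embed-text   # hyphens, exactly as shown by `ollama list`
    api_base: http://localhost:11434/v1
```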
I ran the ollama list command and here is the result:
(base) chanchen@MacBook-Air input % ollama list
NAME ID SIZE MODIFIED
mistral:latest 2ae6f6dd7a3d 4.1 GB 28 hours ago
nomic-embed-text:latest 0a109f422b47 274 MB 5 days ago
internlm2:7b-chat-v2.5-q4_K_S e978b21250f8 4.5 GB 11 days ago
mxbai-embed-large:latest 468836162de7 669 MB 11 days ago
deepseek-coder-v2:latest 8577f96d693e 8.9 GB 2 weeks ago
llava:latest 8dd30f6b0cb1 4.7 GB 2 weeks ago
qwen2:7b e0d4e1163c58 4.4 GB 5 weeks ago
I tried it, but it didn't work, sadly.
I checked the folder and the _final.parquet files weren't there. I'm running this on my MacBook M1.
Looks like I am having the same issue as many others. I get the very same error when I run the GraphRAG CLI.
Error: Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\GraphRAG-Ollama-UI\graphrag\query\__main__.py", line 84, in <module>
    run_global_search(
  File "D:\GraphRAG-Ollama-UI\graphrag\query\cli.py", line 67, in run_global_search
    final_nodes: pd.DataFrame = pd.read_parquet(
                                ^^^^^^^^^^^^^^^^
  File "C:\Users\14045\AppData\Roaming\Python\Python311\site-packages\pandas\io\parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "C:\Users\14045\AppData\Roaming\Python\Python311\site-packages\pandas\io\parquet.py", line 267, in read
    path_or_handle, handles, filesystem = _get_path_or_handle(
                                          ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\14045\AppData\Roaming\Python\Python311\site-packages\pandas\io\parquet.py", line 140, in _get_path_or_handle
    handles = get_handle(
              ^^^^^^^^^^^
  File "C:\Users\14045\AppData\Roaming\Python\Python311\site-packages\pandas\io\common.py", line 882, in get_handle
    handle = open(handle, ioargs.mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'D:\GraphRAG-Ollama-UI\ragtest\output\20240715-230142\artifacts\create_final_nodes.parquet'
This is the log that was produced by indexing job 20240715-230142, which was reported as completed. The log file is completely blank... IDK...
@JB5579 I don't understand this architecture very well yet, but based on my understanding, the log is blank because the indexing was never initialised. Try checking the other file in the same folder, named logs.json.
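If it helps, here is a small sketch for inspecting logs.json from the newest run (assuming the default ragtest layout and that the file is JSON-lines, one record per line; adjust if your copy is a single JSON array):

```python
# Sketch: print the last few records of logs.json from the most recent run.
import json
from pathlib import Path

reports = sorted(Path("ragtest/output").glob("*/reports"))[-1]
with open(reports / "logs.json", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
for record in records[-5:]:
    print(record)
```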
This most recent update should solve the issues with indexing and creating the needed output files. Give it a try and report back if you still encounter any errors in indexing and then querying.
How to query the generated graph:
- If you are able to run the indexing with no errors, you will end up with the full output files.
- Once you have those, you will need to initialize the folder within the Index Management tab.
- Once you have initialized (there should be 20 items in total), the graph becomes available to query with the LLM, as shown below.
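Once the folder is initialized, the underlying query call is the stock GraphRAG entry point, e.g. (the question here is just an example):

```bash
python -m graphrag.query --root ./ragtest --method global "What are the top themes in this dataset?"
```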
Let me share the solution I found. By looking at indexing-engine.log in output/<timestamp>/reports, you can identify the issue. The problem was the llm model name: Ollama had mistral installed, while the settings listed mistral:7b. I initially thought that wasn't an issue, but the parquet files still weren't being generated. When I changed the model name to mistral, it started working. You should give it a try as well. Also, always make sure to test with the latest version of Ollama.
Genius! Thanks for catching this. I think I need to find a better way to maintain a single model variable across all the different configs and settings. I am still trying to wrap my head around the way Microsoft handles the processing and hierarchy. The new refactored version (not pushed yet) works a bit differently since I added the ability to set your own OpenAI-compatible base URL and model. Hopefully we won't run into this after the update, but in the meantime @ckj18 found a great workaround. Many thanks!
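For anyone else hitting this, the line in question looks like the following sketch (key names taken from the logged configuration earlier in this thread; your values may differ):

```yaml
llm:
  type: openai_chat
  model: mistral   # must be a tag that actually appears in `ollama list`
  api_base: http://localhost:11434/v1
```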
I've downloaded the latest code and run the indexing. The _final.parquet files are not being created in the output/artifacts directory.
I previously ran GraphRAG from the command line using the Microsoft repo code, so I don't believe I'm making a mistake.