9prodhi opened this issue 1 month ago
When building the graph, the most time-consuming part seems to be calling the LLM. Even though the code thoughtfully uses asynchronous methods, the time cost is still significant. I attempted to modify the code to use batch mode for the LLM, but the pipeline involves multiple layers of API calls, which made this difficult to implement. I'm curious whether the dataset sizes the authors used in their experiments were only laboratory-scale.
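For what it's worth, a bounded-concurrency pattern like the one below is roughly what I was aiming for. This is a minimal sketch: `call_llm` is a hypothetical placeholder for whatever client call the pipeline actually makes, and the concurrency limit is an arbitrary choice, not a GraphRAG setting.

```python
import asyncio

# Hypothetical stand-in for the actual LLM client call used by the pipeline.
async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0.1)  # simulate network latency
    return f"response to {prompt!r}"

async def run_batched(prompts: list[str], max_concurrency: int = 8) -> list[str]:
    # Bound the number of in-flight requests so calls overlap
    # without exceeding the API's rate limits.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> str:
        async with sem:
            return await call_llm(prompt)

    # gather() preserves input order in its results.
    return await asyncio.gather(*(one(p) for p in prompts))

if __name__ == "__main__":
    print(asyncio.run(run_batched(["prompt A", "prompt B", "prompt C"])))
```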
I am using GraphRAG to process a large file (~7GB). While processing works fine for smaller files (in the MB range), the workflow experiences significant delays with the larger file: it takes a long time to load, and after more than an hour the workflow still hasn't reached the verb execution stage.
Here are the details of the issue:
Small File Processing:
Large File Processing:
System specs:
Although the verb function is not being called for larger files yet, I would also like to ask about optimizing performance for large file processing. In the relevant part of my pipeline, I am using the `num_threads` and `batch_size` parameters to parallelize the `nomic_embed` verb and reduce the processing time of large files; a sketch of the pattern is below. Are there any recommended approaches or additional parameters I should consider for processing large files with GraphRAG?
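Roughly, the pattern looks like this. It is a simplified sketch rather than my exact code: `embed_batch` is a hypothetical stand-in for the actual Nomic embedding call, and the default parameter values are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the actual Nomic embedding API call.
def embed_batch(texts: list[str]) -> list[list[float]]:
    return [[0.0] * 768 for _ in texts]  # placeholder vectors

def nomic_embed(
    texts: list[str], batch_size: int = 256, num_threads: int = 8
) -> list[list[float]]:
    # Split the input into batches so each API call stays small,
    # then embed the batches concurrently across a thread pool.
    batches = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(embed_batch, batches)  # map() preserves batch order
    # Flatten per-batch results back into one list, preserving input order.
    return [vec for batch in results for vec in batch]
```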