microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: Data initialization is very slow #699

Closed fuckqqcom closed 2 months ago

fuckqqcom commented 3 months ago
python -m graphrag.index --root ./ragtest

A CSV file with 2,000+ news articles took over ten hours to index. Is there another way to initialize the data, e.g. splitting it into 100 smaller documents, indexing each separately, and then directly merging the contents under the output folder?
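For what it's worth, the splitting step described here might look like the sketch below; the file names, directory layout, and 100-row batch size are all illustrative, and whether the per-batch output folders can simply be merged afterwards is exactly the open question.

```python
# Hypothetical sketch of the splitting step described above: file names,
# directory layout, and the 100-row batch size are all illustrative.
import os

import pandas as pd

df = pd.read_csv("news.csv")  # the ~2,000-article CSV (assumed name)
batch_size = 100
os.makedirs("batches", exist_ok=True)

for i, start in enumerate(range(0, len(df), batch_size)):
    df.iloc[start:start + batch_size].to_csv(
        f"batches/batch_{i:03d}.csv", index=False
    )
```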

zejiancai commented 3 months ago
> python -m graphrag.index --root ./ragtest
>
> A CSV file with 2,000+ news articles took over ten hours to index. Is there another way to initialize the data, e.g. splitting it into 100 smaller documents, indexing each separately, and then directly merging the contents under the output folder?

Are you using the OpenAI API or Azure OpenAI, or some other model? Indexing seems to hit similar slowness with other models.

boxingYi commented 3 months ago

Did parsing the CSV produce any errors?

fuckqqcom commented 3 months ago

> Did parsing the CSV produce any errors?

No errors, it's just very slow. One thing I can confirm is that the news articles are quite long; some run to over 10,000 Chinese characters.

fuckqqcom commented 3 months ago

> > A CSV file with 2,000+ news articles took over ten hours to index. Is there another way to initialize the data, e.g. splitting it into 100 smaller documents, indexing each separately, and then directly merging the contents under the output folder?
>
> Are you using the OpenAI API or Azure OpenAI, or some other model? Indexing seems to hit similar slowness with other models.

I'm using another third-party model; the API is stable and rarely errors. I kept the default settings file and only changed concurrent_requests: 10 and batch_size: 5.
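For reference, in the settings.yaml generated by `--init`, `concurrent_requests` sits under the `llm` block and `batch_size` under `embeddings`; a minimal sketch, assuming the graphrag 0.x settings layout, with the two values from this comment:

```yaml
llm:
  # chat-model settings ...
  concurrent_requests: 10  # value from this comment

embeddings:
  batch_size: 5            # value from this comment
  llm:
    # embedding-model settings ...
```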

KylinMountain commented 3 months ago

Did you change the chunk size to 1200 and the overlap to 100?
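For context, these two numbers map onto the `chunks` section of the generated settings.yaml; a minimal sketch, assuming the graphrag 0.x layout (`group_by_columns` is the generated default):

```yaml
chunks:
  size: 1200              # tokens per chunk
  overlap: 100            # tokens of overlap between consecutive chunks
  group_by_columns: [id]
```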

Nuclear6 commented 3 months ago

How much concurrency does your LLM model service support?

fuckqqcom commented 3 months ago

> Did you change the chunk size to 1200 and the overlap to 100?

No.

fuckqqcom commented 3 months ago

> How much concurrency does your LLM model service support?

chat: deepseek-chat (DeepSeek) with concurrent_requests: 10; embeddings: embedding-2 (Zhipu) with concurrent_requests: 5.
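A sketch of how those two model blocks might look in settings.yaml; the `type` values follow graphrag's OpenAI-compatible settings, and both `api_base` endpoints are assumptions for illustration, not taken from this thread:

```yaml
llm:
  type: openai_chat
  model: deepseek-chat
  api_base: https://api.deepseek.com/v1             # assumed endpoint
  concurrent_requests: 10

embeddings:
  llm:
    type: openai_embedding
    model: embedding-2
    api_base: https://open.bigmodel.cn/api/paas/v4  # assumed endpoint
    concurrent_requests: 5
```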

Nuclear6 commented 3 months ago

How are your chunks divided? Are they divided by token counts from OpenAI's tiktoken?

fuckqqcom commented 3 months ago

> How are your chunks divided? Are they divided by token counts from OpenAI's tiktoken?

The settings.yaml file was generated by `python -m graphrag.index --init --root ./ragtest`; I keep its default parameters and only modify some of the llm and embeddings settings. At the same time, I'm currently loading the data in batches and calling the run_pipeline_with_config function for each batch.
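A minimal sketch of that batched approach, assuming the graphrag 0.x Python API, where `run_pipeline_with_config` is exported from `graphrag.index`, accepts a pipeline config path plus an optional `dataset` DataFrame, and yields one `PipelineRunResult` per workflow; the file paths are illustrative:

```python
import asyncio

import pandas as pd

from graphrag.index import run_pipeline_with_config


async def index_in_batches(config_path: str, csv_path: str,
                           batch_size: int = 100) -> None:
    df = pd.read_csv(csv_path)
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start:start + batch_size]
        # One PipelineRunResult is yielded per workflow step.
        async for result in run_pipeline_with_config(config_path, dataset=batch):
            if result.errors:
                print(f"workflow {result.workflow} failed: {result.errors}")


# "pipeline.yaml" and "news.csv" are hypothetical paths.
asyncio.run(index_in_batches("./ragtest/pipeline.yaml", "news.csv"))
```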

Nuclear6 commented 3 months ago

For the official chunking logic you are using, you can follow @KylinMountain's suggestion and try increasing the chunk size. It is best to go through the logs carefully to identify where the time is being spent.

KylinMountain commented 3 months ago

This is the original discussion about chunk size, which should decrease the total number of requests and your token consumption: https://github.com/microsoft/graphrag/discussions/460
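As a rough illustration of why this helps (assuming about one token per Chinese character and generated defaults of size 300 / overlap 100): each chunk advances by size − overlap tokens, so a 10,000-token article produces about (10000 − 100) / (300 − 100) ≈ 50 chunks at the defaults, but only (10000 − 100) / (1200 − 100) ≈ 9 chunks at 1200/100, i.e. roughly 5x fewer entity-extraction requests.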

fuckqqcom commented 3 months ago

> This is the original discussion about chunk size, which should decrease the total number of requests and your token consumption: #460

ok, thanks

fuckqqcom commented 3 months ago

> For the official chunking logic you are using, you can follow @KylinMountain's suggestion and try increasing the chunk size. It is best to go through the logs carefully to identify where the time is being spent.

ok, thanks