Closed · fuckqqcom closed this 2 months ago
python -m graphrag.index --root ./ragtest
A CSV with 2000+ news articles has been indexing for over 10 hours. Is there another way to initialize it? For example, could I split it into 100 smaller documents, index each one separately, and then directly merge the contents of the output folders?
Are you using the OpenAI API or Azure OpenAI, or some other model? Indexing with other models seems to hit similar problems.
Does parsing the CSV report any errors?
No errors, it's just very slow. One thing I can confirm is that the articles are quite long; some run to more than 10,000 Chinese characters.
I'm using a third-party model; the API is stable and rarely errors. I kept the generated settings file at its defaults and only changed concurrent_requests: 10 and batch_size: 5.
Did you change the chunk size to 1200 and the overlap to 100?
How much concurrency does your LLM model service support?
No, I haven't changed the chunk size.
chat: deepseek-chat (DeepSeek), concurrent_requests: 10; embeddings: embedding-2 (Zhipu), concurrent_requests: 5
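For context, here is a minimal sketch of how those values map onto the generated settings.yaml, assuming both services expose OpenAI-compatible endpoints; the api_base URLs are examples only, so check them against your providers' documentation:

```yaml
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat
  model: deepseek-chat
  api_base: https://api.deepseek.com/v1      # example endpoint, verify for your provider
  concurrent_requests: 10                    # parallel chat requests during extraction

embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: embedding-2
    api_base: https://open.bigmodel.cn/api/paas/v4   # example endpoint, verify for your provider
    concurrent_requests: 5                   # parallel embedding requests
    batch_size: 5                            # texts sent per embedding request
```

With third-party services the effective speed is usually capped by the provider's rate limits, so raising concurrent_requests beyond what the endpoint tolerates tends to produce retries rather than extra throughput.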
How are your chunks divided? Or should chunks be divided according to the OpenAI tiktoken tokenizer?
Chunking just uses the defaults in the settings.yaml generated by `python -m graphrag.index --init --root ./ragtest`; I only modified some of the llm and embeddings parameters. At the same time, I'm currently loading the data in batches and calling `run_pipeline_with_config` for each batch.
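For reference, a rough sketch of that batched invocation is below. It assumes `run_pipeline_with_config` accepts a settings.yaml path plus an optional pandas `dataset` and yields per-workflow results with `workflow` and `errors` fields, as in graphrag 0.x; the CSV path and batch size are placeholders, and the exact signature should be checked against your installed version.

```python
import asyncio

import pandas as pd

from graphrag.index import run_pipeline_with_config


async def index_in_batches(csv_path: str, config_path: str, batch_size: int = 100) -> None:
    """Run the indexing pipeline over the news CSV one batch at a time."""
    df = pd.read_csv(csv_path)
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start : start + batch_size]
        # run_pipeline_with_config is an async generator of per-workflow results
        async for result in run_pipeline_with_config(config_path, dataset=batch):
            if result.errors:
                print(f"rows {start}-{start + len(batch) - 1}: "
                      f"workflow {result.workflow} reported errors: {result.errors}")


asyncio.run(index_in_batches("./ragtest/input/news.csv", "./ragtest/settings.yaml"))
```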
For the official chunking logic you are using, you can refer to @KylinMountain's suggestion and try increasing the chunk size. It is best to go through the logs to pin down where the time is being spent.
This is the original discussion about chunk size, which should reduce the total number of requests and your token consumption: https://github.com/microsoft/graphrag/discussions/460
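Concretely, that suggestion corresponds to the chunks section of settings.yaml; with articles running to 10,000+ characters, larger token-based chunks mean far fewer extraction requests. The values below are the suggested 1200/100, assuming the default tiktoken-based chunking:

```yaml
chunks:
  size: 1200               # tokens per chunk; larger chunks -> fewer LLM extraction calls
  overlap: 100             # tokens shared between consecutive chunks
  group_by_columns: [id]   # keep chunks from spanning multiple source documents
```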
ok,tks