opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.19k stars 835 forks

[Question] Is this solution best for creating knowledge base for AI/LLM memory in your opinion? #468

Closed HakaishinShwet closed 2 weeks ago

HakaishinShwet commented 3 weeks ago

This tool can extract data from complex files, so do you think it is a good solution for extracting data and building a knowledge base for an LLM?

drunkpig commented 3 weeks ago

@HakaishinShwet You are right: this project was developed to produce high-quality corpora. Whether for pre-training corpora for large models or for RAG applications, MinerU is highly suitable.