Closed by ling-chun 1 month ago
@ling-chun You might find PDF-Extract-Kit and MinerU to be exactly what you need. PDF-Extract-Kit offers a range of document parsing models that allow for customization. For instance, you can use DocLayout-YOLO for region detection and then apply PaddleOCR for character recognition within those regions. On the other hand, MinerU serves as an all-in-one document parsing tool that takes a PDF as input and outputs the corresponding Markdown result. It also provides JSON results from each model if required.
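The region-then-text idea above can be sketched in plain Python. This is only an illustration under stated assumptions: `Region`, `Word`, and `build_elements` are hypothetical names, not part of PDF-Extract-Kit, MinerU, DocLayout-YOLO, or PaddleOCR. It assumes you have already converted the layout detections and the recognized words into simple bounding boxes, and that the page is single-column so a top-to-bottom sort approximates reading order. Each output record carries an `element_id`, a `type`, and a `content` field.

```python
import json
from dataclasses import dataclass


@dataclass
class Region:
    """One layout detection (hypothetical shape, not a library type)."""
    type: str   # e.g. "title" or "plain"
    box: tuple  # (x0, y0, x1, y1) in page pixels


@dataclass
class Word:
    """One recognized word with its bounding box (hypothetical shape)."""
    text: str
    box: tuple  # (x0, y0, x1, y1) in page pixels


def center(box):
    """Return the center point of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(box, point):
    """Return True if the point lies inside the box (edges included)."""
    x0, y0, x1, y1 = box
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def build_elements(regions, words):
    """Emit one record per region, ordered top-to-bottom then left-to-right.

    A word belongs to a region when the word's center falls inside it.
    """
    ordered = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    elements = []
    for element_id, region in enumerate(ordered, start=1):
        inside = [w for w in words if contains(region.box, center(w.box))]
        # Naive reading order inside a region: top edge first, then left edge.
        inside.sort(key=lambda w: (w.box[1], w.box[0]))
        elements.append({
            "element_id": element_id,
            "type": region.type,
            "content": " ".join(w.text for w in inside),
        })
    return elements


# Toy input, deliberately out of order to show the sorting.
regions = [
    Region("plain", (50, 120, 550, 160)),
    Region("title", (50, 40, 550, 100)),
]
words = [
    Word("paragraph.", (120, 125, 230, 150)),
    Word("Title", (150, 50, 220, 80)),
    Word("First", (60, 125, 110, 150)),
    Word("Sample", (60, 50, 140, 80)),
]

print(json.dumps(build_elements(regions, words), indent=2))
```

Matching on the word center rather than requiring full containment tolerates boxes that spill slightly over a region edge. Multi-column pages would need a real reading-order step in place of the simple sort.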
Thanks for your comment, I'll try it.
Thanks for your brilliant work! It has helped me a lot. Is there a simple way to extract the raw text from the result images? I have a PDF file that contains an academic essay. I split the PDF into page images and ran prediction on each one separately, which gave me several result images. What I would like, though, is the structured text from the source PDF. Is there a simple way to do that?

For example, taking your paper (https://arxiv.org/pdf/2410.12628) as the source file, I would like to get a result in the following JSON format:

{ element_id:1, type:title, content:DOCLAYOUT-YOLO: ENHANCING DOCUMENT LAYOUT ANALYSIS THROUGH DIVERSE SYNTHETIC DATA AND GLOBAL-TO-LOCAL ADAPTIVE PERCEPTION }
{ element_id:2, type:plain, content:Zhiyuan Zhao∗, Hengrui Kang∗, Bin Wang, Conghui He †Shanghai Artificial Intelligence Laboratory }
{ element_id:3, type:title, content:ABSTRACT }
...