extract file words - GitHub Issues

ling-chun commented 1 month ago

Thanks for your brilliant work! It has helped me a lot. I would like to know if there is a simple way to extract the raw text from the result images. I have a PDF file that contains an academic essay. I split the PDF into several page images and ran prediction on each one separately, which gave me several result images. What I actually want is the structured text from the source PDF. Is there a simple way to do that?

For example, take your paper (https://arxiv.org/pdf/2410.12628) as the source file. I would like to get a result in the following JSON format:

{ element_id:1, type:title, content:DOCLAYOUT-YOLO: ENHANCING DOCUMENT LAYOUT ANALYSIS THROUGH DIVERSE SYNTHETIC DATA AND GLOBAL-TO-LOCAL ADAPTIVE PERCEPTION }
{ element_id:2, type:plain, content:Zhiyuan Zhao∗, Hengrui Kang∗, Bin Wang, Conghui He †Shanghai Artificial Intelligence Laboratory }
{ element_id:3, type:title, content:ABSTRACT }
...

wangbinDL commented 1 month ago

@ling-chun You might find PDF-Extract-Kit and MinerU to be exactly what you need. PDF-Extract-Kit offers a range of document parsing models that allow for customization. For instance, you can use DocLayout-YOLO for region detection and then apply PaddleOCR for character recognition within those regions. On the other hand, MinerU serves as an all-in-one document parsing tool that takes a PDF as input and outputs the corresponding Markdown result. It also provides JSON results from each model if required.
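
As a rough illustration of the detect-then-recognize route, here is a minimal sketch of the glue step that turns detected regions into the JSON records asked for above. It is not code from PDF-Extract-Kit or MinerU. `Region`, `build_elements`, and the `read_text` callback are hypothetical names: `read_text` stands in for whatever OCR call you run on a cropped region, and the regions stand in for layout-detector output. The top-to-bottom, left-to-right sort is an assumption that only holds for single-column pages.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# (x0, y0, x1, y1) in page-image pixels; assumed box convention.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Region:
    """One detected layout region (hypothetical container type)."""
    box: Box
    label: str  # e.g. "title" or "plain"


def build_elements(
    regions: Iterable[Region],
    read_text: Callable[[Box], str],
) -> List[dict]:
    """Order regions and attach recognized text to each one.

    read_text is a placeholder for the OCR step: it receives a box
    and returns the text found inside it.
    """
    # Naive reading order: sort by top edge, then by left edge.
    ordered = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    return [
        {
            "element_id": index,
            "type": region.label,
            "content": read_text(region.box).strip(),
        }
        for index, region in enumerate(ordered, start=1)
    ]


# Stand-in data so the sketch runs without any model installed.
demo_regions = [
    Region((50, 300, 550, 330), "title"),
    Region((50, 40, 550, 120), "title"),
    Region((50, 140, 550, 200), "plain"),
]
demo_text = {
    (50, 40, 550, 120): "PAPER TITLE",
    (50, 140, 550, 200): "Author names and affiliation",
    (50, 300, 550, 330): " ABSTRACT ",
}

demo_elements = build_elements(demo_regions, lambda box: demo_text[box])
print(json.dumps(demo_elements, indent=2))
```

In a real pipeline you would replace `demo_regions` with the boxes and class names from the layout model, and `read_text` with a function that crops the page image to the box and runs OCR on the crop. Multi-column pages need a smarter ordering than this sort.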

ling-chun commented 1 month ago

Thanks for your comment, I'll try it.

opendatalab / DocLayout-YOLO

extract file words #3