mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
https://mindee.github.io/doctr/
Apache License 2.0

How to convert pdf to markdown? #1579

Closed MonolithFoundation closed 5 months ago

MonolithFoundation commented 5 months ago

I can't find any clue on how to do this, including tables and formulas.

felixdittrich92 commented 5 months ago

Hi,

docTR can't extract structured tabular data at the moment (providing a solution is already on the roadmap). The same goes for formulas (integration is not planned yet).

But you can combine it with other tools:
formula: https://github.com/lukas-blecher/LaTeX-OCR
table: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Table%20Transformer/Using_Table_Transformer_for_table_detection_and_table_structure_recognition.ipynb

To extract the remaining text, plus the text instances inside the structured table, you could use docTR.
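
For the plain-text part, a minimal sketch could look like the following. It assumes docTR is installed with a PyTorch or TensorFlow backend and relies on the nested pages / blocks / lines / words dictionary returned by `result.export()`. The helper names `export_to_markdown` and `pdf_to_markdown`, and the `## Page N` headings, are illustrative choices and not part of docTR. Tables and formulas are not handled here; they would still need the tools linked above.

```python
from typing import Any, Dict, List


def export_to_markdown(export: Dict[str, Any]) -> str:
    """Turn a docTR-style export dict into simple Markdown paragraphs.

    Hypothetical helper: each page gets a heading, each block becomes
    one paragraph, and the words of a line are joined with spaces.
    """
    parts: List[str] = []
    for page_number, page in enumerate(export.get("pages", []), start=1):
        parts.append(f"## Page {page_number}")
        for block in page.get("blocks", []):
            lines = [
                " ".join(word["value"] for word in line.get("words", []))
                for line in block.get("lines", [])
            ]
            # Skip empty lines so blank blocks do not produce empty paragraphs.
            text = "\n".join(line for line in lines if line)
            if text:
                parts.append(text)
    return "\n\n".join(parts) + "\n"


def pdf_to_markdown(pdf_path: str) -> str:
    """Run docTR OCR on a PDF and return its text as Markdown."""
    # Imported lazily so export_to_markdown works without docTR installed.
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    doc = DocumentFile.from_pdf(pdf_path)   # one image per PDF page
    model = ocr_predictor(pretrained=True)  # default detection + recognition models
    result = model(doc)
    return export_to_markdown(result.export())
```

Calling `pdf_to_markdown("document.pdf")` (the path is a placeholder) returns a string that can be written to a `.md` file. Table regions found by a separate detector could be cropped and passed through the same OCR step, then assembled into Markdown tables by your own code.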