Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser)
Apache License 2.0
109 stars, 15 forks
TLDR-656 add images extraction to ArticleReader #435
ArticleReader (parse figure tag)
need_content_analysis parameter
GROBID_URL
AttachAnnotation adding to PDF documents when with_attachments=false
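
To illustrate the figure tag parsing mentioned for ArticleReader, here is a minimal sketch of pulling figure entries out of GROBID-style TEI XML using only the Python standard library. It is not Dedoc's implementation: the function name extract_figures, the returned dictionary keys, and the hand-written sample document are all assumptions made for illustration. The sketch also assumes that tables appear as figure elements with type="table" and should be skipped.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID output; the "tei" prefix is our own local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hand-written sample (an assumption, not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <figure xml:id="fig_0">
      <head>Figure 1</head>
      <figDesc>Pipeline overview.</figDesc>
      <graphic coords="1,72.0,90.0,200.0,120.0" type="bitmap"/>
    </figure>
    <figure type="table" xml:id="tab_0">
      <head>Table 1</head>
      <figDesc>Results.</figDesc>
    </figure>
  </body></text>
</TEI>"""


def extract_figures(tei_xml):
    """Return one dict per non-table <figure> element (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    figures = []
    for fig in root.iterfind(".//tei:figure", TEI_NS):
        # Skip tables, which share the <figure> tag in this sample.
        if fig.get("type") == "table":
            continue
        graphic = fig.find("tei:graphic", TEI_NS)
        figures.append({
            "id": fig.get(XML_ID),
            "head": fig.findtext("tei:head", default="", namespaces=TEI_NS).strip(),
            "description": fig.findtext("tei:figDesc", default="", namespaces=TEI_NS).strip(),
            # Coordinates are only present when a <graphic> child carries them.
            "coords": graphic.get("coords") if graphic is not None else None,
        })
    return figures


if __name__ == "__main__":
    for figure in extract_figures(SAMPLE_TEI):
        print(figure)
```

The coords string is kept unparsed here. Cropping an image from the source PDF would need the page number and bounding box it encodes, which is beyond this sketch.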