ispras / dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser)
Apache License 2.0

Cannot extract tables using PdfTxtlayerReader #373

Closed ValiullinAlbert closed 7 months ago

ValiullinAlbert commented 7 months ago

When I read a document using PdfTxtlayerReader, there are no tables even though the document contains some. This document is copyable. Could you tell me what I am doing wrong?

NastyBoget commented 7 months ago

Try passing need_pdf_table_analysis="True" in the parameters, e.g.:

from dedoc.readers import PdfTxtlayerReader

# file_path is the path to your PDF document
reader = PdfTxtlayerReader(config={})
document = reader.read(file_path, parameters=dict(need_pdf_table_analysis="True"))

If this doesn't help, could you please send us the document and the code you are using to handle it? We'll check it and try to fix it as soon as possible.

NastyBoget commented 7 months ago

One more note: try using PdfTabbyReader instead; it also works with copyable PDF documents.

ValiullinAlbert commented 7 months ago

Thank you! Attached file: table.pdf. Using need_pdf_table_analysis="True" in the parameters didn't help. The code:

from dedoc.readers import PdfTxtlayerReader, PdfTabbyReader

filepath = "table.pdf"
reader = PdfTxtlayerReader(config={"n_jobs": 1})
document = reader.read(filepath, parameters=dict(need_pdf_table_analysis="True"))
print(document.tables)  # no tables are returned here

However, it works when I use PdfTabbyReader.

NastyBoget commented 7 months ago

I guess the problem is that PdfTxtlayerReader extracts tables from page images without using the textual layer :( Since the table has a colored header, table detection didn't work properly, and the table's text ended up in the document's lines :( In contrast, PdfTabbyReader uses the textual layer to obtain tables from the PDF, so the processing was successful.

We will try to solve the problem with colored tables in the future.

Do you still have a problem?
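Given the difference between the two readers described above, one possible workaround is to try several readers in turn and keep the first result that contains tables. The snippet below is only a minimal sketch of that idea, not something dedoc provides: read_with_table_fallback and build_default_readers are hypothetical helper names, and it assumes that PdfTabbyReader accepts the same config argument as PdfTxtlayerReader.

```python
from typing import Any, Iterable, Optional, Tuple


def read_with_table_fallback(
    readers: Iterable[Any],
    file_path: str,
    parameters: Optional[dict] = None,
) -> Tuple[Optional[str], Any]:
    """Try each reader in order; return (reader class name, document).

    Stops at the first reader whose document has a non-empty `tables`
    attribute. If no reader finds tables, the last result is returned.
    Hypothetical helper, not part of dedoc.
    """
    last_result: Tuple[Optional[str], Any] = (None, None)
    for reader in readers:
        document = reader.read(file_path, parameters=parameters or {})
        last_result = (type(reader).__name__, document)
        if document.tables:
            return last_result
    return last_result


def build_default_readers() -> list:
    """Build the two readers discussed in this thread (hypothetical helper)."""
    # Imported lazily so the fallback helper can be used without dedoc installed.
    from dedoc.config import get_config
    from dedoc.readers import PdfTabbyReader, PdfTxtlayerReader

    config = get_config()
    # Assumption: both readers take the same `config` keyword argument.
    return [PdfTxtlayerReader(config=config), PdfTabbyReader(config=config)]
```

Possible usage: name, document = read_with_table_fallback(build_default_readers(), "table.pdf"), then check which reader name was returned and inspect document.tables.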

NastyBoget commented 7 months ago

The PDF handling section contains a description of the parameter pdf_with_text_layer: if it's true, PdfTxtlayerReader is used; if it's tabby, PdfTabbyReader is used. The description also says a little about table handling: PdfTxtlayerReader extracts tables the same way PdfImageReader does (the false option).
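To make the mapping above concrete, here is a small sketch that records which reader each pdf_with_text_layer value selects, as stated in this comment. The names READER_BY_TEXT_LAYER_OPTION, reader_name_for and parse_with_option are hypothetical and only illustrate the idea; the last function assumes that dedoc exposes DedocManager with a parse(file_path, parameters=...) method. Any other values the parameter may accept are deliberately left out because they are not mentioned here.

```python
# Reader selected by each pdf_with_text_layer value, as described in this thread.
READER_BY_TEXT_LAYER_OPTION = {
    "true": "PdfTxtlayerReader",
    "tabby": "PdfTabbyReader",
    "false": "PdfImageReader",
}


def reader_name_for(option: str) -> str:
    """Return the reader name for a pdf_with_text_layer value (hypothetical helper)."""
    key = str(option).strip().lower()
    if key not in READER_BY_TEXT_LAYER_OPTION:
        raise ValueError(f"Option not covered by this sketch: {option!r}")
    return READER_BY_TEXT_LAYER_OPTION[key]


def parse_with_option(file_path: str, option: str = "tabby"):
    """Parse a file through the dedoc pipeline with the given option.

    Assumption: dedoc provides DedocManager.parse(file_path, parameters=...).
    """
    reader_name_for(option)  # fail early on values this sketch does not cover
    from dedoc import DedocManager  # imported lazily; requires dedoc to be installed

    manager = DedocManager()
    return manager.parse(file_path, parameters={"pdf_with_text_layer": option})
```

For a copyable PDF whose tables are not detected with the true option, calling parse_with_option("table.pdf", "tabby") would be the variant to try first.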

ValiullinAlbert commented 7 months ago

The problem isn't only about colored headers. For example, it still doesn't work with the attached file: Table Example - Starting.pdf

NastyBoget commented 7 months ago

PdfTxtlayerReader can extract the attached image. We are aware of the problem with attached images in PdfTabbyReader and have already solved it (on the develop branch). We are going to make a new release at the end of this week.

ValiullinAlbert commented 7 months ago

Sorry, I meant that the attached file is my PDF file. I wanted to say that PdfTxtlayerReader also fails to recognize tables without colored headers.

NastyBoget commented 7 months ago

I'm sorry for the misunderstanding. Try using the config provided by dedoc (it worked for me):

from dedoc.config import get_config
from dedoc.readers import PdfTxtlayerReader

reader = PdfTxtlayerReader(config=get_config())
document = reader.read("Table.Example.-.Starting.pdf")
print(len(document.tables))

We will fix this behavior in the future. Thank you for noticing that.

ValiullinAlbert commented 7 months ago

Thank you, this works!