Closed ValiullinAlbert closed 7 months ago
Try to use need_pdf_table_analysis="True" in the parameters, e.g.:
from dedoc.readers import PdfTxtlayerReader
reader = PdfTxtlayerReader(config={})
document = reader.read(file_path, parameters=dict(need_pdf_table_analysis="True"))
If this doesn't help, could you please send us the document and the code you are using to process it? We'll try to check it and fix it as soon as possible.
One more note: try using PdfTabbyReader instead, it also works with copyable PDF documents.
Thank you!
table.pdf
It didn't help me when I used need_pdf_table_analysis="True" in the parameters.
The code:
from dedoc.readers import PdfTxtlayerReader, PdfTabbyReader
filepath = "table.pdf"
reader = PdfTxtlayerReader(config={"n_jobs": 1})
document = reader.read(filepath, parameters=dict(need_pdf_table_analysis="True"))
document.tables
However, it works when using PdfTabbyReader.
I guess the problem is that PdfTxtlayerReader extracts tables from page images without using the textual layer :( Because the table has a colored header, table detection didn't work properly, and the table's text ended up in the document's lines :( In contrast, PdfTabbyReader uses the textual layer to obtain tables from the PDF, so the processing was successful.
We will try to solve the problem with colored tables in the future.
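Until that is fixed, a workaround along the lines described above is to try PdfTxtlayerReader first and fall back to PdfTabbyReader when no tables come back. This is only a minimal sketch: read_tables_with_fallback and default_readers are hypothetical helper names, and the only dedoc calls assumed are the ones already shown in this thread (reader.read(path, parameters=...) and document.tables).

```python
from typing import Any, Iterable, List, Optional


def read_tables_with_fallback(
    file_path: str,
    readers: Iterable[Any],
    parameters: Optional[dict] = None,
) -> List[Any]:
    """Return the tables from the first reader that finds any, else an empty list.

    Each reader is assumed to expose read(path, parameters=...) and to return
    a document with a `tables` attribute, as in the snippets in this thread.
    """
    if parameters is None:
        parameters = {"need_pdf_table_analysis": "True"}
    for reader in readers:
        document = reader.read(file_path, parameters=parameters)
        if document.tables:
            return list(document.tables)
    return []


def default_readers() -> List[Any]:
    """Hypothetical helper: PdfTxtlayerReader first, PdfTabbyReader as fallback."""
    # Imported lazily so the function above can be used without dedoc installed.
    from dedoc.config import get_config
    from dedoc.readers import PdfTabbyReader, PdfTxtlayerReader

    config = get_config()
    return [PdfTxtlayerReader(config=config), PdfTabbyReader(config=config)]
```

With dedoc installed, usage would look like tables = read_tables_with_fallback("table.pdf", default_readers()). The order of the readers is a choice, not a requirement; putting PdfTabbyReader first simply skips the image-based table detection.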
Do you still have a problem?
Here, in the PDF handling section, is a description of the parameter pdf_with_text_layer: if it's true, then PdfTxtlayerReader is used; if it's tabby, PdfTabbyReader is used. The description contains a little information about table handling: PdfTxtlayerReader extracts tables the same way PdfImageReader does (the false option).
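To make the mapping above easier to keep in mind, here is a small sketch that encodes only what is stated in this thread. READER_BY_OPTION, reader_name_for and uses_text_layer_for_tables are hypothetical names, not part of dedoc, and any other values the parameter may accept are deliberately not covered.

```python
# Reader selected for each pdf_with_text_layer value, as described in this thread.
READER_BY_OPTION = {
    "true": "PdfTxtlayerReader",
    "tabby": "PdfTabbyReader",
    "false": "PdfImageReader",
}


def reader_name_for(pdf_with_text_layer: str) -> str:
    """Return the reader class name for a pdf_with_text_layer value."""
    key = str(pdf_with_text_layer).strip().lower()
    if key not in READER_BY_OPTION:
        raise ValueError(f"value not covered by this sketch: {pdf_with_text_layer!r}")
    return READER_BY_OPTION[key]


def uses_text_layer_for_tables(pdf_with_text_layer: str) -> bool:
    """True only for the reader that, per this thread, takes tables from the text layer.

    PdfTxtlayerReader extracts tables the same way PdfImageReader does,
    so only the tabby option qualifies.
    """
    return reader_name_for(pdf_with_text_layer) == "PdfTabbyReader"
```

For a copyable PDF whose tables are missed by image-based detection, this suggests passing pdf_with_text_layer="tabby" rather than "true".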
The problem isn't only about colored headers. For example, with the attached file it's still not working: Table Example - Starting.pdf
PdfTxtlayerReader can extract the attached image. We are aware of the problem with attached images in PdfTabbyReader and have already solved it (on the develop branch). We are going to make a new release at the end of this week.
Sorry, I meant that the attached file is my PDF file. I wanted to say that PdfTxtlayerReader also cannot recognize tables that have no colored headers.
I'm sorry for the misunderstanding. Try using the config provided by dedoc (it worked for me):
from dedoc.config import get_config
from dedoc.readers import PdfTxtlayerReader
reader = PdfTxtlayerReader(config=get_config())
document = reader.read("Table.Example.-.Starting.pdf")
print(len(document.tables))
We will fix this behavior in the future. Thank you for noticing it.
Thank you, this works!
When I read a document using PdfTxtlayerReader, there are no tables even though the document contains some. This document is copyable. Could you tell me what I am doing wrong?