Closed ekuleshov closed 1 year ago
If you use the extractText method, there is a layoutText boolean. Maybe it does what you want.
String lines = PdfTextExtractor(document).extractText(layoutText: true);
print(lines);
I didn't test this. Just guessing. You may also want to look at this Linux utility: https://manpages.debian.org/experimental/poppler-utils/pdftotext.1.en.html
It has the most powerful pdf text extraction engine that I have found so far. I use it in an application by making an os call and reading the txt file. It has a -layout option for this purpose.
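The OS-call approach described above could look roughly like the following sketch. It assumes the poppler-utils `pdftotext` binary is installed and on PATH, and it reads the result from stdout (passing `-` as the output file) rather than from a temporary txt file. The helper names `pdftotextArgs` and `extractWithPdftotext` are illustrative, not part of any library.

```dart
import 'dart:io';

/// Builds the argument list for `pdftotext`: `-layout` asks it to keep the
/// physical layout, and `-` as the output file sends the text to stdout.
List<String> pdftotextArgs(String pdfPath) => ['-layout', pdfPath, '-'];

/// Runs `pdftotext` on [pdfPath] and returns the extracted text.
/// Assumption: the poppler-utils binary is available on PATH.
Future<String> extractWithPdftotext(String pdfPath) async {
  final args = pdftotextArgs(pdfPath);
  final result = await Process.run('pdftotext', args);
  if (result.exitCode != 0) {
    throw ProcessException(
        'pdftotext', args, result.stderr.toString(), result.exitCode);
  }
  return result.stdout as String;
}
```

Because this relies on `dart:io` and an external binary, it fits desktop or server code rather than a Flutter web or mobile app.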
Hi ekuleshov,
It is not possible to extract lines with spaces using "extractTextLines" when the text on the same line was rendered at different times. To get all the values in the lines separately, please try the code snippet below and let us know whether it satisfies your requirement.
List<TextLine> textlines = PdfTextExtractor(document).extractTextLines(startPageIndex: 0);
for (TextLine line in textlines) {
  List<TextWord> textWords = line.wordCollection;
  for (TextWord word in textWords) {
    print(word.text);
  }
}
To get the layout of the page, we can use "extractText",
extractor.extractText(startPageIndex: 0, layoutText: true);
Please refer to the documentation link below: https://www.syncfusion.com/kb/11967/how-to-extract-text-from-a-pdf-file-in-syncfusion-flutter-pdf-library
Please let us know if you need any further assistance in this.
Regards, Gowthamraj K
To get all the values in the lines separately, we can use below code snippet on your end and let us know if it is satisfies your requirement or not.
I tried your example and all text from these lines is returned as a single word, which is not what I can use for processing:
33,754355150010114,65
42,615280160000030,59
91,715485190010049,55
101,59313552100021,97
112,4938100200000040,10
122,616075150000028,67
132,513475150000029,84
144,1152408000009,72
153,116580160000025,67
It seems like the problem is that the PdfTextExtractor is gluing together all text on the same line that is rendered with the same font (or something like that). It needs to somehow take into account the location of the text and maybe insert some separator between each group.
I am facing the same issue when reading data from a table. The table row is concatenated into one TextWord. This makes it impossible to differentiate single words from the document and get their position. Using extractText and layout set to true adds at least some space. But when using extractTextLines this option is not available. Any chance to look into this issue?
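The idea raised above, taking the text location into account and inserting a separator between groups, can be sketched independently of any PDF library. The `PositionedText` class and `joinWithSeparators` function below are hypothetical; they assume the extractor can provide a left and right x-coordinate for each piece of text (for example per glyph or per word), and the `minGap` threshold is an arbitrary illustrative value.

```dart
/// A piece of text with its horizontal extent on the page.
/// Hypothetical stand-in for whatever bounds the extractor reports.
class PositionedText {
  final String text;
  final double left;
  final double right;
  const PositionedText(this.text, this.left, this.right);
}

/// Joins [pieces] from left to right and inserts [separator] wherever the
/// horizontal gap between two neighbours is larger than [minGap].
String joinWithSeparators(List<PositionedText> pieces,
    {String separator = '\t', double minGap = 3.0}) {
  // Sort a copy so the caller's list is left untouched.
  final sorted = [...pieces]..sort((a, b) => a.left.compareTo(b.left));
  final buffer = StringBuffer();
  for (var i = 0; i < sorted.length; i++) {
    if (i > 0 && sorted[i].left - sorted[i - 1].right > minGap) {
      buffer.write(separator);
    }
    buffer.write(sorted[i].text);
  }
  return buffer.toString();
}
```

A fixed threshold is the simplest choice; a real table reader would probably want to scale it with the font size.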
We suspect that the issue may be specific to the PDF document.
Kindly provide us with the PDF that shows the issue so that we can analyze it further.
Thanks for picking this up. You can find a PDF which results in a single word per text line at this URL: https://www.dsab-vfs.de/VFSProject/WebObjects/VFSProject.woa/wa/rangListen?liga=5703&typ=partienplanPDF&saison=3304
@irfanajaffer @GowthamrajKumar25 Any update on this?
For this file 20220828.pdf
When I use PdfTextExtractor(document).extractText(..) with the parameter layoutText set to false, it returns every column from the PDF on a new row:
1
142
21,22
6,6918
160,0000
100,00
2
146
22,19
6,5795
157,3159
98,32
...
But with the layoutText parameter set to true the result looks like this, i.e. all column data is crammed together and the extracted text doesn't give any way to separate the columns:
114221,226,6918160,0000100,00
214622,196,5795157,315998,32
...
Either way it is hard to work with this data.
Looking at the PDF extractor code:
If I change line 1432 to something like this:
resultantText += currentText! + wordSeparator; // add word separator such as '\t'
Then I get the following desired result.
1 142 21,22 6,6918 160,0000 100,00
2 146 22,19 6,5795 157,3159 98,32
...
Perhaps you could implement a simple fix and allow passing an optional wordSeparator argument to the PdfTextExtractor(document).extractText(..) method.
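Without such an option, one possible workaround is to post-process the output obtained with layoutText set to false, where every cell arrives on its own line. The sketch below assumes the table has a known, fixed number of columns and no empty cells; `regroupRows` is a hypothetical helper, not part of the library.

```dart
/// Regroups text that has one table cell per line into rows of [columns]
/// cells joined by [separator]. Blank lines are skipped.
/// Assumption: every row has exactly [columns] non-empty cells.
List<String> regroupRows(String text, int columns, {String separator = '\t'}) {
  final cells = text
      .split('\n')
      .map((line) => line.trim())
      .where((line) => line.isNotEmpty)
      .toList();
  final rows = <String>[];
  for (var i = 0; i < cells.length; i += columns) {
    // The last row may be shorter if the cell count is not a multiple.
    final end = i + columns > cells.length ? cells.length : i + columns;
    rows.add(cells.sublist(i, end).join(separator));
  }
  return rows;
}
```

With the sample values from 20220828.pdf and six columns, this produces one separated row per record. It breaks down as soon as a cell is empty, which is why a separator emitted by the extractor itself would be more robust.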
@GowthamrajKumar25 @irfanajaffer any chance this issue could be addressed with the above fix or enhancement option?
In the provided PDF document “20220828.pdf”, the PDF contains embedded data that cannot be laid out without retrieving the respective font data. So kindly use the extractTextLines() method to lay out the text instead of extractText() with layoutText.
Kindly use the following simple code for layout using the extractTextLines() method:
PdfDocument document = PdfDocument(inputBytes: inputBytes);
PdfTextExtractor extractor = PdfTextExtractor(document);
String extractedText = '';
List

In the document “Spielplan.pdf”, we confirmed the issue “Extract text line words are not split properly for specific document” as a defect in our product. The fix will be included in our upcoming weekly release, which will be available on March 28, 2023.
Use the below feedback link to track the status of the reported bug.
Disclaimer: “Inclusion of this solution in the weekly release may change due to other factors including but not limited to QA checks and works reprioritization.”
@irfanajaffer I tried your suggestion. It does work on the example I posted above, but it still does not produce separated fields in several other examples I tried.
In comparison, the change I suggested above does produce separated text values in all my test PDFs.
Unfortunately those PDFs have some PII data and I'm not comfortable sharing them in public. But if it would help to address this issue I can email them to you.
On a side note, when trying to extract text from PDFs generated with GhostScript printer, the extracted text has 0x00 values after every character in the result text. I can clean it up in post-processing, but it would be nice if PDF library did that.
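The post-processing cleanup mentioned above can be as small as the following sketch. `stripNulChars` is a hypothetical helper, and it assumes the 0x00 values are pure noise (a guess would be two-byte encoded text being read one byte at a time) rather than meaningful content.

```dart
/// Removes every NUL (0x00) character from the extracted text.
/// Assumption: the NULs carry no information and can be dropped safely.
String stripNulChars(String text) => text.replaceAll('\u0000', '');
```

It can be applied to the string returned by extractText or to each word's text before further parsing.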
The reported problem ”Extract text line words are not split properly for the specific document” is resolved in our latest flutter package release version 21.1.37.
Please refer to the package below:
https://pub.dev/packages/syncfusion_flutter_pdf/versions/21.1.37
@ekuleshov , We understand that the security and confidentiality of your files are of utmost importance to you, and we want to assure you that your files are safe with us.
We have robust security measures in place to ensure that your files are protected from unauthorized access, theft, or loss. Additionally, we adhere to strict privacy policies and industry best practices to safeguard your data.
Once the analysis is completed, we will promptly delete your files from our systems. We do not retain any customer data beyond what is necessary to complete the requested service.
We take great pride in our commitment to providing secure and reliable services to our customers. Should you have any further questions or concerns, please do not hesitate to contact us.
@irfanajaffer I will check my PDF files with the last release. Thank you.
Regarding sharing them with you, I don't have concerns with you retaining them for testing, but do have concerns with attaching them to this public issue tracker. Like I said, if it would help to diagnose issues, I can email them to you, but can't post/attach them here.
Thank you for your willingness to share the files with me for testing purposes. I completely understand your concerns about attaching them to a public issue tracker, and I appreciate your offering to email them to me instead. Rest assured that any files you share with me will be kept confidential and used solely for the purpose of diagnosing and resolving the issues you have reported.
Please send the files to my email address [support@syncfusion.com] at your earliest convenience, and I will let you know once I have received them.
Thank you again for your cooperation and for helping us improve our service
I'm trying to use the latest version of syncfusion_flutter_pdf to extract lines of text from PDF files using the following code:
It pulls all the text fine; however, for table-formatted data it skips any field separators.
So I get this output:
and the same data looks like this in the original PDF:
I wonder if it would be possible to add some separator between individual fields (e.g. \t or similar) to preserve the structure of the data. For example, Apache PDFBox inserts spaces between these fields.