PDF text extraction issue

ekuleshov commented 1 year ago

I'm trying to use the latest version of syncfusion_flutter_pdf to extract lines of text from PDF files using the following code:

    List<TextLine> lines = PdfTextExtractor(document).extractTextLines(startPageIndex: 0, endPageIndex: 0);
    for (TextLine line in lines) {
      print(line.text);
    }

It pulls all the text fine; however, for table-formatted data it drops the field separators.

So I get this output:

33,754355150010114,65
42,615280160000030,59
91,715485190010049,55
101,59313552100021,97
112,4938100200000040,10
122,616075150000028,67
132,513475150000029,84
144,1152408000009,72
153,116580160000025,67

and the same data looks like this in the original PDF:

I wonder if it would be possible to add some separator between individual fields (e.g. \t or similar) to preserve the structure of the data. For example, Apache PDFBox inserts spaces between these fields.

ted-marozzi commented 1 year ago

If you use the extractText method, there is a layoutText boolean. Maybe it does what you want.

    String lines = PdfTextExtractor(document).extractText(layoutText: true);
    print(lines);

I didn't test this; I'm just guessing. You may also want to look at this Linux utility: https://manpages.debian.org/experimental/poppler-utils/pdftotext.1.en.html

It has the most powerful PDF text extraction engine that I have found so far. I use it in an application by making an OS call and reading the resulting txt file. It has a -layout option for this purpose.
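The OS-call approach described above could look roughly like the following minimal sketch. It assumes the pdftotext binary from poppler-utils is installed and on PATH; the helper names pdftotextArgs and extractWithPdftotext are hypothetical, and the sketch reads the text from stdout (by passing "-" as the output file) rather than from a txt file.

```dart
import 'dart:io';

/// Builds the argument list for poppler's `pdftotext`.
/// `-layout` keeps the physical layout; a trailing `-` sends the text to stdout.
List<String> pdftotextArgs(String pdfPath, {bool layout = true}) =>
    [if (layout) '-layout', pdfPath, '-'];

/// Runs `pdftotext` and returns the extracted text.
/// Assumes the `pdftotext` binary is installed and on PATH.
Future<String> extractWithPdftotext(String pdfPath) async {
  final args = pdftotextArgs(pdfPath);
  final result = await Process.run('pdftotext', args);
  if (result.exitCode != 0) {
    throw ProcessException(
        'pdftotext', args, '${result.stderr}', result.exitCode);
  }
  return result.stdout as String;
}
```

Calling `await extractWithPdftotext('input.pdf')` would then return the layout-preserving text as a single string. Note that spawning a process like this is only an option on desktop or server targets, not in a browser.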

GowthamrajKumar25 commented 1 year ago

Hi ekuleshov,

It is not possible to extract lines with spaces using "extractTextLines" when the text on the same line was rendered at different times. To get the values in each line separately, please try the code snippet below and let us know whether it satisfies your requirement.

    List<TextLine> textlines = PdfTextExtractor(document).extractTextLines(startPageIndex: 0);
    for (TextLine line in textlines) {
      List<TextWord> textWords = line.wordCollection;
      for (TextWord word in textWords) {
        print(word.text);
      }
    }

To get the layout of the page, we can use "extractText":

    extractor.extractText(startPageIndex: 0, layoutText: true);

Please refer to the documentation link below: https://www.syncfusion.com/kb/11967/how-to-extract-text-from-a-pdf-file-in-syncfusion-flutter-pdf-library

Please let us know if you need any further assistance with this.

Regards, Gowthamraj K

ekuleshov commented 1 year ago

To get all the values in the lines separately, we can use below code snippet on your end and let us know if it is satisfies your requirement or not.

I tried your example and all text from these lines is returned as a single word, which is not what I can use for processing:

33,754355150010114,65
42,615280160000030,59
91,715485190010049,55
101,59313552100021,97
112,4938100200000040,10
122,616075150000028,67
132,513475150000029,84
144,1152408000009,72
153,116580160000025,67

It seems like the problem is that the PdfTextExtractor is gluing together all text on the same line that is rendered with the same font (or something like that). It needs to somehow take into account the location of the text and maybe insert some separator between each group.
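To illustrate the idea of taking text location into account, here is a minimal, self-contained sketch. It is not the library's implementation: Fragment is a hypothetical stand-in for a positioned run of text, and the minGap threshold is an arbitrary assumption that would need tuning per document.

```dart
/// Hypothetical stand-in for a positioned run of text on one line.
class Fragment {
  final String text;
  final double left; // x-coordinate of the left edge
  final double right; // x-coordinate of the right edge
  const Fragment(this.text, this.left, this.right);
}

/// Joins fragments left to right, inserting [separator] wherever the
/// horizontal gap between neighbours exceeds [minGap].
String joinByPosition(List<Fragment> fragments,
    {String separator = '\t', double minGap = 2.0}) {
  // Sort a copy so the caller's list is left untouched.
  final sorted = [...fragments]..sort((a, b) => a.left.compareTo(b.left));
  final buffer = StringBuffer();
  for (var i = 0; i < sorted.length; i++) {
    if (i > 0 && sorted[i].left - sorted[i - 1].right > minGap) {
      buffer.write(separator);
    }
    buffer.write(sorted[i].text);
  }
  return buffer.toString();
}
```

With this approach, fragments that nearly touch are glued into one value, while fragments separated by a visible column gap get a separator between them.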

senvB commented 1 year ago

I am facing the same issue when reading data from a table. The table row is concatenated into one TextWord. This makes it impossible to differentiate single words from the document and get their position. Using extractText and layout set to true adds at least some space. But when using extractTextLines this option is not available. Any chance to look into this issue?

irfanajaffer commented 1 year ago

We suspect that the issue may be specific to a particular PDF document.

Kindly provide us with the PDF that reproduces the issue so that we can analyze it further.

senvB commented 1 year ago

Thanks for picking this up. You can find a PDF which results in a single word per text line at this URL: https://www.dsab-vfs.de/VFSProject/WebObjects/VFSProject.woa/wa/rangListen?liga=5703&typ=partienplanPDF&saison=3304

ekuleshov commented 1 year ago

@irfanajaffer @GowthamrajKumar25 Any update on this?

For the file 20220828.pdf, when I use PdfTextExtractor(document).extractText(..) with the layoutText parameter set to false, it returns every column from the PDF on a new row:

But with the layoutText parameter set to true the result looks like this, i.e. all column data is crammed together and the extracted text gives no way to separate the columns:

114221,226,6918160,0000100,00   
214622,196,5795157,315998,32    
...

Either way it is hard to work with this data.

Looking at the PDF extractor code:

https://github.com/syncfusion/flutter-widgets/blob/4e8f8c73cfeaf2248ed05a88d7fe85d2e04c91eb/packages/syncfusion_flutter_pdf/lib/src/pdf/implementation/exporting/pdf_text_extractor/pdf_text_extractor.dart#L1430-L1436

If I change line 1432 to something like this:

resultantText += currentText! + wordSeparator; // add word separator such as '\t'

Then I get the following desired result.

1   142 21,22   6,6918  160,0000    100,00  
2   146 22,19   6,5795  157,3159    98,32   
...

Perhaps you could implement a simple fix and allow passing an optional wordSeparator argument to the PdfTextExtractor(document).extractText(..) method.
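Once a separator is present, splitting a row into fields becomes straightforward. The sketch below assumes the separator is '\t' and that, as in the proposed change, it is appended after every value (so each line ends with a trailing separator); splitFields is a hypothetical helper name.

```dart
/// Splits one extracted line into fields on [separator], dropping the
/// empty fields left behind by trailing separators.
List<String> splitFields(String line, {String separator = '\t'}) {
  final fields = line.split(separator).map((f) => f.trim()).toList();
  while (fields.isNotEmpty && fields.last.isEmpty) {
    fields.removeLast();
  }
  return fields;
}
```

For example, `splitFields('1\t142\t21,22\t')` yields the three fields `1`, `142` and `21,22`.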

ekuleshov commented 1 year ago

@GowthamrajKumar25 @irfanajaffer any chance this issue could be addressed with the above fix or enhancement option?

irfanajaffer commented 1 year ago

In the provided PDF document “20220828.pdf”, the PDF contains embedded data that cannot be laid out without retrieving the respective font data. So kindly use the extractTextLines() method to lay out the text instead of extractText(layoutText: true).

Kindly use the following simple code to lay out the text using the extractTextLines() method:

    PdfDocument document = PdfDocument(inputBytes: inputBytes);
    PdfTextExtractor extractor = PdfTextExtractor(document);
    String extractedText = '';
    List<TextLine> textLines = extractor.extractTextLines();
    for (int i = 0; i < textLines.length; i++) {
      TextLine line = textLines[i];
      // Prefix each word with a space so adjacent words stay separated.
      for (int j = 0; j < line.wordCollection.length; j++) {
        extractedText += ' ${line.wordCollection[j].text}';
      }
      // Break the output line when the next text line sits at a different vertical position.
      if (i + 1 < textLines.length &&
          (line.bounds.top - textLines[i + 1].bounds.top).abs() > 1) {
        extractedText += '\r\n';
      }
    }
    print(extractedText);
    document.dispose();

In the document “Spielplan.pdf”, we confirmed the issue “Extract text line words are not split properly for specific document” as a defect in our product. The fix will be included in our upcoming weekly release, which will be available on March 28, 2023.

Use the below feedback link to track the status of the reported bug.

https://www.syncfusion.com/feedback/42104/extract-text-line-words-are-not-split-properly-for-specific-document

Disclaimer: “Inclusion of this solution in the weekly release may change due to other factors including but not limited to QA checks and works reprioritization.”

ekuleshov commented 1 year ago

@irfanajaffer I tried your suggestion. It does work on the example I posted above, but it still does not produce separated fields in several other examples I tried.

In comparison, the change I suggested above does produce separated text values in all my test PDFs.

Unfortunately those PDFs have some PII data and I'm not comfortable sharing them in public. But if it would help to address this issue I can email them to you.

On a side note, when trying to extract text from PDFs generated with the Ghostscript printer, the extracted text has 0x00 values after every character. I can clean it up in post-processing, but it would be nice if the PDF library did that.
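The post-processing cleanup mentioned above can be as small as the following sketch; stripNulChars is a hypothetical helper name, not part of the library.

```dart
/// Removes NUL (0x00) characters from extracted text.
String stripNulChars(String text) => text.replaceAll('\u0000', '');
```

It would be applied to the string returned by the extractor before any further parsing.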

irfanajaffer commented 1 year ago

The reported problem “Extract text line words are not split properly for the specific document” is resolved in our latest Flutter package release, version 21.1.37.

Please refer to the package below:

https://pub.dev/packages/syncfusion_flutter_pdf/versions/21.1.37

@ekuleshov, we understand that the security and confidentiality of your files are of utmost importance to you, and we want to assure you that your files are safe with us.

We have robust security measures in place to ensure that your files are protected from unauthorized access, theft, or loss. Additionally, we adhere to strict privacy policies and industry best practices to safeguard your data.

Once the analysis is completed, we will promptly delete your files from our systems. We do not retain any customer data beyond what is necessary to complete the requested service.

We take great pride in our commitment to providing secure and reliable services to our customers. Should you have any further questions or concerns, please do not hesitate to contact us.

ekuleshov commented 1 year ago

@irfanajaffer I will check my PDF files with the last release. Thank you.

Regarding sharing them with you, I don't have concerns with you retaining them for testing, but do have concerns with attaching them to this public issue tracker. Like I said, if it would help to diagnose issues, I can email them to you, but can't post/attach them here.

irfanajaffer commented 1 year ago

Thank you for your willingness to share the files with me for testing purposes. I completely understand your concerns about attaching them to a public issue tracker, and I appreciate your offering to email them to me instead. Rest assured that any files you share with me will be kept confidential and used solely for the purpose of diagnosing and resolving the issues you have reported.

Please send the files to my email address [support@syncfusion.com] at your earliest convenience, and I will let you know once I have received them.

Thank you again for your cooperation and for helping us improve our service.

syncfusion / flutter-widgets

PDF text extraction issue #775