unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

[BUG] Incorrect text extraction for text with kerning #524

Closed falgonua closed 8 months ago

falgonua commented 10 months ago

Description

The text extraction feature cannot extract text that has kerning. The problem reproduces in at least versions v3.42.0 and v3.49.0.

Expected Behavior

Extract text from text with kerning (text_with_kerning.pdf file)

Actual Behavior

Steps to reproduce the behavior:

  1. Run the text extraction example https://github.com/unidoc/unipdf-examples/blob/master/extract/pdf_extract_text.go on text_with_kerning.pdf

Attachments

Test files: text_with_kerning.pdf text_without_kerning.pdf

Debug: image

Example output:

yaroslav@4492:~/sources/unipdf_test$ ./test text_with_kerning.pdf 
--------------------
PDF to text extraction:
--------------------
------------------------------
Page 1:
"                                                                                              

"
------------------------------
yaroslav@4492:~/sources/unipdf_test$ ./test text_without_kerning.pdf 
--------------------
PDF to text extraction:
--------------------
------------------------------
Page 1:
"Date___________
Signature_______
Information______________________

"
------------------------------
yaroslav@4492:~/sources/unipdf_test$ 

3ace commented 10 months ago

@falgonua Upon inspection, it turns out that the font resource in text_without_kerning.pdf doesn't have the ToUnicode attribute used to decode the text, so decoding failed during extraction.

Did you generate this document using unipdf or another tool? If it was generated using unipdf, could you provide us with the script and the font file you used to generate it?

falgonua commented 10 months ago

@3ace Hi. Thanks for helping. I thought the problem was with kerning: I compared the raw data in ex.ExtractPageText(), and the only difference between these files was the kerning. The original file was created by LibreOffice and then flattened with Ghostscript 10.01.1. It looks like the problem appears after flattening. I tried parsing text_without_kerning.pdf with a few online parsers, and they extract this text. (https://products.aspose.app/pdf/parser, https://www.extractpdf.com/, https://www.pdfforge.org/online/en/extract-text)

3ace commented 10 months ago

Hi @falgonua thanks for the update.

Unfortunately, unipdf currently needs that encoding information to decode the text properly. We'll definitely take this issue into consideration for future updates.

joinimran commented 10 months ago

Hi @falgonua - Thank you for reporting this issue. As @3ace mentioned, we will definitely look into this and consider it for future updates. If getting this fixed is a priority for you, please let me know. In that case, you may need to upgrade to a paid version of the library, at minimum the Business Tier. Alternatively, if you want to prioritize this, you can sponsor the feature.

As of now, this is a lower priority, and we have added it to our feature development list. There is no ETA for the release, but as soon as we release this feature, we will update you.

Thanks. Imran, Customer Success Manager, UniDoc ehf.

falgonua commented 10 months ago

Hi @joinimran. Thank you for the information. Our company uses a Business Standard license. From time to time, one of our features fails because of this bug.

joinimran commented 10 months ago

@falgonua - Can you please share your company name? You should have access to our Jira ServiceDesk Portal. All our business and enterprise customers log their support tickets via Jira Service Desk. Can you please send an email from your official email address to support@unidoc.io?

From there on, I will assign this issue to our development team to look into. Thanks

falgonua commented 10 months ago

airSlate company

joinimran commented 10 months ago

Thanks for sharing the details. Our development team is looking into this issue and we will get back to you soon. In the meantime, can you please email support@unidoc.io to get access to our Jira Service Desk, so that you can easily report issues in the future?

3ace commented 9 months ago

@falgonua FYI, a fix for this issue should be included in the next update. Let us know if you still have an issue with it afterwards.

Thanks.

sampila commented 8 months ago

Hi @falgonua,

We have released a new version of UniPDF, https://github.com/unidoc/unipdf/releases/tag/v3.51.0, to solve this issue.

We are closing this issue for now. You can reopen it if you still have problems after updating to the new UniPDF version.

falgonua commented 8 months ago

Thanks a lot, guys. The bug is fixed.