Closed: falgonua closed this issue 8 months ago
@falgonua Upon inspection, it turns out that the font resource in text_without_kerning.pdf doesn't have a ToUnicode attribute, which is used to decode the text, so the text could not be decoded during extraction.
Did you generate this document using unipdf or another tool? If it was generated using unipdf, could you provide us with the script and the font file you used to generate it?
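To illustrate why a missing ToUnicode attribute matters, here is a minimal Go sketch of the general idea, not unipdf's actual implementation. The `toUnicodeMap` type and `decode` function are hypothetical names introduced only for this example, and the sketch assumes single-byte character codes (real fonts may use multi-byte codes). It shows that the same character codes decode to readable text when a mapping is present and cannot be decoded when it is absent.

```go
package main

import (
	"fmt"
	"strings"
)

// toUnicodeMap is a hypothetical stand-in for a parsed ToUnicode CMap:
// it maps a font's character codes to Unicode code points.
// Assumption: codes are a single byte each, which is a simplification.
type toUnicodeMap map[byte]rune

// decode converts raw character codes into a string using the mapping.
// Codes with no mapping become U+FFFD (the replacement character), and
// ok reports whether every code could be mapped.
func decode(codes []byte, m toUnicodeMap) (text string, ok bool) {
	var sb strings.Builder
	ok = true
	for _, c := range codes {
		r, found := m[c] // reading from a nil map is safe in Go
		if !found {
			r = '\uFFFD'
			ok = false
		}
		sb.WriteRune(r)
	}
	return sb.String(), ok
}

func main() {
	// Character codes as they might appear in a content stream for a
	// subset font; they are not ASCII values.
	codes := []byte{0x01, 0x02, 0x03}

	// Font with a ToUnicode mapping: the codes decode to text.
	withMap := toUnicodeMap{0x01: 'P', 0x02: 'D', 0x03: 'F'}
	text, ok := decode(codes, withMap)
	fmt.Printf("with ToUnicode:    %q ok=%v\n", text, ok)

	// Font without a ToUnicode mapping: nothing can be decoded.
	text, ok = decode(codes, nil)
	fmt.Printf("without ToUnicode: %q ok=%v\n", text, ok)
}
```

Other parsers may still recover text from such fonts by falling back to other encoding information, which would be consistent with the online parsers mentioned later in this thread succeeding on the same file.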
@3ace Hi. Thanks for helping. I thought the problem was with kerning. I compared the raw data from ex.ExtractPageText() and the only difference on such files was in kerning. The original file was created by LibreOffice and then flattened by Ghostscript 10.01.1. It looks like the problem appears after flattening.
I tried parsing text_without_kerning.pdf with a few online parsers and they do extract this text. (https://products.aspose.app/pdf/parser, https://www.extractpdf.com/, https://www.pdfforge.org/online/en/extract-text)
Hi @falgonua, thanks for the update.
Unfortunately, unipdf currently needs that encoding info to properly decode the text. We'll definitely take this issue into consideration for future updates.
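As a quick way to check whether a file might be affected, the following Go sketch scans the raw bytes of a PDF for the `/ToUnicode` key. This is a rough heuristic using only the standard library, not a unipdf feature: the function name `mentionsToUnicode` is hypothetical, the check does not tell you which font has the entry, and it will miss entries stored inside compressed object streams.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// mentionsToUnicode reports whether the raw PDF bytes contain the
// /ToUnicode key anywhere. It is a heuristic: it does not parse the
// file and cannot see inside compressed object streams.
func mentionsToUnicode(pdf []byte) bool {
	return bytes.Contains(pdf, []byte("/ToUnicode"))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: go run main.go <file.pdf>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read error:", err)
		os.Exit(1)
	}
	if mentionsToUnicode(data) {
		fmt.Println("found at least one /ToUnicode entry")
	} else {
		fmt.Println("no uncompressed /ToUnicode entry found")
	}
}
```

Running it against both test files attached to this issue would be one way to compare them, although a proper check would need a PDF parser that inspects each font dictionary.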
Hi @falgonua - Thank you for reporting this issue. As @3ace has mentioned, we will definitely look into this and consider it for future updates. If getting this fixed is a priority for you, please let me know. In that case, you may need to upgrade to a paid version of the library, at minimum the Business Tier. Alternatively, if you want to prioritize this, you can sponsor this feature.
As of now, this is a lower priority and we have added it to our feature development list. There is no ETA for the release, but as soon as we release this feature, we will update you.
Thanks. Imran, Customer Success Manager, UniDoc ehf.
Hi @joinimran. Thank you for the information. Our company uses a Business Standard license. From time to time, one of our features doesn't work because of this bug.
@falgonua - Can you please share your company name? You should have access to our Jira Service Desk portal. All our business and enterprise customers log their support tickets via Jira Service Desk. Can you please send an email from your official email address to support@unidoc.io?
From there on, I will assign this to our development team to look into the issue. Thanks
airSlate company
Thanks for sharing the details. Our development team is looking into this issue and we will get back to you soon. In the meantime, can you please email support@unidoc.io to get access to our Jira Service Desk, so that in future you can easily report an issue?
@falgonua FYI, a fix for this issue should be included in the next update. Let us know later if you still have an issue with it.
Thanks.
Hi @falgonua,
We have released a new version of UniPDF, https://github.com/unidoc/unipdf/releases/tag/v3.51.0, to solve this issue.
We are closing this issue for now. You can re-open it if you are still having problems after updating to the new UniPDF version.
Thanks a lot guys, the bug was fixed.
Description
The text extraction feature couldn't extract text that has kerning. The problem is reproduced on at least versions v3.42.0 and v3.49.0.
Expected Behavior
Extract text from text with kerning (text_with_kerning.pdf file)
Actual Behavior
Steps to reproduce the behavior:
Attachments
Test files: text_with_kerning.pdf text_without_kerning.pdf
Debug:![image](https://github.com/unidoc/unipdf/assets/38418746/04b00e52-652f-4507-945a-c832d102430a)
Example output: