Closed ryankilroy closed 3 months ago
Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/
Hi @ryankilroy, thank you for reporting this issue. We were able to reproduce it using the sample code and sample file you provided, and we are currently investigating the cause. We will post an update as soon as we identify the source of the issue and a fix.
Hi @ryankilroy, after some investigation, we found that the issue is in the ToUnicode map provided in the document. It has an invalid code point for the character code that represents the missing letter (l). The reason other tools were able to extract the correct character is that they fell back on the Replacement Text data provided as part of the marked content. Currently, our extractor doesn't implement this feature, which is why it took the invalid code point (which, incidentally, is in the Private Use Area of Unicode) and extracted it as valid text. We plan to incorporate this feature in the future and will post an update on this ticket upon its release.
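As a rough illustration of the symptom described above, the sketch below scans already-extracted text for code points in the Unicode Private Use Area, which is one way to notice that a ToUnicode map produced an unusable mapping. It is not how unipdf works internally. The `findPrivateUse` helper, the `flagged` type, and the U+E000 sample value are hypothetical and chosen only for illustration, since the ticket does not state which code point the document actually used.

```go
package main

import (
	"fmt"
	"unicode"
)

// flagged records one suspicious rune found in extracted text.
// Hypothetical type for this sketch; not part of unipdf.
type flagged struct {
	Index int  // rune index (not byte offset) within the text
	Rune  rune // the Private Use Area code point that was found
}

// findPrivateUse returns every rune in s that belongs to Unicode
// general category Co (private use). Such runes have no standard
// meaning, so they usually indicate a bad font-to-Unicode mapping.
func findPrivateUse(s string) []flagged {
	var out []flagged
	i := 0
	for _, r := range s {
		if unicode.Is(unicode.Co, r) {
			out = append(out, flagged{Index: i, Rune: r})
		}
		i++
	}
	return out
}

func main() {
	// Simulated extractor output: every "l" was mapped to U+E000
	// (an assumed value, used here only as an example).
	text := "he\uE000\uE000o wor\uE000d"
	for _, f := range findPrivateUse(text) {
		fmt.Printf("rune %d: U+%04X\n", f.Index, f.Rune)
	}
}
```

A check like this only detects the problem. Recovering the intended letter still requires a fallback source such as the replacement text carried by the marked content.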
Regarding your second issue, i.e., font extraction: it fails because there are no fonts on pages 3 and beyond (those pages are scanned). However, the error message is not informative enough to convey this. We will update this one too.
Hi @ryankilroy, this issue is fixed in the new release (v3.60.0), which can be found here: https://github.com/unidoc/unipdf/releases/tag/v3.60.0. Closing this ticket as fixed.
Description
When I attempt to extract the text from a pdf with certain embedded fonts, it returns some missing rune characters. The fonts don't seem to throw errors on the first page (which still has missing runes), but when I attempt to extract the fonts from the later pages in the pdf, I get some "Can't convert font object, invalid type" errors.
Expected Behavior
I expect to be able to extract usable text from the pdf
Actual Behavior
Extracting text from the pdf results in missing runes
Steps to reproduce the behavior:
If you instead run
pdftotext <file.pdf> -
against it, the text is fully readable.
Attachments
Sample PDF.pdf
Examples
There are more missing runes in areas of the actual pdf, but I couldn't replicate them with the anonymized data. Here are some of the examples