Closed Karmel0x closed 4 years ago
It seems that there are glyphs in use which are not part of PostScriptGlyphs.php
. Because this functionality is from @Connum, lets see what (s)he has to say.
It would be really great to have a sample, otherwise I'm not sure how to debug this properly. Maybe you can remove any sensitive content and just leave in a few sentences or paragraphs that cause the issue?
Also, could you please try whether the issue happens with this branch as well? https://github.com/skyfms/pdfparser/tree/fix_encoding Because I didn't implement the functionality originally, but rather adapted it to integrate it into the current state of the library. That way I could see if I introduced the breaking change myself or if the original code was already faulty.
Thought after editing (I am not the author of this file) problem will disappear but it's still there. I've deleted some pages, so I can provide PDF now: cv_Zj.pdf
Problem occurs in this branch too.
Can we add the PDF file (https://github.com/smalot/pdfparser/files/5429471/cv_Zj.pdf) to our test environment? Is it free of charge and has no obligations?
I am not the author but I've removed sensitive informations, so.. probably.
@Connum: #360 is another issue which reports a problem with PostScriptGlyphs
, which was added lately. What needs to be done to fix the underlying problem?
@Connum: #360 is another issue which reports a problem with
PostScriptGlyphs
, which was added lately. What needs to be done to fix the underlying problem?
Sorry for not responding, I'm quite busy at the moment with several work projects in parallel, but I haven't forgotten about the issue and will look at it as soon as I find the time!
Finally got around looking into this. I just created PR #362 to handle this issue. The PDF content appears to be some kind of legal text or at least a factual report and also seems to be a small enough excerpt, so I don't see why we couldn't use it for a test case (which I did).
Please note the following, which I also added as a comment in the related test case:
Note that the "ł" in przepływu is decoded as a space character. This was already
the case before the PR that caused this issue and is not currently covered by this
test case. However, this issue should be addressed in the future and its fix
can then be incorporated into this test by uncommenting the third assertion below.
@Karmel0x can you test again with the patch @Connum provided? Do you approve the fix?
If it's reading provided pdf then it's fine and it looks like it is, original one too.
Unfortunately I can't provide sample PDF this time. Output is correct before commit 1f4056d.
Output with latest commit / release:
text:
ż ś ą ą żą ń ą ą ą ą ą ą ą ą ę ś ę ń ą ą ą ń ę ż ą ś ś ż ą ą ż ą ą ą ż ą ę ą ń ż ń ą ą ś ą ą ą ę ż ęż ń ę ę ą ą ą Ś Ś ą ę ż żą ż ż ż ą ś ąś ń ń ń ń ą ę ś ść Ś ść ść ść ść ś ść ę ść ą ść ść ść ść ę ś ż ę ą ą ą