smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.42k stars 538 forks source link

Changes are breaking something commit 1f4056d #359

Closed Karmel0x closed 4 years ago

Karmel0x commented 4 years ago

Unfortunately I can't provide sample PDF this time. Output is correct before commit 1f4056d.

Output with latest commit / release:

Notice: Undefined offset: 67 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 117 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 114 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 114 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 105 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 99 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 117 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 108 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 117 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 109 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 32 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 86 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 105 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 116 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 97 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 101 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 32 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 48 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 55 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 46 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 48 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091
Notice: Undefined offset: 53 in pdfparser\src\Smalot\PdfParser\Encoding\PostScriptGlyphs.php on line 1091

... more than 5000 similary errors here

text:

ż ś ą ą żą ń ą ą ą ą ą ą ą ą ę ś ę ń ą ą ą ń ę ż ą ś ś ż ą ą ż ą ą ą ż ą ę ą ń ż ń ą ą ś ą ą ą ę ż ęż ń ę ę ą ą ą Ś Ś ą ę ż żą ż ż ż ą ś ąś ń ń ń ń ą ę ś ść Ś ść ść ść ść ś ść ę ść ą ść ść ść ść ę ś ż ę ą ą ą

k00ni commented 4 years ago

It seems that there are glyphs in use which are not part of PostScriptGlyphs.php. Because this functionality is from @Connum, lets see what (s)he has to say.

Connum commented 4 years ago

It would be really great to have a sample, otherwise I'm not sure how to debug this properly. Maybe you can remove any sensitive content and just leave in a few sentences or paragraphs that cause the issue?

Connum commented 4 years ago

Also, could you please try whether the issue happens with this branch as well? https://github.com/skyfms/pdfparser/tree/fix_encoding Because I didn't implement the functionality originally, but rather adapted it to integrate it into the current state of the library. That way I could see if I introduced the breaking change myself or if the original code was already faulty.

Karmel0x commented 4 years ago

Thought after editing (I am not the author of this file) problem will disappear but it's still there. I've deleted some pages, so I can provide PDF now: cv_Zj.pdf

Problem occurs in this branch too.

k00ni commented 4 years ago

Can we add the PDF file (https://github.com/smalot/pdfparser/files/5429471/cv_Zj.pdf) to our test environment? Is it free of charge and has no obligations?

Karmel0x commented 4 years ago

I am not the author but I've removed sensitive informations, so.. probably.

k00ni commented 4 years ago

@Connum: #360 is another issue which reports a problem with PostScriptGlyphs, which was added lately. What needs to be done to fix the underlying problem?

Connum commented 4 years ago

@Connum: #360 is another issue which reports a problem with PostScriptGlyphs, which was added lately. What needs to be done to fix the underlying problem?

Sorry for not responding, I'm quite busy at the moment with several work projects in parallel, but I haven't forgotten about the issue and will look at it as soon as I find the time!

Connum commented 4 years ago

Finally got around looking into this. I just created PR #362 to handle this issue. The PDF content appears to be some kind of legal text or at least a factual report and also seems to be a small enough excerpt, so I don't see why we couldn't use it for a test case (which I did).

Please note the following, which I also added as a comment in the related test case:

Note that the "ł" in przepływu is decoded as a space character. This was already
the case before the PR that caused this issue and is not currently covered by this
test case. However, this issue should be addressed in the future and its fix
can then be incorporated into this test by uncommenting the third assertion below.
k00ni commented 4 years ago

@Karmel0x can you test again with the patch @Connum provided? Do you approve the fix?

Karmel0x commented 4 years ago

If it's reading provided pdf then it's fine and it looks like it is, original one too.