izderadicka / pdfparser

Python binding to libpoppler with focus on text extraction

Segmentation fault #25

Closed by nghiapq77 4 years ago

nghiapq77 commented 4 years ago
import pdfparser.poppler as pdf
doc = pdf.Document(b'40.pdf')
for page in doc:
    for flow in page:
        for block in flow:
            for line in block:
                print(line.text)

40.pdf

And I got "Segmentation fault (core dumped)". Any help would be appreciated, thank you.

Edit 1: I think this has something to do with the font.

izderadicka commented 4 years ago

Please provide your platform, Python version, and libpoppler version. Is this document the only one causing the problem?

nghiapq77 commented 4 years ago

Please provide your platform, Python version, and libpoppler version. Is this document the only one causing the problem?

So far it's the only one. I'm using Ubuntu 18.04, Python 3.6.9, and libpoppler 0.62.0.

izderadicka commented 4 years ago

Looks like it's a regression from the latest commit, c6357b2; the previous commit seems to work on this file. The latest version segfaults on the given file just after the "15. SHIPPING MARK" header. The text below that header uses a strange font, so it could be related to a font issue, as suggested.

I will need to look into it in more detail.

In the meantime, can you please check whether the previous commit works for you too?

@DainisGorbunovs - can you please check whether the file referred to above fails in your build too? Thanks.
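
One way to run that check without losing the calling interpreter is to do the extraction in a child process and inspect its exit status. The following is only a sketch under assumptions: `EXTRACT_SNIPPET`, `crashed_with_segfault`, and `extraction_segfaults` are hypothetical names, not part of pdfparser, and the negative return code convention applies to POSIX systems.

```python
import signal
import subprocess
import sys

# Same extraction loop as in the report, run in a child interpreter so a
# crash in the native code cannot take down the calling process.
EXTRACT_SNIPPET = """
import sys
import pdfparser.poppler as pdf
doc = pdf.Document(sys.argv[1].encode())
for page in doc:
    for flow in page:
        for block in flow:
            for line in block:
                line.text
"""


def crashed_with_segfault(returncode: int) -> bool:
    """On POSIX, a child killed by a signal reports the negated signal number."""
    return returncode == -signal.SIGSEGV


def extraction_segfaults(pdf_path: str) -> bool:
    """Return True if extracting text from pdf_path kills the child with SIGSEGV."""
    result = subprocess.run(
        [sys.executable, "-c", EXTRACT_SNIPPET, pdf_path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return crashed_with_segfault(result.returncode)
```

After building a given commit, calling `extraction_segfaults("40.pdf")` should indicate whether that build crashes on the file. Any other failure, such as an import error, shows up as a non-segfault exit code and returns False.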

izderadicka commented 4 years ago

OK, so the problem occurred when libpoppler returns NULL as the font name. The latest change did not handle that case properly, so we got a segfault. I've tried to fix it in fa04b07; please confirm that it solves your issue.
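
To illustrate the failure mode, here is a pure-Python sketch of the kind of guard involved. It is not the code from fa04b07: the real fix lives in the native binding and may look different. `font_name_or_none` is a hypothetical helper, and `None` stands in for the NULL pointer that libpoppler can return.

```python
from typing import Optional


def font_name_or_none(raw: Optional[bytes]) -> Optional[str]:
    """Decode a font name that may be missing.

    Hypothetical helper: `raw` stands in for the C string returned by
    libpoppler, with None playing the role of a NULL pointer.
    """
    if raw is None:
        # Without an equivalent guard, native code would dereference NULL.
        return None
    # Replace undecodable bytes instead of raising on odd font names.
    return raw.decode("utf-8", errors="replace")


print(font_name_or_none(b"Helvetica"))
print(font_name_or_none(None))
```

The point of the sketch is that a missing font name becomes an ordinary value the caller can test for, rather than something the binding tries to convert unconditionally.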

DainisGorbunovs commented 4 years ago

Thanks for reporting this issue, and thanks for tagging me.

I was looking into it, and can confirm that the regression is due to c6357b2.

@izderadicka: your commit fa04b07 successfully fixes the issue, and I am able to parse 40.pdf.