izderadicka / pdfparser

Python binding to libpoppler with focus on text extraction

python process getting crashed on a particular pdf #8

Closed aloknayak29 closed 6 years ago

aloknayak29 commented 6 years ago

Hi. In some cases the library crashes the whole process without raising any error or exception. For example, with the code below the process crashed on a certain page (e.g. page 25):

    # Python 2 syntax
    from pdfparser import poppler

    d = poppler.Document('sample.pdf')
    for i in range(1, d.no_of_pages + 1):
        page = d.get_page(i)
        print i
        textlines = [(tline.text.encode('UTF-8'), tline.bbox.as_tuple())
                     for flow in page for block in flow for tline in block]

The same file also crashes Ubuntu's default PDF viewer when I try to open it, although Gmail's PDF viewer is able to open it.

oplahcinski commented 6 years ago

Can you upload the file and the full stack traceback?

What versions of python, poppler lib, cython do you have installed?

aloknayak29 commented 6 years ago

Trial.1_dp.pdf

No stack traceback. No exception or error was raised. It would be nice if this could throw an error or exception.

Python 2.7.14 |Anaconda, Inc.
poppler version 0.41.0
Cython==0.26.1

izderadicka commented 6 years ago

If it fails without a stack trace then the problem is probably in libpoppler. Try the latest libpoppler, as 0.41 is a bit old. I'll try the document in my environment.
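When a crash leaves no Python traceback, the interpreter can at least be asked to dump one when a fatal signal arrives. A minimal sketch, assuming Python 3, where `faulthandler` is part of the standard library (the reporter is on Python 2.7, where a backport package of the same name would be needed); the log file and the `crash_log` name are only illustrative:

```python
import faulthandler
import tempfile

# The file must stay open for as long as the handler is enabled,
# so it is deliberately not wrapped in a "with" block.
crash_log = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)

# On SIGSEGV and similar fatal signals, Python writes the current
# Python-level traceback to this file before the process dies.
faulthandler.enable(file=crash_log)

# ... run the page extraction loop here ...
```

This does not prevent the crash. It only records which Python line was executing, which helps narrow down the failing call before reaching for gdb.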

izderadicka commented 6 years ago

Confirming a segfault in libpoppler. I did not use the latest version, because they changed the build process to CMake (I will have to change the build script later), but I used the fairly recent 0.59 from September and got this error:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff685fcc4 in TextWord::getColor (b=<optimized out>, g=<optimized out>, r=<optimized out>, this=<optimized out>) at ./poppler_src/poppler/TextOutputDev.h:174
174     { *r = colorR; *g = colorG; *b = colorB; }

This is the complete stack trace:

#0  0x00007ffff685fcc4 in TextWord::getColor (b=<optimized out>, g=<optimized out>, r=<optimized out>, this=<optimized out>) at ./poppler_src/poppler/TextOutputDev.h:174
#1  __pyx_pf_9pdfparser_7poppler_4Line_4_get_text (__pyx_v_self=<optimized out>) at pdfparser/poppler.cpp:8548
#2  __pyx_pw_9pdfparser_7poppler_4Line_5_get_text (__pyx_v_self=<optimized out>, unused=<optimized out>) at pdfparser/poppler.cpp:8312
#3  0x00007ffff68616ae in __Pyx_PyObject_CallMethO (arg=0x0, func=<optimized out>) at pdfparser/poppler.cpp:12066
#4  __Pyx_PyObject_CallNoArg (func=<optimized out>) at pdfparser/poppler.cpp:12091
#5  0x00007ffff6861af8 in __pyx_pf_9pdfparser_7poppler_4Line_2__init__ (__pyx_v_block=<optimized out>, __pyx_v_self=0x7ffff7eca670) at pdfparser/poppler.cpp:8238
#6  __pyx_pw_9pdfparser_7poppler_4Line_3__init__ (__pyx_v_self=0x7ffff7eca670, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at pdfparser/poppler.cpp:8115
#7  0x00000000004b670c in ?? ()
#8  0x00007ffff6862a83 in __Pyx_PyObject_Call (kw=0x0, arg=0x7ffff7e8a9d0, func=0x7ffff6a6de80 <__pyx_type_9pdfparser_7poppler_Line>) at pdfparser/poppler.cpp:11926
#9  __pyx_pf_9pdfparser_7poppler_5Block_4__next__ (__pyx_v_self=0x7ffff7f60170) at pdfparser/poppler.cpp:3817
#10 __pyx_pw_9pdfparser_7poppler_5Block_5__next__ (__pyx_v_self=0x7ffff7f60170) at pdfparser/poppler.cpp:3757
#11 0x00000000004c4c6f in PyEval_EvalFrameEx ()
#12 0x00000000004c2765 in PyEval_EvalCodeEx ()
#13 0x00000000004c2509 in PyEval_EvalCode ()
#14 0x00000000004f1def in ?? ()
#15 0x00000000004ec652 in PyRun_FileExFlags ()
#16 0x00000000004eae31 in PyRun_SimpleFileExFlags ()
#17 0x000000000049e14a in Py_Main ()
#18 0x00007ffff7810830 in __libc_start_main (main=0x49dab0 <main>, argc=3, argv=0x7fffffffdb28, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdb18)
    at ../csu/libc-start.c:291
#19 0x000000000049d9d9 in _start ()

And this is the C++ code generated by Cython:

    /* "pdfparser/poppler.pyx":465
 * 
 *                 self._bboxes.append(last_bbox)
 *                 w.getColor(&r, &g, &b)             # <<<<<<<<<<<<<<
 *                 last_font=FontInfo(w.getFontName(i).getCString().decode('UTF-8'),
 *                                    w.getFontSize(),
 */
      __pyx_v_w->getColor((&__pyx_v_r), (&__pyx_v_g), (&__pyx_v_b));

As of now I have no idea what's wrong there; any hints are welcome.

izderadicka commented 6 years ago

The problem is in w.getColor and w.getFont..., which for some reason cause a segfault in this case. I pushed a branch, issue_8, where those calls are commented out. pdfparser now works, but color and font info are not available (dummy values are returned).

gitamadb commented 6 years ago

Hi, how can I help? Is it possible to catch the error and return dummy values only in case of a segfault?

izderadicka commented 6 years ago

No. Segfault = segmentation fault, meaning that the code has performed an illegal operation (usually accessing memory outside the program's address space) and the OS terminates it immediately, so there is no chance to recover from it. The problem is somewhere in the libpoppler C++ code, so the solution would be to find out exactly why it is failing and report the issue to its maintainers.

gitamadb commented 6 years ago

That's what I thought too, but there seem to be ways to catch a segmentation fault on Linux: https://stackoverflow.com/questions/2350489/how-to-catch-segmentation-fault-in-linux.

There seems to be a mailing list here : https://lists.freedesktop.org/mailman/listinfo/poppler. I'll try to post the issue, and take a look at the C++ libpoppler code.
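One way to keep a segfault from taking down the calling program, without trying to recover inside the crashed process, is to run the risky work in a child process and inspect how it ended. A minimal sketch, assuming Python 3.7+ on a POSIX system; `run_isolated` and `CRASHING_SNIPPET` are hypothetical names, and the snippet only simulates the crash by sending SIGSEGV to itself instead of calling pdfparser:

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run a Python snippet in a child interpreter and report how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the signal
        # that killed the child (for example -11 for SIGSEGV).
        return False, signal.Signals(-proc.returncode).name
    # Otherwise report success or failure together with the child's output.
    return proc.returncode == 0, proc.stdout


# Stand-in for the crashing call: the child sends SIGSEGV to itself.
CRASHING_SNIPPET = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

ok, detail = run_isolated(CRASHING_SNIPPET)
if not ok:
    # The parent is still alive and could fall back to dummy values here.
    print("child failed:", detail)
```

The child still dies, in line with the explanation above, but the parent can detect the signal and decide what to do for that page or file.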

mhammerc commented 6 years ago

Hello!

I just checked the code, searching for a fix. I am on macOS, so I installed poppler through brew and got Poppler 0.62.0.

The first thing I noticed is that the header files are completely different between 0.62.0 and their master branch on GitHub.

File https://github.com/danigm/poppler/blob/master/poppler/TextOutputDev.h from master branch:

140 |  TextFontInfo *getFontInfo() { return font; }
149 |  GooString *getFontName() { return font->fontName; }

File TextOutputDev.h from 0.62.0:

163 |  TextFontInfo *getFontInfo(int idx) { return font[idx]; }
172 |  GooString *getFontName(int idx) { return font[idx]->fontName; }

It looks like in older versions each character in a word has its own assigned font, while in the newer version all characters of a word share the same font. This will need to change if you want to upgrade.

I traced the segfault, and it appears the font name does not exist. The segfault doesn't come from poppler itself but from pdfparser code. It may be triggered by a bug in poppler, but it happens in the pdfparser code. On the sample PDF that @aloknayak29 gave us, poppler isn't able to read the fonts properly. As a result, everything else is fine but the font name is NULL.

On line 473 https://github.com/izderadicka/pdfparser/blob/master/pdfparser/poppler.pyx#L473 :

last_font=FontInfo(w.getFontName(i).getCString().decode('UTF-8', 'replace'),

w.getFontName(i) returns a NULL pointer, and naturally we cannot call getCString() on NULL.

On the issue_8 branch, you disabled font_name and set it to unknown in all cases. A more comprehensive fix would be this one: https://github.com/mhammerc/pdfparser/commit/0ad94307da7942417633aaf683223c347b52f461 On regular PDF files the font name is still correct; on buggy PDF files the font name will be unknown. This fix works for the buggy example given above. I'll let you try more buggy examples if you have some.
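The idea behind such a guard can be sketched in plain Python, with `None` standing in for the NULL pointer. This is not the code from the commit; `safe_font_name` is a hypothetical helper that only illustrates the check-before-dereference pattern:

```python
def safe_font_name(raw_name, fallback="unknown"):
    """Decode a font name, or return a placeholder when there is none.

    raw_name models the bytes behind getFontName(i).getCString();
    None models the NULL pointer seen with the sample PDF.
    """
    if raw_name is None:
        # Never touch the missing name; report the placeholder instead.
        return fallback
    # Undecodable bytes are replaced rather than raising an exception.
    return raw_name.decode("UTF-8", "replace")


print(safe_font_name(b"Helvetica"))
print(safe_font_name(None))
```

In the Cython source the equivalent check would have to happen on the pointer itself, before getCString() is called.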

The real question is: is the NULL font name a bug in poppler, or does the font really have no name? If it is actually a bug in poppler, this fix is only a workaround.

Anyway that "unknown" font is still pertinent and may be useful to some!

izderadicka commented 6 years ago

Thanks to @mhammerc this should be fixed now.