smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Trying to access array offset on value of type null (PDFObject.php line 795) #691

Closed: iGrog closed this 1 month ago

iGrog commented 3 months ago

Description:

The exception "Trying to access array offset on value of type null" was thrown at PDFObject.php line 795, where $current_position_cm is null.

PDF input

I'm not allowed to post the PDF publicly, but I can share it privately.

Expected output & actual output

Expected output: the extracted text.
Actual output: an exception was thrown.

Code

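    // Assumes the Composer autoloader for smalot/pdfparser is already loaded.
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($pathToPDF);
    $texts = $pdf->getText();

GreyWyvern commented 3 months ago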

My guess would be an unbalanced set of q and Q commands in the document stream causing this. But I've been wrong before! @iGrog, can you please send the offending PDF to bhuisman at greywyvern dot com? I'd appreciate a look. Thanks.
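
For readers unfamiliar with the terms: in a PDF content stream, q saves the graphics state and Q restores it, so every Q should have an earlier matching q. The snippet below is only a rough sketch of how one might test a stream for that balance. The function name graphicsStateIsBalanced() is hypothetical and not part of PdfParser, and the whitespace tokenizer assumes that strings and binary data have already been removed from the stream.

```php
<?php
// Hypothetical diagnostic helper; NOT part of smalot/pdfparser.
// Assumes $content is a decoded content stream with strings already removed,
// so a bare "q" or "Q" token can only be a graphics-state operator.
function graphicsStateIsBalanced(string $content): bool
{
    $depth = 0;
    // Split on runs of whitespace; fall back to an empty list on failure.
    $tokens = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    foreach ($tokens as $token) {
        if ($token === 'q') {
            $depth++;            // save graphics state
        } elseif ($token === 'Q') {
            $depth--;            // restore graphics state
            if ($depth < 0) {
                return false;    // a Q with no matching q
            }
        }
    }

    return $depth === 0;         // false if a q was never restored
}

var_dump(graphicsStateIsBalanced("q 1 0 0 1 10 10 cm Q")); // bool(true)
var_dump(graphicsStateIsBalanced("q BT ET Q Q"));          // bool(false)
```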

iGrog commented 3 months ago

@GreyWyvern Thanks. I've sent the PDF to your email.

GreyWyvern commented 3 months ago

Thanks! It turns out this PDF has an inline image object which is fouling up the parser in formatContent(). The parser removes strings, but it should be removing these inline images too. I'll work on a solution for this.
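
To illustrate the idea of removing inline images before parsing, here is a minimal sketch. It is not the code from the eventual fix, and stripInlineImages() is a hypothetical name. It relies on the PDF convention that an inline image is written as BI, a small dictionary, ID, the raw image bytes, and EI. The regular expression is a simplification: it would cut an image short if its binary data happened to contain a whitespace-delimited EI.

```php
<?php
// Hypothetical helper; NOT the actual fix and NOT part of smalot/pdfparser.
// Removes inline image objects (BI <dict> ID <data> EI) from a decoded
// content stream so their raw bytes are not mistaken for operators.
function stripInlineImages(string $content): string
{
    // (?<!\S) and (?!\S) require whitespace or a string boundary around the
    // operators; the /s flag lets "." span newlines inside the image data.
    $pattern = '/(?<!\S)BI\s.*?\sID\s.*?\sEI(?!\S)/s';

    // preg_replace() returns null on failure; keep the input in that case.
    return preg_replace($pattern, '', $content) ?? $content;
}

// The image data below contains stray "Q" and "q" bytes that are not operators.
$stream = "q\nBI /W 2 /H 2 /CS /G /BPC 8 ID \x00Q\xffq EI\nQ\nBT (Hello) Tj ET";

echo stripInlineImages($stream);
```

After the call, only the q and Q operators and the BT ... ET text block remain in the example stream.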

GreyWyvern commented 2 months ago

@iGrog can you verify that the code from #693 resolves your issues? I've been using the "fixed" code for several weeks now and haven't had any issues myself, so I'd like to switch it out from being a draft. Thanks!

iGrog commented 2 months ago

@GreyWyvern I've tested dozens of PDF files, and all of them parsed successfully (including those that used to crash with the null array offset error). Looks like it's working :) Thank you!