smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Problems after trying to extract broken file text #288

Open uginroot opened 4 years ago

uginroot commented 4 years ago

Problems:

Broken file

Example:

use Smalot\PdfParser\Parser;

// Returns the extracted text, or null if parsing throws an exception.
function pdfToText(string $path): ?string
{
    $content = file_get_contents($path);
    $parser = new Parser();

    try {
        return $parser->parseContent($content)->getText();
    } catch (\Exception $exception) {
        return null;
    }
}

k00ni commented 4 years ago

Can you tell us what you expect and what actually happens?

You pasted a try-catch, so I assume it raises an exception and returns null? Or does it run into a fatal error?

uginroot commented 4 years ago

The problem is that after an exception the output stops working and the application starts consuming a lot of memory.

A workaround that has helped me cope with this for now is to reject any file that does not start with %PDF before parsing:

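use Smalot\PdfParser\Parser;

function pdf2text($path): ?string
{
    $content = file_get_contents($path);

    // Skip unreadable files and anything that does not start with the PDF header.
    if ($content === false || strpos($content, '%PDF') !== 0) {
        return null;
    }

    try {
        $parser = new Parser();
        return $parser->parseContent($content)->getText();
    } catch (\Exception $exception) {
        // Parsing failed; treat the file as unreadable.
        return null;
    }
}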
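
The same header check could probably be done without loading the whole file, which may matter given the memory growth described above. This is only a minimal sketch: `hasPdfHeader` is a hypothetical helper name, not part of PdfParser, and it assumes (as the workaround does) that a valid file starts with `%PDF` at offset 0.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): returns true when the file
 * starts with the "%PDF" marker. Only the first four bytes are read, so a
 * large non-PDF file is never loaded into memory in full.
 */
function hasPdfHeader(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    try {
        // fread() returns fewer bytes (or an empty string) for short files.
        $header = fread($handle, 4);
    } finally {
        fclose($handle);
    }

    return $header === '%PDF';
}
```

With a check like this, `pdf2text()` could return null before calling `file_get_contents()` at all, and only hand the content to the parser when the header matches.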