smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Process Multiple Files #210

Open Bjortin opened 6 years ago

Bjortin commented 6 years ago

Hi,

I've got a folder on the server with several PDF files (all of them using the same template). I've been trying hard to figure out how to parse all of them in one run with PDF Parser, but it only outputs data for the first file. It's as if you are not allowed to process one file and then continue with the next one in a foreach loop. I've tried creating a new instance of the Parser inside the foreach for each file, and also reusing the same Parser for all files. Consider the following example, where the Parser is instantiated outside the foreach. How would I go about processing all files at once?

My goal is to fetch data from several reports, merge the data into a JSON object, and use it for import into other applications.

error_reporting(E_ALL);
ini_set("display_errors", 1);

include 'vendor/autoload.php';

$report_files = glob(dirname(FILE) .'/reports/*.pdf');

// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();

foreach($report_files as $file) {
    // $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($file);

    // Retrieve all pages from the pdf file.
    $pages = $pdf->getPages();

    // Loop over each page to extract text.
    foreach ($pages as $page)
    {
        echo $page->getText();
        echo "<hr>";
    }
}

pvggth commented 6 years ago
error_reporting(E_ALL);
ini_set("display_errors", 1);

include 'vendor/autoload.php';

$report_files = glob(dirname(__FILE__) .'/reports/*.pdf');

// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();

foreach($report_files as $file) {
    // $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($file);

    // Retrieve all pages from the pdf file.
    $pages = $pdf->getPages();

    // Loop over each page to extract text.
    foreach ($pages as $page) {
        echo $page->getText();
        echo "<hr>";
    }
}

This is essentially how I parse multiple files, and it should work. Note the change to __FILE__ in the glob() call.
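
For the stated goal of merging several reports into one JSON object, a rough sketch along the following lines might help. It is not taken from the library's documentation: the function names collectReportText and makePdfParserExtractor, the 'reports'/'errors' array layout, and the autoload path are all made up for illustration. It only relies on the parseFile(), getPages() and getText() calls already used above. Each file is wrapped in a try/catch on the assumption (unconfirmed in this thread) that a later file failing to parse could be what stops the output after the first one.

```php
<?php
// Hypothetical helper: run $extractPages on every file and collect the results.
// $extractPages is any callable that takes a file path and returns an array
// of page texts; a failure on one file is recorded instead of ending the run.
function collectReportText(array $files, callable $extractPages): array
{
    $result = ['reports' => [], 'errors' => []];

    foreach ($files as $file) {
        $name = basename($file);
        try {
            $result['reports'][$name] = $extractPages($file);
        } catch (\Throwable $e) {
            // Keep going with the remaining files and remember what went wrong.
            $result['errors'][$name] = $e->getMessage();
        }
    }

    return $result;
}

// Hypothetical factory: builds an extractor backed by smalot/pdfparser.
// Assumes the library was installed with Composer and that $autoloadPath
// points at vendor/autoload.php.
function makePdfParserExtractor(string $autoloadPath): callable
{
    require_once $autoloadPath;
    $parser = new \Smalot\PdfParser\Parser();

    return function (string $file) use ($parser): array {
        $texts = [];
        foreach ($parser->parseFile($file)->getPages() as $page) {
            $texts[] = $page->getText();
        }
        return $texts;
    };
}
```

With the library installed, usage could look something like `echo json_encode(collectReportText(glob(__DIR__ . '/reports/*.pdf'), makePdfParserExtractor(__DIR__ . '/vendor/autoload.php')), JSON_PRETTY_PRINT);`. Any file that throws during parsing would then show up under 'errors' with its message, which may also help narrow down why only the first file produces output.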

rubenvanerk commented 4 years ago

@Bjortin is your question answered?