<tocpagebreak> and import - infinite feof(), ftell(), fseek() and fread() error

dsturm commented 4 years ago

I found this bug / would like to have this new functionality

When importing pages from an external pdf AND use a table of contents (<tocpagebreak>) the generation fails in a loop of errors. I already found this entry on stackoverflow on this issue.

Debugging on this, it seems, that this also depends on the amount of entries in the toc. I.e. when I reduce these, the generation succeeds up to six entries.

This is mPDF and PHP version and environment (server/fpm/cli etc) I am using

mPDF 8.0.5

This is a PHP code snippet I use

<?php
require_once __DIR__ . '/vendor/autoload.php';

function import_page(&$mpdf, $path)
{
    $page_count = $mpdf->setSourceFile($path);
    for ($j = 1; $j <= $page_count; $j++) {
        $page_id = $mpdf->importPage($j);
        $mpdf->useTemplate($page_id);
        if ($j < $page_count) {
            $mpdf->AddPage();
        }
    }
}

$mpdf = new \Mpdf\Mpdf();
$mpdf->WriteHTML('<tocpagebreak links="1" toc-preHTML="&lt;h2&gt;Table of Contents&lt;/h2&gt;" toc-resetpagenum="1" />');

$files = scandir('../../output/');
natsort($files);

foreach ($files as $file) {
    if (strpos($file, ".pdf") !== false) { // File
        $title = substr($file, strpos($file, '.') + 2, -4);
        $mpdf->TOC_Entry(htmlspecialchars($title, ENT_QUOTES), 0);

        import_page($mpdf, '../../output/' . $file);
        $mpdf->WriteHTML('<pagebreak>');
    } elseif ($file !== "." && $file !== "..") { // Directory
        $title = substr($file, strpos($file, '.') + 2);
        $mpdf->TOC_Entry(htmlspecialchars($title, ENT_QUOTES), 0);

        $sub_files = scandir('../../output/' . $file . '/');
        natsort($sub_files);

        foreach ($sub_files as $sub_file) {
            if (strpos($sub_file, ".pdf") !== false) { // File
                $sub_title = substr($sub_file, strpos($sub_file, '.') + 2, -4);
                $mpdf->TOC_Entry(htmlspecialchars($sub_title, ENT_QUOTES), 1);

                import_page($mpdf, '../../output/' . $file . '/' . $sub_file);
                $mpdf->WriteHTML('<pagebreak>');
            }
        }
    }
}

$mpdf->Output();

mirkoiv commented 4 years ago

Problem here is in DeepCopy of TOC object and referenced objects.

(copy from stackoverflow)

By default StreamReader object is created with closeStream flag enabled when you are setting PDF source file as filename string.

    $page_count = $pdf -> SetSourceFile($path);

Problem here is in deep object cloning during TOC generation, MPDF object is cloned as well as all StreamReader objects.

When cloned TOC object goes out of scope, its garbage collected, and destructor of cloned StreamReader class is called and file handle is closed, causing invalid stream in original object.

Solution for this, is to set PDF source file as StreamReader object with closeStream flag disabled.

Using your example it should look something like this:

    function import_page($pdf, $path) {
    $fh = fopen($path, 'rb');
    $sr = new StreamReader($fh, false);
        $page_count = $pdf -> SetSourceFile(sr);
        for($j = 1; $j <= $page_count; $j++) {
            $page_id = $pdf -> ImportPage($j);
            $pdf -> UseTemplate($page_id);
            if($j < $page_count) {
                $pdf -> AddPage();
            }
        }
    }

dsturm commented 4 years ago

Hey @mirkoiv, thanks for your hints! I'll give it a try. 👍🏼

jakejackson1 commented 4 years ago

@mirkoiv thanks for the great insight on this. Do you think there's a way to resolve this permanently when using Deep Copy to clone the object? Rather than try to solve this at the user-level?

mirkoiv commented 4 years ago

Hi @jakejackson1 ,

After review of related code, I think that easiest solution would be to create an empty "__clone" method in MPDF class.

This way MPDF object will not be cloned using DeepCopy->copy() method, and cloned TOC will keep reference to the original MPDF object.

jakejackson1 commented 4 years ago

Unfortunately that isn't feasible as the Mpdf object holds a lot of data about PDF pages, their order, page numbering etc. We clone the entire object so we can correctly handle the Page Number Substitutions after the TOC(s) get merged into the document (done at the very end of processing). Doing the clone is a shortcut that fixed a bunch of bugs with the TOC functionality. Ideally, a full rewrite of the entire Page Numbering / Substitution system should be done, but that's a big undertaking.

mirkoiv commented 4 years ago

Didn't have time to get into the details how MPDF really works.

If you need cloned MPDF object, than you can disable cloning of StreamReader objects.

You can ask someone from FPDI project to add an empty "clone" method to StreamReader class, or you can extend existing StreamReader class with just this new empty "clone" method and use it in MPDF instead of original StreamReader class.

jakejackson1 commented 4 years ago

Thanks for the advice!

Assuming there are no unintended side effects, extending the StreamReader class in Mpdf would probably be the best way to go.

mpdf / mpdf