ssimms / pdfapi2

Create, modify, and examine PDF files in Perl
Other
15 stars 21 forks source link

Appending many pages goes quadratic #77

Closed vadim-160102 closed 6 months ago

vadim-160102 commented 9 months ago

For e.g. "data merge", VDP printing and similar cases, PDF files with a few thousand pages aren't extraordinary. Content is generated and appended "on the fly" page by page; PDF::API2 can't add pages other than "one by one" anyway.

use strict;
use warnings;
use Time::HiRes 'time';
use File::Temp 'tempfile';
use PDF::API2 2.045;

my ( $fh, $tmp ) = tempfile;

for my $n ( map $_ * 1000, 1 .. 5 ) {
    my $t = time;
    my $pdf = PDF::API2-> new( file => $tmp );
    for ( 1 .. $n ) {
        $pdf-> page;
    }
    $pdf-> save;
    printf "%d\t%.2f\n", $n, time - $t
}

It prints:

1000    0.53
2000    1.95
3000    4.32
4000    8.66
5000    15.52

Compare:

use PDF::Reuse;
for my $n ( map $_ * 1000, 1 .. 5 ) {
    my $t = time;
    prFile( $tmp );
    for ( 1 .. $n ) {
        prText( 0, 0, 'text' );
        prPage;
    }
    prEnd;
    printf "%d\t%.2f\n", $n, time - $t
}

and:

1000    0.03
2000    0.06
3000    0.08
4000    0.11
5000    0.14

(Had to add at least something to every page with that module)

As Devel::NYTProf reveals, time is wasted in PDF::API2::Basic::PDF::Pages::find_page_recurse. Perhaps appending pages is common enough task, no need to constantly "find" which page to append after i.e. the last.

A patch:

--- PDF/API2/Basic/PDF/Pages orig.pm    Mon Sep 25 20:00:09 2023
+++ PDF/API2/Basic/PDF/Pages.pm Fri Feb 23 11:28:43 2024
@@ -132,15 +132,23 @@
     my ($self, $page, $page_number) = @_;
     my $top = $self->get_top();

-    $page_number = -1 unless defined $page_number and $page_number <= $top->{'Count'}->val();
+    $page_number = -1 unless defined $page_number and $page_number < $top->{'Count'}->val();

     my $previous_page;
     if ($page_number == -1) {
-        $previous_page = $top->find_page($top->{'Count'}->val() - 1);
+        $previous_page = $top->{' last_page'}
+            ? $top->{' last_page'}
+            : $top->find_page($top->{'Count'}->val() - 1);
+        $top->{' last_page'} = $page;
     }
     else {
         $page_number = $top->{'Count'}->val() + $page_number + 1 if $page_number < 0;
-        $previous_page = $top->find_page($page_number);
+        if ($top->{'Count'}->val() == scalar $top->{'Kids'}->realise->elements()) {
+            $previous_page = ($top->{'Kids'}->elements())[$page_number];
+        }
+        else {
+            $previous_page = $top->find_page($page_number);
+        }
     }

     my $parent;
@@ -180,7 +188,7 @@

     my $parent = $self;
     my $max_kids_per_parent = 8; # Why?
-    if (scalar $parent->{'Kids'}->elements() >= $max_kids_per_parent and $parent->{'Parent'} and $page_index < 1) {
+    if (scalar $parent->{'Kids'}->elements() >= $max_kids_per_parent and $parent->{'Parent'} and $page_index < 0) {
         my $grandparent = $parent->{'Parent'}->realise();
         $parent = $parent->new($parent->_pdf(), $grandparent);

Test suite runs OK; benchmark is now tolerable:

1000    0.18
2000    0.32
3000    0.49
4000    0.67
5000    0.84

In the patch, the last line changed is for https://github.com/ssimms/pdfapi2/issues/70. 2 shortcuts above it:

  1. The ' last_page' property of pages tree root is created/maintained/used. If PDF::API2 is ever going to shuffle/remove pages, this property should be invalidated by these (non-existent now) calls.
  2. Not for appending, but inserting anywhere in case of "tree" being a simple flat list (***). This shortcut could be used alone instead of (1), which would be slightly slower.

The very 1st line changed by the patch: the <= must be <. I noticed it accidentally while looking at debugging output. Seems harmless, like as if this bug is compensated/cancelled-out somewhere nearby.

(+) Please note, large positive argument to PDF::API2::page is clamped to valid range, but large negative one causes a crash with ugly error message.

(*) Further note, or rather RFC. ("Patch" is for whomever does VDP, etc. and bumps into performance issues. Slim chance it's accepted into distribution. Then "RFC" is even funnier.) I've never seen issues/slow-downs/RAM-hogs with Reader or other PDF consumer apps if pages "tree" is a flat array/list, even for tens of thousands of pages, compared to a "proper tree". The latest PDF 2.0 spec repeats after its 1990-s predecessor: "balanced tree for performance and/or limited memory". Do they just copy-paste it mindlessly, is/was this consideration relevant? Adding pages into a tree is triggered, in PDF::API2, only very rarely. E.g., benchmarking code above creates a flat list 5000 pages long. OTOH, the PDF::Reuse, also above, does** create a tree with no more than 10 branches/leaves per node. Is it worth pursuing this goal e.g. to rebuild_tree?

ssimms commented 6 months ago

Looks good to me, and as someone who frequently creates large VDP files, I'm happy to accept it.

Please note, large positive argument to PDF::API2::page is clamped to valid range, but large negative one causes a crash with ugly error message.

I noticed this as well when looking over the patch. It's now clamped to the first page if an excessively large negative number is given, mirroring what happens for an excessively large positive number.

On rebuilding the tree: no idea if it's worth doing. Off the top of my head, I'd guess it wouldn't impact performance significantly unless we're at a document size where PDF::API2 is probably going to fall over for other reasons anyway. But I haven't done any benchmarks, so that's just a guess.

vadim-160102 commented 6 months ago

Unfortunately, in your commit, unlike in proposed patch, the ' last_page' property is not updated. This can lead to funny bugs if file to be modified contains pages tree which is anything but flat list. The 'sample.pdf' from RT121911 can serve as real-life test subject, but, below, synthetic sample is generated. (It has also lead to alas unnecessary debugging efforts like "What on earth is going on now, is there yet another fix required to solve this 'let's add some pages' problem?". Thankfully, no. Not yet.)

use strict;
use warnings;
use feature 'say';

use PDF::Reuse;
use PDF::API2 2.046;

prFile( 'test.pdf' );
for my $i ( 1 .. 10 ) {
    prFontSize( 200 );
    prText( 50, 600, $i );
    prPage;
}
prEnd;

my $pdf = PDF::API2-> open( 'test.pdf' );
my $f = $pdf-> font( 'Helvetica' );
for my $i ( 11 .. 96 ) {
    my $p = $pdf-> page;
    $p-> graphics
      -> fill_color( 'red' );
    $p-> text
      -> font( $f, 200 )
      -> position( 50, 600 )
      -> text( $i );
}
$pdf-> page_mode( 'thumbnails' );
$pdf-> update;

This results in folio numbers sequence

1 .. 18, reverse 19 .. 96

but then order is restored if original patch is used or commit is fixed in obvious way. As a side note, I had to use deprecated method to 'update' a file previously opened; the 'save' wants an argument which I don't think it should.

ssimms commented 6 months ago

Fixed. Hopefully there isn't "yet another fix required" any more.