ssimms / pdfapi2

Create, modify, and examine PDF files in Perl

Skip junk at end of file #80

Open justinschoeman opened 4 months ago

justinschoeman commented 4 months ago

We are encountering a large number of files in the wild with junk at the end (usually HTML from buggy download pages).

The current open() function in Basic/PDF/File.pm gives up after scanning only the last 1 kB of the file for the startxref ... %%EOF marker.

The change below keeps scanning all the way back to the beginning of the file (in a horribly inefficient way: a sliding window of up to 1 kB, moved back 16 bytes at a time), but it seems to work:

    #foreach my $offset (1..64) {
    #   $fh->seek($end - 16 * $offset, 0);
    #   $fh->read($buffer, 16 * $offset);
    #   last if $buffer =~ m/startxref($cr|\s*)\d+($cr|\s*)\%\%eof.*?/i;
    #}
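    # Start with a small window at the end of the file and work backwards.
    my $scan_length = 16;
    my $scan_start = $end - $scan_length;
    for (;;) {
        $fh->seek($scan_start, 0);
        $fh->read($buffer, $scan_length);
        # Stop once the window contains a complete startxref ... %%EOF marker.
        last if $buffer =~ m/startxref($cr|\s*)\d+($cr|\s*)\%\%eof.*?/i;
        # Give up when there is no room to move back another 16 bytes.
        last if $scan_start < 16;
        $scan_start -= 16;
        # Grow the window by 16 bytes per step until it reaches 1 kB.
        if ($scan_length < 1024) { $scan_length += 16; }
    }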

justinschoeman commented 4 months ago

Actually, start with $scan_length = 32. The initial 16 is pointless.

    my $scan_length = 32;
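
The sliding window re-reads almost every byte dozens of times when the junk is large. One possible alternative, only a sketch and not what Basic/PDF/File.pm does, is to walk backwards in larger chunks and carry a small overlap so that a marker split across a chunk boundary is still seen. The name `find_startxref`, the 4096-byte chunk size, and the 64-byte overlap are assumptions made for illustration, and the regex uses plain `\s*` instead of the module's `$cr`.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::API2): scan backwards from the end of
# $fh and return the offset that follows the last "startxref" keyword which is
# itself followed by %%EOF. Returns undef if no such marker is found.
sub find_startxref {
    my ($fh, $chunk_size, $overlap) = @_;
    $chunk_size ||= 4096;   # bytes read per step (assumed value)
    $overlap    ||= 64;     # assumed to be longer than any trailer marker

    seek($fh, 0, 2) or return;   # whence 2 = seek relative to end of file
    my $pos  = tell($fh);        # everything from $pos onwards has been scanned
    my $tail = '';               # leading bytes of the previously scanned buffer

    while ($pos > 0) {
        my $start = $pos > $chunk_size ? $pos - $chunk_size : 0;
        my $want  = $pos - $start;
        seek($fh, $start, 0) or return;
        my $got = read($fh, my $chunk, $want);
        return unless defined $got && $got == $want;

        # Append the overlap so a marker straddling the boundary still matches.
        my $buffer = $chunk . $tail;

        # Keep the last match in this buffer: it is the one closest to the end.
        my $offset;
        while ($buffer =~ /startxref\s*(\d+)\s*%%eof/gi) {
            $offset = $1;
        }
        return $offset if defined $offset;

        $tail = substr($buffer, 0, $overlap);
        $pos  = $start;
    }
    return;
}
```

It can be tried on an in-memory filehandle, e.g. `open(my $fh, '<', \$pdf_bytes)` followed by `find_startxref($fh)`. Each byte is read once (plus the overlap), so the cost is linear in the amount of trailing junk rather than roughly 64 times that.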