smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Allowed memory exhausted when parsing the PDF file. #631

Open durifal opened 11 months ago

durifal commented 11 months ago

Description:

Trying to parse this PDF always results in an "Allowed memory size exhausted" error.

Error: Allowed memory size of 1077936128 bytes exhausted (tried to allocate 335544320 bytes) in ...../smalot/pdfparser/src/Smalot/PdfParser/Font.php, line 223

Raising the PHP memory limit to 4 GB did not help either. I also tried lowering the limit with setDecodeMemoryLimit, but still hit the same memory issue. Setting the decode memory limit prevents the error only when I set it to 1000 or lower. So maybe it should be set in MB and not in bytes, or there is a bug in the code.
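For scale, the byte counts quoted above can be converted to MiB. This is only a minimal sketch: formatMiB is a hypothetical helper written for this illustration, not part of PdfParser.

```php
<?php

// Hypothetical helper (not part of PdfParser): format a byte count in MiB.
function formatMiB(int $bytes): string
{
    return sprintf('%.2f MiB', $bytes / (1024 * 1024));
}

echo formatMiB(1077936128), "\n"; // memory limit reported in the error: 1028.00 MiB
echo formatMiB(335544320), "\n";  // the single allocation that failed: 320.00 MiB
echo formatMiB(1000000), "\n";    // a decode memory limit of 1000000, read as bytes: 0.95 MiB
```

Read as bytes, a decode memory limit of 1000000 is already below 1 MiB, which is far smaller than the 320 MiB allocation that failed.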

PDF input

test_pdf.pdf

Expected output & actual output

The parser should either extract the text from the PDF, return an empty string, or throw an exception, rather than fail with a memory error.
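Until the parser handles such files itself, one possible workaround is to run the parse in a child PHP process, because a memory-limit violation is a fatal error that try/catch cannot intercept. The sketch below is an assumption-laden illustration, not something PdfParser provides: runIsolated is a hypothetical helper, and it assumes the CLI binary is available through PHP_BINARY and that exec() is enabled.

```php
<?php

/**
 * Hypothetical helper: run a PHP snippet in a child process with its own
 * memory limit, so that a fatal "Allowed memory size exhausted" error ends
 * the child instead of the calling script.
 */
function runIsolated(string $phpCode, string $memoryLimit = '128M'): array
{
    $command = escapeshellarg(PHP_BINARY)
        . ' -d memory_limit=' . escapeshellarg($memoryLimit)
        . ' -r ' . escapeshellarg($phpCode)
        . ' 2>&1';

    // $outputLines and $exitCode are filled in by exec().
    exec($command, $outputLines, $exitCode);

    return [
        'ok'     => $exitCode === 0,              // a fatal error gives a non-zero exit code
        'output' => implode("\n", $outputLines),  // child output, including error text
    ];
}

// The child tries to allocate far more than its limit; the parent keeps running.
$result = runIsolated('$s = str_repeat("x", 128 * 1024 * 1024);', '32M');
echo $result['ok'] ? "child finished\n" : "child failed, parent still alive\n";
```

In real use the snippet passed to the child would load the autoloader, call parseFile() and print the extracted text, and the caller would treat a failed child as "no text available" for that attachment.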

Code

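use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$config = new Config();
$url = 'path_to_PDF_folder/test_pdf.pdf';
$config->setRetainImageContent(false);  // discard raw image data to save memory
$config->setDecodeMemoryLimit(1000000); // intended as a limit in bytes
$parser = new Parser([], $config);
$pdf = $parser->parseFile($url);

k00ni commented 11 months ago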

Thanks for reporting. What program did you use to generate the PDF? To make sure the error still exists, please try again with the latest master branch.

durifal commented 11 months ago

I do not know what generated the PDF. Visitors to our sites upload these files as cover letters, and we parse them so that full-text search also covers attachments. I only edited the PDF in Adobe's PDF editor to anonymize the data.

We have hit this problem multiple times while parsing PDFs, so I can anonymize more examples if necessary. It is pretty rare, though (about 10 PDFs out of 1,000,000). All of them had text on a colored background on one page.

I tested the problematic PDF on the master branch as well, with the same result:

Fatal error: Allowed memory size of 1077936128 bytes exhausted (tried to allocate 335544320 bytes) in ........../smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 230

k00ni commented 11 months ago

Thank you for the feedback.

denydias commented 8 months ago

A similar issue also hit me. I'll post it here because this looks like a common unhandled error, but let me know if you need a separate issue. Just as in the OP's case, only a small portion of a much larger batch appears to be affected.

As for the PDF creator:

Creator: Adobe Acrobat 7.0
Producer: Adobe Acrobat 7.0 Paper Capture Plug-in

PdfParser exception:

[2023-10-21 07:48:18] ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 204800 bytes) {
  "userId":2,"exception":"[object] (
    Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0):
    Allowed memory size of 134217728 bytes exhausted (tried to allocate 204800 bytes) at
    vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
  )
  [stacktrace]
  #0 {main}"
}
[2023-10-21 07:48:20] ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 188416 bytes) {
  "userId":2,"exception":"[object] (
    Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0):
    Allowed memory size of 134217728 bytes exhausted (tried to allocate 188416 bytes) at
    vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
  )
  [stacktrace]
  #0 {main}"
}

k00ni commented 8 months ago

@denydias can you provide the PDFs that cause this exception?

Also, try #634 and check if the exception remains.

denydias commented 8 months ago

Thank you for the quick reply, @k00ni! I'll try the PR and let you know the results. Please expect some delay as these are very busy days here.

denydias commented 8 months ago

@k00ni is there a way to send the source document for your eyes only? It cannot be shared publicly.

As for the tests with #634, before (using v2.7.0):

PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 12288 bytes) in tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 775
PHP Stack trace:
PHP   1. {main}() tests/pdfparser/test.php:0
PHP   2. getPDFPageCount($file = 'test.pdf', $origin = 'test') tests/pdfparser/test.php:19
PHP   3. Smalot\PdfParser\Parser->parseFile($filename = 'test.pdf') tests/pdfparser/test.php:29
PHP   4. Smalot\PdfParser\Parser->parseContent([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:90
PHP   5. Smalot\PdfParser\RawData\RawDataParser->parseData([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:102
PHP   6. Smalot\PdfParser\RawData\RawDataParser->getIndirectObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:945
PHP   7. Smalot\PdfParser\RawData\RawDataParser->getRawObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:557
PHP   8. substr([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:775

After (using master+#634):

PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 32768 bytes) in tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 104
PHP Stack trace:
PHP   1. {main}() tests/pdfparser/test.php:0
PHP   2. getPDFPageCount($file = 'test.pdf', $origin = 'test') tests/pdfparser/test.php:10
PHP   3. Smalot\PdfParser\Parser->parseFile($filename = 'test.pdf') tests/pdfparser/test.php:20
PHP   4. Smalot\PdfParser\Parser->parseContent([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:91
PHP   5. Smalot\PdfParser\RawData\RawDataParser->parseData([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:103
PHP   6. Smalot\PdfParser\RawData\RawDataParser->getIndirectObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:958
PHP   7. Smalot\PdfParser\RawData\RawDataParser->decodeStream([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:104

My env:

$> php --version
PHP 8.2.12 (cli) (built: Oct 26 2023 18:01:05) (ZTS)
Copyright (c) The PHP Group
Zend Engine v4.2.12, Copyright (c) Zend Technologies
    with Zend OPcache v8.2.12, Copyright (c), by Zend Technologies
    with Xdebug v3.2.2, Copyright (c) 2002-2023, by Derick Rethans
$> composer --version
Composer version 2.6.5 2023-10-06 10:11:52

Test script:

<?php

ini_set("memory_limit", "128M");

require __DIR__ . '/vendor/autoload.php';

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$pages = getPDFPageCount('test.pdf', 'test');
echo "File has $pages pages\n";

function getPDFPageCount(string $file, string $origin): mixed
{
    $config = new Config();
    $config->setRetainImageContent(false);
    $parser = new Parser([], $config);
    try {
        $pdf = $parser->parseFile($file);
        $details = $pdf->getDetails();
        return $details['Pages'];
    } catch (Exception $e) {
        $pages = 0;
        echo $e->getMessage();
        return $pages;
    }
}

k00ni commented 8 months ago

@denydias Thank you for your detailed answer. Please don't send me the PDF privately; I don't do private support via mail.

#634 is the latest big set of changes, so there was a chance that it might cover this case. The problem with these errors is that they seem to be very PDF-dependent. We need further work on the parsing part to avoid endless loops/recursion.

denydias commented 8 months ago

@k00ni I understand you don't provide private support, and I'm not asking you to. I'm reporting an issue and offering to privately share the file in which the problem occurs, in the hope that it helps you improve your product. I'm not asking for any guarantees or even a reply on that matter.

In most cases I agree with you on the PDF-dependent claim, but this particular file is part of a set of 1,706 files produced by a "pretty standard" (TM) workflow. As just this one triggers an exception, it looks like a perfect candidate for an edge case worth looking into. But this is not my call.

k00ni commented 8 months ago

As just this one triggers an exception, it looks like a perfect candidate for an edge case worth looking into. But this is not my call.

You are right. Would you create a pull request and help us solve the issue?

denydias commented 8 months ago

I'll dive into it when I get the time, @k00ni.

kreuss90 commented 7 months ago

I have the same issue (memory exhausted, in my case at 500 MB), also with just one PDF on my website. I will provide a link to the document at the end of this post. Another detail matches what @durifal wrote: the document has a colored background, unlike all my other documents.

Creator: Microsoft PowerPoint 2016
Link: https://memoone.de/Materialien/5.%20Fortbildungsmaterialien/1.%20Rechnernetze/1.%20Vortrag/1_MAT_Vortrag.pdf

I hope this helps you find the bug. Thanks for providing that great library!

Kind regards,
Kevin

sj-i commented 7 months ago

To test a development version of our memory profiler, I've tried to investigate the leak in the original issue.

Test script

<?php

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

include "vendor/autoload.php";

ini_set('memory_limit', '128M');

register_shutdown_function(
    function (): void {
        $error = error_get_last();
        if (is_null($error)) {
            return;
        }
        if (strpos($error['message'], 'Allowed memory size of') !== 0) {
            return;
        }
        $pid = getmypid();
        $file_opt = '--memory-limit-error-file=' . escapeshellarg($error['file']);
        $line_opt = '--memory-limit-error-line=' . escapeshellarg($error['line']);
        system("sudo reli i:m -p {$pid} --no-stop-process {$file_opt} {$line_opt} >memory_analyzed.json");
    }
);

$config = new Config();
$url = __DIR__ . '/test_pdf.pdf';
$config->setRetainImageContent(false);
$config->setDecodeMemoryLimit(1000000);
$parser = new Parser([], $config);
$pdf = $parser->parseFile($url);

The summary of the memory usage

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .summary
[
  {
    "zend_mm_heap_total": 130023424,
    "zend_mm_heap_usage": 128245688,
    "zend_mm_chunk_total": 46137344,
    "zend_mm_chunk_usage": 44359608,
    "zend_mm_huge_total": 83886080,
    "zend_mm_huge_usage": 83886080,
    "vm_stack_total": 262144,
    "vm_stack_usage": 1632,
    "compiler_arena_total": 458752,
    "compiler_arena_usage": 7264,
    "possible_allocation_overhead_total": 3893453,
    "possible_array_overhead_total": 248704,
    "memory_get_usage": 128276816,
    "memory_get_real_usage": 130023424,
    "cached_chunks_size": 0,
    "heap_memory_analyzed_percentage": 99.97573372884466,
    "php_version": "v82",
    "analyzer": "reli 0.11.0"
  }
]
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .location_types_summary | jq -r '(["location_type", "count", "memory_usage"] | (., map(length*"="))),(to_entries[]|[.key,.value.count,.value.memory_usage])|@tsv' | column -t -o ' | '
location_type                        | count   | memory_usage
=============                        | =====   | ============
ZendArrayTableMemoryLocation         | 600     | 84052280
ZendStringMemoryLocation             | 1049683 | 38511955
ZendObjectMemoryLocation             | 10278   | 742320
ZendArrayTableOverheadMemoryLocation | 595     | 159296
ObjectsStoreMemoryLocation           | 1       | 131072
ZendArrayMemoryLocation              | 602     | 33712
RuntimeCacheMemoryLocation           | 101     | 7360
CallFrameVariableTableMemoryLocation | 9       | 832
CallFrameHeaderMemoryLocation        | 10      | 800
ZendOpArrayHeaderMemoryLocation      | 1       | 248
StaticMembersTableMemoryLocation     | 5       | 176
ZendResourceMemoryLocation           | 3       | 72
ZendReferenceMemoryLocation          | 2       | 64
ZendMmHugeListMemoryLocation         | 2       | 48

As you can see above, arrays and strings account for most of the memory consumption. The number of arrays is small, so I suspect that just a few arrays are taking up most of that space.
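To put rough numbers on that observation, here is a minimal PHP sketch that works only from the figures printed in the tables above. The variable names are made up for illustration.

```php
<?php

// Figures copied from the reli output above; variable names are illustrative.
$heapUsage = 128245688; // zend_mm_heap_usage
$locations = [
    'ZendArrayTableMemoryLocation' => ['count' => 600,     'bytes' => 84052280],
    'ZendStringMemoryLocation'     => ['count' => 1049683, 'bytes' => 38511955],
];

foreach ($locations as $type => $row) {
    $share   = 100 * $row['bytes'] / $heapUsage; // percentage of the heap in use
    $average = $row['bytes'] / $row['count'];    // mean bytes per item
    printf("%-30s %5.1f%% of heap, %.0f bytes per item on average\n", $type, $share, $average);
}
```

The array tables come to about 65.5% of the heap at roughly 140,000 bytes per table on average, while the million-plus strings average under 40 bytes each. An average that high across only 600 tables is consistent with a handful of very large arrays.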

Finding the culprit arrays

Let's extract the 20 largest ones in order of size.

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '. as $root | path(..|objects|select(."#type"=="ArrayElementsContext"))| . as $path | $root|getpath($path) as $elements | {path: $path|join("."), size: $elements."#locations"[0].size, count: $elements."#count"}' | jq -rs '(["size", "count", "path"] | (., map(length*"="))),(sort_by(.size) | .[-20:] | reverse | .[] | [.size, .count, .path])|@tsv' | column -t -o ' | '
size     | count   | path
====     | =====   | ====
41943040 | 1048576 | context.class_table.smalot\\pdfparser\\font.static_properties.uchrCache.array_elements
41913376 | 1047649 | context.call_frames.3.this.object_properties.table.array_elements
36552    | 2284    | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
13720    | 857     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.26_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
8840     | 552     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
6536     | 408     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.545.value.object_properties.value.array_elements
3496     | 218     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.36.value.object_properties.value.array_elements
3480     | 217     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.148.value.object_properties.value.array_elements
3352     | 209     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.428.value.object_properties.value.array_elements
3264     | 70      | context.call_frames.9.symbol_table.array_elements._SERVER.value.array_elements
2216     | 138     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.110.value.object_properties.value.array_elements
1864     | 116     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.413.value.object_properties.value.array_elements
1784     | 111     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.123.value.object_properties.value.array_elements
1696     | 37      | context.call_frames.7.local_variables.xref.array_elements.xref.value.array_elements
1688     | 105     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.572.value.object_properties.value.array_elements
1672     | 104     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.26.value.object_properties.value.array_elements
1608     | 100     | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.299.value.object_properties.value.array_elements
1600     | 34      | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements
1496     | 93      | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.489.value.object_properties.value.array_elements
1432     | 89      | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.2108.value.object_properties.value.array_elements

Two arrays are the culprits.

Dumping the real stack trace on memory_limit violations is the new feature I wanted to test in this trial (it is not yet released), and it seems to work well.

~/work/oss/tmp/pdfparser_test$  cat memory_analyzed.json | jq -r '(["frame_no", "function", "line"] | (., map(length*"="))),(path(.context.call_frames[]|objects) as $path | [$path[2], getpath($path).function_name, getpath($path).lineno])|@tsv' | column -t
frame_no  function                                                         line
========  ========                                                         ====
0         system                                                           4
1         {closure}(/home/sji/work/oss/tmp/pdfparser_test/test.php:11-21)  20
2         Smalot\\PdfParser\\Font::uchr                                    150
3         Smalot\\PdfParser\\Font::loadTranslateTable                      230
4         Smalot\\PdfParser\\Font::init                                    78
5         Smalot\\PdfParser\\Document::init                                90
6         Smalot\\PdfParser\\Document::setObjects                          316
7         Smalot\\PdfParser\\Parser::parseContent                          122
8         Smalot\\PdfParser\\Parser::parseFile                             90
9         <main>                                                           29

So, two arrays, Font::$uchrCache and Font::$table, are the culprits. Also, the memory_limit violation seems to occur at the point where Font::uchr() is called from Font::loadTranslateTable() at line 230.

Why these arrays grow so large

Then let's also dump some seemingly related local variables.

~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '.context.call_frames."3".local_variables |{char: .char, char_from: .char_from, char_to: .char_to, offset: .offset, key: .key}'
{
  "char": {
    "#node_id": 2147647,
    "#type": "ScalarValueContext",
    "value": 1047644
  },
  "char_from": {
    "#node_id": 2147644,
    "#type": "ScalarValueContext",
    "value": 64287
  },
  "char_to": {
    "#node_id": 2147645,
    "#type": "ScalarValueContext",
    "value": 4276029042
  },
  "offset": {
    "#node_id": 2147646,
    "#type": "ScalarValueContext",
    "value": 4276094578
  },
  "key": {
    "#node_id": 2147638,
    "#type": "ScalarValueContext",
    "value": 50
  }
}
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq -r '.context.call_frames."3".local_variables |.matches.referenced.array_elements."0".value.array_elements."50".value'
{
  "#node_id": 2147143,
  "#type": "StringContext",
  "#locations": [
    {
      "address": 139914548230560,
      "size": 53,
      "refcount": 1,
      "type_info": 22,
      "value": "<FB1F> <FEDF0672> <FEE00672> "
    }
  ]
}

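It seems that one of the $char_to values in the beginbfrange sections comes from a ligature mapping, so both the translation table and the character cache have grown unintentionally large. The dumped values match the string above: $char_from 64287 is 0xFB1F, $char_to 4276029042 is 0xFEDF0672, and $offset 4276094578 is 0xFEE00672, so the range to fill spans roughly 4.3 billion codes.

I am not familiar with the PDF specification, so I cannot send a PR to fix it. Sorry.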
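The arithmetic behind that range can be reproduced in a few lines, together with one possible sanity check. This is a sketch under stated assumptions and not PdfParser's actual code: bfrangeSpan is a hypothetical function, the rule that both source codes of a range have the same length is an assumption about well-formed CMaps, the 0x10000 cap is an arbitrary choice, and the integer results assume 64-bit PHP.

```php
<?php

/**
 * Hypothetical check (not PdfParser code): return the number of codes a
 * "<from> <to> <dst>" bfrange entry covers, or null if it looks malformed.
 */
function bfrangeSpan(string $entry): ?int
{
    if (!preg_match('/<([0-9A-F]+)>\s*<([0-9A-F]+)>\s*<([0-9A-F]+)>/i', $entry, $m)) {
        return null; // not a three-part hexadecimal entry
    }
    [, $fromHex, $toHex] = $m;

    // Assumption: both source codes of one range have the same length.
    if (strlen($fromHex) !== strlen($toHex)) {
        return null;
    }

    $span = hexdec($toHex) - hexdec($fromHex) + 1;

    // Assumption: cap the range at 0x10000 codes; the cap is arbitrary.
    return ($span >= 1 && $span <= 0x10000) ? (int) $span : null;
}

// The entry from the dump: a 2-byte start code followed by a 4-byte "end" code.
printf("from=%d to=%d\n", hexdec('FB1F'), hexdec('FEDF0672')); // from=64287 to=4276029042
var_dump(bfrangeSpan('<FB1F> <FEDF0672> <FEE00672>')); // NULL, entry rejected
var_dump(bfrangeSpan('<0041> <0043> <0061>'));         // int(3)
```

A check of this kind would skip the suspicious entry instead of looping over it. Whether skipping is the right behavior, or whether such an entry should be interpreted differently, is a question for someone who knows the PDF specification.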

I am already happy with the successful testing of my tool, and I hope this report can make someone else happy too.


denydias commented 7 months ago

...I hope this report can make someone else happy too.

I am! Superb debugging job, @sj-i! 👏

4ndrzej commented 3 months ago

We are experiencing the same issue. Any news on this?

intrak commented 1 month ago

Hi there! Any news on this bug? The file from the first post is still problematic. I'm on the newest version, 2.10.0.