Open durifal opened 11 months ago
Thanks for reporting. What program did you use to generate the PDF? To be sure the error still exists, please try again with the latest master branch.
I do not know what generated the PDF: visitors to our site uploaded it as a cover letter, and we parse such attachments so that full-text search covers them too. I have only edited the PDF in the Adobe PDF editor to anonymize the data.
We have hit this problem multiple times while parsing PDFs, so I can anonymize more examples if necessary. It is pretty rare, though (about 10 PDFs out of 1 000 000). All of them had text on a colored background on one page.
I tested the problematic PDF on the master branch as well, with the same result:
Fatal error: Allowed memory size of 1077936128 bytes exhausted (tried to allocate 335544320 bytes) in ........../smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 230
Thank you for the feedback.
A similar issue hit me too. I'll post it here because it looks like a common unhandled exception, but let me know if you need a separate issue. Just like the OP, only a small portion of a much larger batch appears to be affected.
As for the PDF creator:
Creator: Adobe Acrobat 7.0
Producer: Adobe Acrobat 7.0 Paper Capture Plug-in
PdfParser exception:
[2023-10-21 07:48:18] ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 204800 bytes) {
"userId":2,"exception":"[object] (
Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0):
Allowed memory size of 134217728 bytes exhausted (tried to allocate 204800 bytes) at
vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
)
[stacktrace]
#0 {main}"
}
[2023-10-21 07:48:20] ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 188416 bytes) {
"userId":2,"exception":"[object] (
Symfony\\Component\\ErrorHandler\\Error\\FatalError(code: 0):
Allowed memory size of 134217728 bytes exhausted (tried to allocate 188416 bytes) at
vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/FilterHelper.php:239
)
[stacktrace]
#0 {main}"
}
@denydias can you provide your PDFs, which cause this exception?
Also, try #634 and check if the exception remains.
Thank you for the quick reply, @k00ni! I'll try the PR and let you know the results. Please expect some delay as these are very busy days here.
@k00ni is there a way to send the source document for your eyes only? It cannot be shared publicly.
As for the tests with #634, before (using v2.7.0):
PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 12288 bytes) in tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 775
PHP Stack trace:
PHP 1. {main}() tests/pdfparser/test.php:0
PHP 2. getPDFPageCount($file = 'test.pdf', $origin = 'test') tests/pdfparser/test.php:19
PHP 3. Smalot\PdfParser\Parser->parseFile($filename = 'test.pdf') tests/pdfparser/test.php:29
PHP 4. Smalot\PdfParser\Parser->parseContent([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:90
PHP 5. Smalot\PdfParser\RawData\RawDataParser->parseData([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:102
PHP 6. Smalot\PdfParser\RawData\RawDataParser->getIndirectObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:945
PHP 7. Smalot\PdfParser\RawData\RawDataParser->getRawObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:557
PHP 8. substr([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:775
After (using master+#634):
PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 32768 bytes) in tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 104
PHP Stack trace:
PHP 1. {main}() tests/pdfparser/test.php:0
PHP 2. getPDFPageCount($file = 'test.pdf', $origin = 'test') tests/pdfparser/test.php:10
PHP 3. Smalot\PdfParser\Parser->parseFile($filename = 'test.pdf') tests/pdfparser/test.php:20
PHP 4. Smalot\PdfParser\Parser->parseContent([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:91
PHP 5. Smalot\PdfParser\RawData\RawDataParser->parseData([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/Parser.php:103
PHP 6. Smalot\PdfParser\RawData\RawDataParser->getIndirectObject([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:958
PHP 7. Smalot\PdfParser\RawData\RawDataParser->decodeStream([redacted]) tests/pdfparser/vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:104
My env:
$> php --version
PHP 8.2.12 (cli) (built: Oct 26 2023 18:01:05) (ZTS)
Copyright (c) The PHP Group
Zend Engine v4.2.12, Copyright (c) Zend Technologies
with Zend OPcache v8.2.12, Copyright (c), by Zend Technologies
with Xdebug v3.2.2, Copyright (c) 2002-2023, by Derick Rethans
$> composer --version
Composer version 2.6.5 2023-10-06 10:11:52
Test script:
<?php
ini_set("memory_limit", "128M");

require __DIR__ . '/vendor/autoload.php';

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$pages = getPDFPageCount('test.pdf', 'test');
echo "File has $pages pages\n";

function getPDFPageCount(string $file, string $origin): mixed
{
    $config = new Config();
    $config->setRetainImageContent(false);
    $parser = new Parser([], $config);
    try {
        $pdf = $parser->parseFile($file);
        $details = $pdf->getDetails();
        return $details['Pages'];
    } catch (Exception $e) {
        $pages = 0;
        echo $e->getMessage();
        return $pages;
    }
}
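One caveat about the test script above: memory exhaustion is a fatal error rather than an exception, so the catch block never runs for this failure. A minimal sketch of one possible workaround, assuming exec() is enabled and the CLI binary is reachable via PHP_BINARY, is to run the parse in a child process with its own memory limit. The helper name runIsolated() is hypothetical and not part of smalot/pdfparser; the demo snippet only allocates a large string to stand in for a problematic PDF.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): run a PHP snippet in a
// child process so that a memory_limit fatal error only kills the child.
function runIsolated(string $code, string $memoryLimit = '128M'): array
{
    $cmd = sprintf(
        '%s -d memory_limit=%s -r %s 2>&1',
        escapeshellarg(PHP_BINARY),   // assumes the CLI SAPI
        escapeshellarg($memoryLimit),
        escapeshellarg($code)
    );
    exec($cmd, $output, $exitCode);

    // A fatal error in the child shows up as a non-zero exit code.
    return ['ok' => $exitCode === 0, 'output' => implode("\n", $output)];
}

// Stand-in for a bad PDF: try to allocate 64 MB under a 16 MB limit.
$result = runIsolated('$s = str_repeat("x", 64 * 1024 * 1024);', '16M');
echo $result['ok'] ? "child succeeded\n" : "child failed, parent still running\n";
```

In a batch job the child snippet would load the autoloader and call parseFile(); a failed child then marks one file as unparseable instead of stopping the whole run.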
@denydias Thank you for your detailed answer. Don't send me the PDF privately; I don't do private support via mail.
@k00ni I understand you don't provide private support and I'm not asking you to. I'm reporting an issue and looking to privately provide you with the entity where the problem occurs in the hope you can improve your product, but asking no warranties or even replies on that matter.
In most cases I agree with you for the PDF-dependent claim, but this particular one is part of a set with 1.706 files produced by a "pretty standard" (TM) workflow. As just this one triggers an exception, it looks like a perfect candidate for an edge case worth looking into. But this is not my call.
> As just this one triggers an exception, it looks like a perfect candidate for an edge case worth looking into. But this is not my call.
You are right. Would you create a pull request and help us solve the issue?
I'll dive into it when I get the time, @k00ni.
I have the same issue (memory exhausted, in my case at 500 MB), also with just one PDF on my website. I will provide a link to the document at the end of this post. Another similarity to what @durifal wrote: the document has a colored background (unlike all my other documents).
Creator: Microsoft PowerPoint 2016
Link: https://memoone.de/Materialien/5.%20Fortbildungsmaterialien/1.%20Rechnernetze/1.%20Vortrag/1_MAT_Vortrag.pdf
I hope this helps you find the bug. Thanks for providing that great library!
Kind regards Kevin
To test a development version of our memory profiler, I've tried to investigate the leak in the original issue.
<?php
use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

include "vendor/autoload.php";

ini_set('memory_limit', '128M');

register_shutdown_function(
    function (): void {
        $error = error_get_last();
        if (is_null($error)) {
            return;
        }
        if (strpos($error['message'], 'Allowed memory size of') !== 0) {
            return;
        }
        $pid = getmypid();
        $file_opt = '--memory-limit-error-file=' . escapeshellarg($error['file']);
        $line_opt = '--memory-limit-error-line=' . escapeshellarg($error['line']);
        system("sudo reli i:m -p {$pid} --no-stop-process {$file_opt} {$line_opt} >memory_analyzed.json");
    }
);

$config = new Config();
$url = __DIR__ . '/test_pdf.pdf';
$config->setRetainImageContent(false);
$config->setDecodeMemoryLimit(1000000);
$parser = new Parser([], $config);
$pdf = $parser->parseFile($url);
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .summary
[
{
"zend_mm_heap_total": 130023424,
"zend_mm_heap_usage": 128245688,
"zend_mm_chunk_total": 46137344,
"zend_mm_chunk_usage": 44359608,
"zend_mm_huge_total": 83886080,
"zend_mm_huge_usage": 83886080,
"vm_stack_total": 262144,
"vm_stack_usage": 1632,
"compiler_arena_total": 458752,
"compiler_arena_usage": 7264,
"possible_allocation_overhead_total": 3893453,
"possible_array_overhead_total": 248704,
"memory_get_usage": 128276816,
"memory_get_real_usage": 130023424,
"cached_chunks_size": 0,
"heap_memory_analyzed_percentage": 99.97573372884466,
"php_version": "v82",
"analyzer": "reli 0.11.0"
}
]
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq .location_types_summary | jq -r '(["location_type", "count", "memory_usage"] | (., map(length*"="))),(to_entries[]|[.key,.value.count,.value.memory_usage])|@tsv' | column -t -o ' | '
location_type | count | memory_usage
============= | ===== | ============
ZendArrayTableMemoryLocation | 600 | 84052280
ZendStringMemoryLocation | 1049683 | 38511955
ZendObjectMemoryLocation | 10278 | 742320
ZendArrayTableOverheadMemoryLocation | 595 | 159296
ObjectsStoreMemoryLocation | 1 | 131072
ZendArrayMemoryLocation | 602 | 33712
RuntimeCacheMemoryLocation | 101 | 7360
CallFrameVariableTableMemoryLocation | 9 | 832
CallFrameHeaderMemoryLocation | 10 | 800
ZendOpArrayHeaderMemoryLocation | 1 | 248
StaticMembersTableMemoryLocation | 5 | 176
ZendResourceMemoryLocation | 3 | 72
ZendReferenceMemoryLocation | 2 | 64
ZendMmHugeListMemoryLocation | 2 | 48
As you can see above, arrays and strings account for most of the memory consumption. The number of arrays is small, so I suspect that only a few arrays are eating up most of that space.
Let's extract the 20 largest ones in order of size.
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '. as $root | path(..|objects|select(."#type"=="ArrayElementsContext"))| . as $path | $root|getpath($path) as $elements | {path: $path|join("."), size: $elements."#locations"[0].size, count: $elements."#count"}' | jq -rs '(["size", "count", "path"] | (., map(length*"="))),(sort_by(.size) | .[-20:] | reverse | .[] | [.size, .count, .path])|@tsv' | column -t -o ' | '
size | count | path
==== | ===== | ====
41943040 | 1048576 | context.class_table.smalot\\pdfparser\\font.static_properties.uchrCache.array_elements
41913376 | 1047649 | context.call_frames.3.this.object_properties.table.array_elements
36552 | 2284 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
13720 | 857 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.26_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
8840 | 552 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements
6536 | 408 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.545.value.object_properties.value.array_elements
3496 | 218 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.36.value.object_properties.value.array_elements
3480 | 217 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.148.value.object_properties.value.array_elements
3352 | 209 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.428.value.object_properties.value.array_elements
3264 | 70 | context.call_frames.9.symbol_table.array_elements._SERVER.value.array_elements
2216 | 138 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.110.value.object_properties.value.array_elements
1864 | 116 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.413.value.object_properties.value.array_elements
1784 | 111 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.123.value.object_properties.value.array_elements
1696 | 37 | context.call_frames.7.local_variables.xref.array_elements.xref.value.array_elements
1688 | 105 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.572.value.object_properties.value.array_elements
1672 | 104 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.26.value.object_properties.value.array_elements
1608 | 100 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.299.value.object_properties.value.array_elements
1600 | 34 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements
1496 | 93 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.30_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.489.value.object_properties.value.array_elements
1432 | 89 | context.call_frames.3.this.object_properties.document.object_properties.objects.array_elements.34_0.value.object_properties.header.object_properties.elements.array_elements.W.value.object_properties.value.array_elements.2108.value.object_properties.value.array_elements
Two arrays are the culprits.
Dumping the real stack trace on memory_limit violations is the new feature I want to test in this trial (it is not released yet), and it seems to work well.
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq -r '(["frame_no", "function", "line"] | (., map(length*"="))),(path(.context.call_frames[]|objects) as $path | [$path[2], getpath($path).function_name, getpath($path).lineno])|@tsv' | column -t
frame_no function line
======== ======== ====
0 system 4
1 {closure}(/home/sji/work/oss/tmp/pdfparser_test/test.php:11-21) 20
2 Smalot\\PdfParser\\Font::uchr 150
3 Smalot\\PdfParser\\Font::loadTranslateTable 230
4 Smalot\\PdfParser\\Font::init 78
5 Smalot\\PdfParser\\Document::init 90
6 Smalot\\PdfParser\\Document::setObjects 316
7 Smalot\\PdfParser\\Parser::parseContent 122
8 Smalot\\PdfParser\\Parser::parseFile 90
9 <main> 29
So two arrays, Font::$uchrCache and Font::$table, are the culprits. Also, the memory_limit violation seems to occur at the point where Font::uchr() is called from Font::loadTranslateTable() at line 230.
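As an illustration only: the unbounded growth of a lookup cache like the one in the dump can be avoided by capping it. The class below is a hypothetical sketch, not the library's Font::uchr(); the class name, the default cap of 4096 and the html_entity_decode() conversion are assumptions chosen to keep the example self-contained.

```php
<?php
// Hypothetical bounded memo cache for code point -> string lookups.
// This is NOT how smalot/pdfparser implements Font::uchr().
final class BoundedCharCache
{
    /** @var array<int, string> */
    private $cache = [];

    /** @var int */
    private $maxEntries;

    public function __construct(int $maxEntries = 4096)
    {
        $this->maxEntries = $maxEntries;
    }

    public function get(int $code): string
    {
        if (isset($this->cache[$code])) {
            return $this->cache[$code];
        }
        if (count($this->cache) >= $this->maxEntries) {
            // Simplest possible policy: drop everything once the cap is hit.
            $this->cache = [];
        }
        // Stand-in conversion via a numeric HTML entity (an assumption).
        return $this->cache[$code] = html_entity_decode('&#' . $code . ';', ENT_NOQUOTES, 'UTF-8');
    }

    public function size(): int
    {
        return count($this->cache);
    }
}

$cache = new BoundedCharCache(3);
echo $cache->get(65), "\n";
```

A cap like this only limits the cache. It would not stop the second array, Font::$table, from growing, so it is at best half of a mitigation.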
Then let's also dump some seemingly related local variables.
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq '.context.call_frames."3".local_variables |{char: .char, char_from: .char_from, char_to: .char_to, offset: .offset, key: .key}'
{
"char": {
"#node_id": 2147647,
"#type": "ScalarValueContext",
"value": 1047644
},
"char_from": {
"#node_id": 2147644,
"#type": "ScalarValueContext",
"value": 64287
},
"char_to": {
"#node_id": 2147645,
"#type": "ScalarValueContext",
"value": 4276029042
},
"offset": {
"#node_id": 2147646,
"#type": "ScalarValueContext",
"value": 4276094578
},
"key": {
"#node_id": 2147638,
"#type": "ScalarValueContext",
"value": 50
}
}
~/work/oss/tmp/pdfparser_test$ cat memory_analyzed.json | jq -r '.context.call_frames."3".local_variables |.matches.referenced.array_elements."0".value.array_elements."50".value'
{
"#node_id": 2147143,
"#type": "StringContext",
"#locations": [
{
"address": 139914548230560,
"size": 53,
"refcount": 1,
"type_info": 22,
"value": "<FB1F> <FEDF0672> <FEE00672> "
}
]
}
It seems that one of the $char_to values in the beginbfrange sections has a ligature, so both the translation table and the character cache have grown unintentionally large.
I am not familiar with the PDF specification, so I cannot send a PR to fix it. Sorry.
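To make the numbers in the dump concrete, here is a small sketch that reads the matched string as a `<from> <to> <offset>` triple, the way the local variables above suggest it was interpreted, and counts how many entries a naive expansion loop would create. The function names and the 65536-entry cap are hypothetical; this is neither the library's parsing code nor a proposed fix, and it assumes a 64-bit PHP build.

```php
<?php
// Hypothetical helpers for illustration; not part of smalot/pdfparser.

// Read one "<a> <b> <c>" hex triple as from / to / offset.
function bfrangeSpan(string $line): array
{
    if (!preg_match('/<([0-9A-F]+)>\s*<([0-9A-F]+)>\s*<([0-9A-F]+)>/i', $line, $m)) {
        throw new InvalidArgumentException('not a hex triple');
    }
    $from = hexdec($m[1]);
    $to = hexdec($m[2]);
    $offset = hexdec($m[3]);

    // Number of entries a loop from $from to $to (inclusive) would create.
    return ['from' => $from, 'to' => $to, 'offset' => $offset, 'count' => $to - $from + 1];
}

// Assumed sanity cap: treat anything above $maxEntries as suspicious.
function isOversizedRange(array $span, int $maxEntries = 65536): bool
{
    return $span['count'] > $maxEntries;
}

// The string captured in the memory dump.
$span = bfrangeSpan('<FB1F> <FEDF0672> <FEE00672>');
printf("from=%d to=%d count=%d oversized=%s\n",
    $span['from'], $span['to'], $span['count'],
    isOversizedRange($span) ? 'yes' : 'no');
```

With the dumped values, from is 64287 and to is 4276029042, so the loop would try to create 4,275,964,756 entries. At about 40 bytes per array slot (41943040 bytes for 1048576 elements in the dump), that is far more than the 4 GB limit tried in the original report.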
I am already happy with the successful testing of my tool, and I hope this report can make someone else happy too.
> ...I hope this report can make someone else happy too.
I am! Superb debug job, @sj-i! :clap:
We are experiencing the same issue. Any news on this?
Hi there! Any news on this bug? The file from the first post is still problematic. I'm on the newest version, 2.10.0.
Description:
Trying to parse this PDF always results in a memory exhausted error.
Error: Allowed memory size of 1077936128 bytes exhausted (tried to allocate 335544320 bytes) in ...../smalot/pdfparser/src/Smalot/PdfParser/Font.php, line 223
Raising the PHP memory limit to 4 GB did not help either. I also tried lowering the value passed to setDecodeMemoryLimit, but still hit the same memory error. Setting the decode memory limit prevents the error only when I set it to 1000 or lower. So maybe it should be set in MB and not in bytes, or there is a bug in the code.
PDF input
test_pdf.pdf
Expected output & actual output
The parser should either parse the text from the PDF, or return an empty string or throw an exception, rather than fail with a memory error.
Code