Closed code-mage-com closed 6 months ago
Nice catch! :D
I think it would be better to strictly whitelist only hexadecimal digits rather than just excluding newlines and carriage returns:
$value = preg_replace('/[^0-9a-f]/i', '', $value);
But both approaches definitely fix the issue with this file.
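To make the difference between the two approaches concrete, here is a minimal sketch. The sample value is made up for illustration, and the newline-only pattern `/[\r\n]/` is an assumption about what the originally proposed fix looked like. PDF hexadecimal strings may contain other whitespace (spaces, tabs) that should also be ignored, which is where the whitelist is more robust:

```php
<?php
// Made-up sample: hex digits for "Hi!" with a CRLF and a space inside.
$value = "48\r\n69 21";

// Approach 1 (assumed pattern): remove only carriage returns and newlines.
$newlinesOnly = preg_replace('/[\r\n]/', '', $value);

// Approach 2: keep hexadecimal digits only, dropping everything else.
$hexOnly = preg_replace('/[^0-9a-f]/i', '', $value);

var_dump($newlinesOnly); // string(7) "4869 21"  (the space survives)
var_dump($hexOnly);      // string(6) "486921"
```

For a file that only wraps lines, both results are identical, so either pattern fixes the reported case.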
This should be a simple change, and if you can add your PDF to the test folder, you have a great file for a unit test. Do you want to create a PR for this fix?
@code-mage-com, may we add your test document meta1.pdf to the PdfParser repo so I can create a test case?
Hi Brian,
sorry, have had no time to look at the PR (don't think I'll be able to get time to make one)...
As to the PDF document, no problem adding that to the test case.
Best Regards, Dmitri
On Wed, 6 Mar 2024 at 15:28, Brian Huisman wrote:
> @code-mage-com, may we add your test document meta1.pdf to the PdfParser repo so I can create a test case?
Description:
When parsing attached PDF file, the keywords include weird "japanese" characters such as "挀爀椀猀琀椀愀渀愀ⰰ 最攀猀豈ⰰ 甀漀ݠؐȁـڐȁذڐ۰ذذ۰ۀؐ݀ؐˀȁذ۰niglietti"
After debugging, I found that the cause is in ElementHexa::decode(), where the value parameter receives a hex string that is split across several lines of 80 characters each, like so:
I managed to get it working correctly by adding a preg_replace call before the initial length calculation here (to remove carriage returns and newlines, if any, before parsing):
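The effect of the line breaks can be illustrated with a small, self-contained sketch. This is not the library's actual `ElementHexa::decode()`: `decodeHexUtf16()` is a hypothetical helper that assumes the hex string encodes UTF-16BE code units and that the mbstring extension is available. The sample value is made up and wrapped after two code units rather than after 80 characters, to keep it short.

```php
<?php
// Hypothetical helper for illustration only, not PdfParser code.
function decodeHexUtf16(string $value): string
{
    // Drop everything that is not a hex digit (line breaks, spaces, ...).
    $clean = preg_replace('/[^0-9a-f]/i', '', $value);

    // In a PDF hex string, a missing final digit is treated as 0.
    if (strlen($clean) % 2 !== 0) {
        $clean .= '0';
    }

    // Turn the hex digits into bytes, then UTF-16BE into UTF-8.
    return mb_convert_encoding(hex2bin($clean), 'UTF-8', 'UTF-16BE');
}

// "cris" as UTF-16BE hex, wrapped with a CRLF after two code units.
$wrapped = "00630072\r\n00690073";

// Reading the raw value in fixed 4-character groups drifts after the
// line break: the groups are '0063', '0072', "\r\n00", '6900', '73'.
$rawGroups = str_split($wrapped, 4);

echo decodeHexUtf16($wrapped), PHP_EOL; // cris
```

Note the group `6900` in the raw split: read as a code unit it is U+6900, a CJK character, instead of U+0069 (`i`). That is the same kind of shift visible in the reported keywords.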
PDF input
meta1.pdf
Expected output & actual output
Expected output:
Actual output (without the preg_replace line):
Code