modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

unexpected space #352

Open ymouhat opened 1 week ago

ymouhat commented 1 week ago

Hello, we have identified the following behavior when parsing this PDF: BALGRP.pdf

(two screenshots attached)

Could you please look into it?

Many thanks in advance

modesty commented 1 week ago

The -m option turns on PROCESS_MERGE_BROKEN_TEXT_BLOCKS. Have you tried it?

ymouhat commented 1 week ago

Hi @modesty, thank you. The developer will look into it, and we will let you know how it goes.

JordiSAGE commented 1 week ago

Hello @modesty, I tried setting the parameter programmatically, both before and after creating the PDFParser object, but unfortunately it didn't work.

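    // Attempt 1: set the flag before constructing the parser.
    process.env.PROCESS_MERGE_BROKEN_TEXT_BLOCKS = 'true';
    const pdfParser = new PDFParser(this, true);
    // Attempt 2: set the flag again after construction.
    process.env.PROCESS_MERGE_BROKEN_TEXT_BLOCKS = 'true';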

'General balance (Provisional) 6/17/2024 Company : ATP2 ATP2 - ATP Samples Currency : USD Legislation : USA USA Balance to 12/31/2023 Txs on 1/1/2024 to 12/31/2024 Balance to 12/31/2024 Account no Account heading Debit Credit Debit Credit Debit Credit 10100 Bank Account 1,070.00 1,070.00 12100 Accounts Receivable 2,740.00 1,070.00 1,670.00 12400 Shipped Not Invoiced Clearing 600.00 600.00 17000 FA - Construction in Progress 1,000.00 1,000.00 20100 Accounts Payable 3,000.00 3,000.00 25100 Sales Tax Payable 140.00 140.00 4 1100 Sales Revenue 2,600.00 2,600.00 41900 Sales Revenue - Clearing 600.00 600.00 70900 Miscellaneous Expense 2,000.00 2,000.00 Balance total 4,810.00 4,810.00 Totals management 2,600.00 2,600.00 Off-balance-sheet total COM PANY TOTAL ATP2 ATP2 - ATP Samples 7,410.00 7,410.00 Page 1 of 1'

FYI @ymouhat
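To narrow down where splits such as "4 1100" and "COM PANY" in the output above come from, it may help to look at the gaps between neighbouring text blocks in the parsed JSON. The following is only a diagnostic sketch. It assumes the output has the shape Texts[] with x, y, w and R[].T, that T may be URI-encoded, and that x and w share a unit, which may not hold for real pdf2json output. blockGaps, decodeText, the tolerance value and the sample coordinates are hypothetical and not part of the pdf2json API.

```javascript
// Diagnostic sketch: report the horizontal gap between neighbouring text blocks.
// Assumed input shape: page.Texts[] items with x, y, w and R[].T.
// blockGaps, decodeText and the tolerance value are hypothetical, not pdf2json API.

function decodeText(t) {
  // T may be URI-encoded; fall back to the raw string if decoding fails.
  try {
    return decodeURIComponent(t);
  } catch (err) {
    return t;
  }
}

function blockGaps(page, yTolerance = 0.1) {
  const blocks = page.Texts.map((b) => ({
    x: b.x,
    y: b.y,
    w: b.w,
    text: b.R.map((r) => decodeText(r.T)).join(""),
  })).sort((a, b) => a.y - b.y);

  // Group blocks whose y is within the tolerance of the line's first block.
  const lines = [];
  for (const block of blocks) {
    const line = lines[lines.length - 1];
    if (line && Math.abs(block.y - line[0].y) <= yTolerance) {
      line.push(block);
    } else {
      lines.push([block]);
    }
  }

  // Within each line, measure the distance from one block's right edge
  // to the next block's left edge.
  const gaps = [];
  for (const line of lines) {
    line.sort((a, b) => a.x - b.x);
    for (let i = 0; i + 1 < line.length; i++) {
      const left = line[i];
      const right = line[i + 1];
      gaps.push({ left: left.text, right: right.text, gap: right.x - (left.x + left.w) });
    }
  }
  return gaps;
}

// Made-up coordinates, for illustration only.
const sample = {
  Texts: [
    { x: 10, y: 5, w: 3, R: [{ T: "COM" }] },
    { x: 13.1, y: 5, w: 4, R: [{ T: "PANY" }] },
    { x: 20, y: 5, w: 5, R: [{ T: "TOTAL" }] },
  ],
};
console.table(blockGaps(sample));
```

A very small gap between two blocks that visually belong to one word would suggest the PDF itself stores the word in pieces, in which case a merge step has to decide whether to join them.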

JordiSAGE commented 1 week ago

Hello again @modesty, I also tried using the getMergedTextBlocksIfNeeded method, but it seems it is no longer available on the PDFParser object. (screenshot attached)

FYI @ymouhat

JordiSAGE commented 1 week ago

Hi @modesty, I've integrated the pdf2json source code into the project. It merges some blocks correctly but not others. For instance, it removes the space before ATP2 in 'Company :ATP2', and in 'Account noAccount heading' it removes what is supposed to be a large space, yet it handles 'COMPANY' correctly. Maybe the space distance threshold calculation in the areAdjacentBlocks method from pdf2json is not working properly.

https://github.com/modesty/pdf2json/blob/f7c473772bec66616b6099bb38939f1f1f2be8ef/lib/pdffont.js#L145

General balance (Provisional) 6/17/2024 Company :ATP2 ATP2 - ATP Samples Currency :USD Legislation :USA USABalance to 12/31/2023 Txs on 1/1/2024 to 12/31/2024Balance to 12/31/2024 Account noAccount heading DebitCredit DebitCredit DebitCredit 10100Bank Account 1,070.00 1,070.00 12100Accounts Receivable 2,740.00 1,070.00 1,670.00 12400Shipped Not Invoiced Clearing 600.00 600.00 17000FA - Construction in Progress 1,000.00 1,000.00 20100Accounts Payable 3,000.00 3,000.00 25100Sales Tax Payable 140.00 140.00 41100 Sales Revenue 2,600.00 2,600.00 41900Sales Revenue - Clearing 600.00 600.00 70900Miscellaneous Expense 2,000.00 2,000.00 Balance total 4,810.00 4,810.00 Totals management 2,600.00 2,600.00 Off-balance-sheet total COMPANY TOTAL ATP2ATP2 - ATP Samples 7,410.00 7,410.00 Page 1 of 1
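The symptom above, where 'COMPANY' is joined correctly but 'Company :ATP2' loses its space, is what a merge threshold that is too large for some blocks would produce. A minimal sketch of a gap-threshold merge follows. It is not the areAdjacentBlocks implementation from pdf2json. The block fields (x, w, text), the spaceWidth parameter, the mergeLine name and the coordinates are all assumptions made for the example.

```javascript
// Sketch of a gap-threshold merge for blocks already sorted left to right on one line.
// This is NOT pdf2json's areAdjacentBlocks; block fields (x, w, text) and the
// spaceWidth parameter are assumptions made for the example.

function mergeLine(blocks, spaceWidth) {
  const merged = [];
  for (const block of blocks) {
    const prev = merged[merged.length - 1];
    const gap = prev ? block.x - (prev.x + prev.w) : Infinity;
    if (prev && gap < spaceWidth) {
      // Gap is narrower than a space: treat it as one broken word and join without a space.
      prev.text += block.text;
      prev.w = block.x + block.w - prev.x;
    } else {
      // Gap is at least a space wide: keep the block separate (copy to avoid mutating input).
      merged.push({ ...block });
    }
  }
  return merged.map((b) => b.text).join(" ");
}

// Made-up coordinates, for illustration only.
const line = [
  { x: 0, w: 3, text: "COM" },
  { x: 3.05, w: 4, text: "PANY" },
  { x: 8, w: 5, text: "TOTAL" },
];
console.log(mergeLine(line, 0.3)); // "COMPANY TOTAL"
console.log(mergeLine(line, 1.5)); // "COMPANYTOTAL"
```

With the smaller threshold only the broken word is rejoined. With the larger one the real space before TOTAL is swallowed as well, which resembles the 'Company :ATP2' case.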

(two screenshots attached)

FYI: @ymouhat