smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Trouble parsing a document #643

Open audouts opened 1 year ago

audouts commented 1 year ago

Description:

I'm attempting to parse a document that is primarily tables. Most of the text is a jumble with white space and newlines missing in various places.

PDF input

I would like to provide a sample PDF if there's a way for me to change the text to remove private information.

Expected output & actual output

Here's an example of one of the headers:

Expected

ITEM ID    DESCRIPTION    SERVICE PERIOD    EXTENDED PRICE    TAX    LINE TOTAL

Actual

ITEM ID DESCRIPTIONSERVICE PERIOD       EXTENDED  PRICETAXLINE TOTAL

Code

    $config = new \Smalot\PdfParser\Config();

    // Initialize and load PDF Parser library 
    $parser = new \Smalot\PdfParser\Parser([], $config); 

    // Parse pdf file using Parser library 
    $pdf = $parser->parseFile($Filename); 

    // Extract text from PDF 
    $textContent = $pdf->getText();

    print($textContent);

k00ni commented 1 year ago

Please try #634 and see if it helps with your problem.

viennv commented 1 year ago

Hi, I have tried #634, but it parses my document incorrectly: many spaces between words are removed. Sample file: ORDEN-de-19-de-septiembre-de-2005-por-la-que-se-desarrollan-determinados-aspectos-del-Plan-Integral-para-la-Prevencion-Seguimiento-y-Control-del-Absentismo-Escolar.pdf

k00ni commented 1 year ago

CC @GreyWyvern It might be relevant for #634?

GreyWyvern commented 1 year ago

At first glance I don't really know what's going on here. The current v2.7.0 adds spaces to the output, but they are inserted incorrectly; the "words" of the output don't match the spacing of the "words" in the document. My PR #634 removes all these spaces, but I guess it doesn't know where to put the right ones back in.

Here's the actual text from the document (copy-pasted from Adobe Acrobat):

ORDEN de 19 de septiembre de 2005, por la
que se desarrollan determinados aspectos del Plan
Integral para la Prevenci‘n, Seguimiento y Control del
Absentismo Escolar.

v2.7.0 outputs this:

ORDE Nde 19 de septiembr ede 2005 ,po rla
qu ese desarrolla ndeterminado saspecto sde lPlan
Integra lpar ala Prevenci‘n ,Seguimient oyContro ldel
Absentism oEscolar.

It seems to be taking the last letter of most words and tacking them onto the beginning of the next word.

#634 outputs this:

ORDENde19deseptiembrede2005,porla
quesedesarrollandeterminadosaspectosdelPlan
IntegralparalaPrevenci‘n,SeguimientoyControldel
AbsentismoEscolar.

GreyWyvern commented 1 year ago

On further investigation, I think this is just a very tightly positioned PDF, and PdfParser is not equipped to handle that level of precision. Here is an excerpt from the document stream:

327.333 0 Td
(Sevilla)Tj
24.5168 0 Td
(,)Tj
4.84658 0 Td
(1)Tj
5.31519 0 Td
(7)Tj
7.50419 0 Td
(d)Tj
4.76959 0 Td
(e)Tj
6.47458 0 Td
(octubr)Tj
23.8656 0 Td
(e)Tj
6.47461 0 Td
(2005)Tj

327.333 0 Td positions the "cursor" where 'Sevilla' is printed. Then it moves 24.5168 units to the right and prints a ','. v2.7.0 sees a large enough movement to insert a space here, even though it shouldn't because the width of the text 'Sevilla' should be subtracted from the movement. OTOH, #634 makes a guess as to how wide 'Sevilla' is and is accurate enough to prompt it to remove the space.

Proceeding, after the ',' is printed, we move an additional 4.84658 units to the right. In this case, this movement value is too small for both v2.7.0 and #634 to register it as a space. The fact it is a comma and has very little horizontal width makes this worse. You can see that the movement after printing the '1' and before printing the '7' is even larger than the movement between the comma and the '1'. The final text should be: Sevilla, 17 de octubre 2005.
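To make the point above concrete, here is a small PHP sketch (not PdfParser code) that parses the excerpt and groups the raw `Td` movements by whether the expected text has a space at that position. The names `parseRuns` and `classifyMovements` are made up for this illustration, and the regex only copes with the simple shape of this excerpt, not with general PDF syntax.

```php
<?php
// Stream excerpt copied from the comment above.
$stream = <<<'PDF'
327.333 0 Td
(Sevilla)Tj
24.5168 0 Td
(,)Tj
4.84658 0 Td
(1)Tj
5.31519 0 Td
(7)Tj
7.50419 0 Td
(d)Tj
4.76959 0 Td
(e)Tj
6.47458 0 Td
(octubr)Tj
23.8656 0 Td
(e)Tj
6.47461 0 Td
(2005)Tj
PDF;

// The text this excerpt is supposed to yield, as stated above.
$expected = 'Sevilla, 17 de octubre 2005';

// Hypothetical helper: pair every "x y Td" with the "(text)Tj" that follows it.
function parseRuns(string $stream): array
{
    preg_match_all('/([-\d.]+)\s+[-\d.]+\s+Td\s*\((.*?)\)\s*Tj/s', $stream, $matches, PREG_SET_ORDER);

    return array_map(fn ($m) => ['dx' => (float) $m[1], 'text' => $m[2]], $matches);
}

// Hypothetical helper: sort the Td movements into those that should and
// should not produce a space, by walking the expected text run by run.
function classifyMovements(array $runs, string $expected): array
{
    $groups = ['space' => [], 'noSpace' => []];
    $pos = 0;
    foreach ($runs as $i => $run) {
        $wantsSpace = ($expected[$pos] ?? '') === ' ';
        if ($wantsSpace) {
            $pos++; // step over the space in the expected text
        }
        if ($i > 0) { // the first Td only positions the line
            $groups[$wantsSpace ? 'space' : 'noSpace'][] = $run['dx'];
        }
        $pos += strlen($run['text']);
    }

    return $groups;
}

$groups = classifyMovements(parseRuns($stream), $expected);
$smallestSpaceMove = min($groups['space']);
$largestNoSpaceMove = max($groups['noSpace']);

printf("smallest movement that needs a space:  %.5f\n", $smallestSpaceMove);
printf("largest movement that needs no space:  %.5f\n", $largestNoSpaceMove);
```

For this excerpt the smallest movement that needs a space is 4.84658 (after the comma) and the largest that must not produce one is 24.5168 (after 'Sevilla'), so no single cutoff on the raw movement can get every gap right.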

Where v2.7.0 makes no guess at text width, and #634 makes an attempt at a guess, I feel like the real solution to this is to use the actual character widths from the font if they are available. In this way, PdfParser might be able to more accurately judge where to insert spaces in documents as tightly positioned as this one.
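A rough sketch of that idea in PHP, purely as an illustration and not how PdfParser or #634 works: subtract the width of the previously printed text from each `Td` movement and insert a space only if the leftover gap is a sizeable fraction of a space width. The glyph widths and the space width below are invented numbers chosen to be roughly consistent with the excerpt; the document's real font metrics are not known here. `textWidth`, `joinRuns`, and the 0.5 ratio are likewise assumptions, and the sketch ignores font size, character spacing, word spacing, and horizontal scaling.

```php
<?php
// HYPOTHETICAL glyph advance widths in text-space units. These values are
// invented for the example and are NOT measured from the document's font.
const GLYPH_WIDTHS = [
    'S' => 5.3, 'e' => 4.3, 'v' => 4.3, 'i' => 2.1, 'l' => 2.1, 'a' => 4.3,
    ',' => 2.4, 'd' => 4.8, 'o' => 4.8, 'c' => 4.3, 't' => 2.6, 'u' => 4.8,
    'b' => 4.8, 'r' => 2.5, '0' => 5.3, '1' => 5.3, '2' => 5.3, '5' => 5.3,
    '7' => 5.3,
];
const SPACE_WIDTH = 2.4; // also an assumed value

// Sum the advance widths of a run, using $fallback for unknown glyphs.
function textWidth(string $text, array $widths, float $fallback): float
{
    $total = 0.0;
    foreach (str_split($text) as $ch) {
        $total += $widths[$ch] ?? $fallback;
    }

    return $total;
}

// Join [dx, text] runs, adding a space when the movement exceeds the width of
// the previous run by more than $ratio times the width of a space.
function joinRuns(array $runs, array $widths, float $spaceWidth, float $ratio = 0.5): string
{
    $out = '';
    $prev = null;
    foreach ($runs as [$dx, $text]) {
        if ($prev !== null) {
            $gap = $dx - textWidth($prev, $widths, $spaceWidth);
            if ($gap > $ratio * $spaceWidth) {
                $out .= ' ';
            }
        }
        $out .= $text;
        $prev = $text;
    }

    return $out;
}

// The Td x-offsets and Tj strings from the stream excerpt above.
$runs = [
    [327.333, 'Sevilla'], [24.5168, ','], [4.84658, '1'], [5.31519, '7'],
    [7.50419, 'd'], [4.76959, 'e'], [6.47458, 'octubr'], [23.8656, 'e'],
    [6.47461, '2005'],
];

echo joinRuns($runs, GLYPH_WIDTHS, SPACE_WIDTH), PHP_EOL;
```

With these assumed widths the excerpt comes out as "Sevilla, 17 de octubre 2005". With real font data the widths would have to come from the font's width table and be scaled by the font size, and the ratio would probably need tuning.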

I think it's outside of the scope of #634, personally.

pawel-omniaz commented 11 months ago

Hi, I have the same problem on 2.7.0: sometimes spaces are missing and sometimes they are in the wrong places (between letters within one word). I tried 2.8.0-RC2, but it is even worse there (more spaces missing).

k00ni commented 11 months ago

CC @GreyWyvern

GreyWyvern commented 11 months ago

Do you have a sample PDF we could use to test?

pawel-omniaz commented 11 months ago

1.pdf: I have problems with this one. I want to extract text from the tables on pages 2 and 3 (on the right side). "SEGMENT1" is extracted as "S E G M E N T 1". Some product names (those with "T" in the name?) have unexpected spaces in them, e.g. "WINSTON" becomes "WINST ON" after extraction.

GreyWyvern commented 11 months ago

Using 2.7.0 extracts this PDF as an explosion of spaces:

D at a
ob ow i ąz y w a n i a :
07 .11.20 23    
L i c z b a s e gm e n t ów :  2    
PA P I E R O S Y _L 
P r z y g ot o w a ł a:  K or n e l i a
Tyb u r s k a
N a   k r a w a c i e   -   
c y g a r a 

While 2.8.0-RC does somewhat better:

Data
obowiązywania:
07.11.2023
Liczba segmentów: 2
PAPIEROSY_L
Przygotowała: Kornelia
Tyburska
Nakrawacie-
cygara

However, it suffers in the same way as @viennv's document when the text gets very small and packed in those lists you mention on pages 2 and 3. 2.7.0 puts a space between every letter, while 2.8.0-RC removes almost all of the spaces, even ones that probably should remain. I believe this is only because the font is so tiny, but for sure some more tweaking will be necessary in the future.