getText() returns text without any spaces when using a pdf from google docs

smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.

GNU Lesser General Public License v3.0


getText() returns text without any spaces when using a pdf from google docs #675

Open veepdotai opened 7 months ago

veepdotai commented 7 months ago

PHP Version: PHP 8.3.0 (cli) (built: Nov 24 2023 13:48:03) (NTS) Copyright (c) The PHP Group Zend Engine v4.3.0, Copyright (c) Zend Technologies with Zend OPcache v8.3.0, Copyright (c), by Zend Technologies
PDFParser Version: "smalot/pdfparser": "^2.8"

Description:

If I create a document in Google Docs and download it as a PDF from Google, I just get text without any spaces when parsing it with PdfParser.

PDF input

test-vgwg.pdf

Expected output & actual output

input in google docs : Very good work guys Thanks for everything.

download as pdf

I just do the code below

I get the following output:

Verygood workguys Thanksforeverything.

Code

use Smalot\PdfParser\Parser;

$file = "test-vgwg.pdf";
$parser = new Parser();
$pdf = $parser->parseFile($file);
$output = $pdf->getText();
var_dump($output);

GreyWyvern commented 7 months ago

This seems to have something to do with how I've used the negative of the current position factor in PDFObject.php.

$factorX = -$current_font_size * $current_position_tm['a'] - $current_font_size * $current_position_tm['i'];

When I change this to positive:

$factorX = $current_font_size * $current_position_tm['a'] + $current_font_size * $current_position_tm['i'];

Then the OP's sample document prints with the proper spaces. However, I think changing this line also breaks a lot of the unit tests. Somehow Google Docs is playing around with negative values that I haven't accounted for here. I'll have to look into it more.

LaRaye commented 6 months ago

I've been having the same issue. It looks like v2.9 does not address this bug. Any update?

lopatin96 commented 5 months ago

Can someone fix it please?

GreyWyvern commented 5 months ago

I'm looking at this and with the initial changes that read the OP's doc correctly, I can whittle it down to 3 unit test failures. I'm studying the failures to see if they are actually valid, or if more tweaking is needed.

lopatin96 commented 4 months ago

@GreyWyvern thank you very much for all your progress invested in the development of this project! I really appreciate it! Please tell me, are there any changes regarding this “bug”?

GreyWyvern commented 4 months ago

I'm still working on this. The fix involves using the matrix from cm commands as well as the Td and TD commands. Right now PdfParser only uses them from the Td and TD commands. However, while just inserting it gets me 98% of the way to a fix, there are two or three unit test PDFs where if I "fix" it for one, the other two break, and vice versa. 😩

Hopefully soon!
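
For readers unfamiliar with the cm operator: in PDF content streams it concatenates a six-number matrix [a b c d e f] onto the current transformation matrix, so any position derived from text operators ends up scaled, flipped, or shifted by it. The sketch below only illustrates that matrix product. It is not code from PdfParser or from the fix being worked on; `concatMatrix` and the sample values are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Multiply two PDF-style affine matrices, each given as [a, b, c, d, e, f].
 * Returns $m x $ctm, i.e. $m is applied first, then $ctm.
 * Hypothetical helper for illustration only; not part of PdfParser.
 */
function concatMatrix(array $m, array $ctm): array
{
    [$a1, $b1, $c1, $d1, $e1, $f1] = $m;
    [$a2, $b2, $c2, $d2, $e2, $f2] = $ctm;

    return [
        $a1 * $a2 + $b1 * $c2,        // a
        $a1 * $b2 + $b1 * $d2,        // b
        $c1 * $a2 + $d1 * $c2,        // c
        $c1 * $b2 + $d1 * $d2,        // d
        $e1 * $a2 + $f1 * $c2 + $e2,  // e (horizontal translation)
        $e1 * $b2 + $f1 * $d2 + $f2,  // f (vertical translation)
    ];
}

// Made-up values: a cm that scales by 0.75 and flips the y axis,
// concatenated onto an identity matrix.
$ctm = concatMatrix([0.75, 0, 0, -0.75, 0, 792], [1, 0, 0, 1, 0, 0]);

// A text position translated by (72, 100), then mapped through that CTM.
$effective = concatMatrix([1, 0, 0, 1, 72, 100], $ctm);
// $effective holds a = 0.75, d = -0.75, e = 54, f = 717
```

Note how the negative d component comes purely from the cm matrix. A parser that looks only at text-positioning operators would never see that sign, which is one plausible way a distance or direction check could go wrong.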

LaRaye commented 4 months ago

Got it. Will keep an eye out. Really appreciate all your efforts!!

lopatin96 commented 4 months ago

@GreyWyvern Got it. Thanks for the info and good luck solving this problem.

hgalt commented 3 months ago

@GreyWyvern is there any progress in the meantime? Due to this problem, PdfParser is currently useless for me! I have the feeling that not only spaces are missing, but also \t. Is there any workaround before a new version? Thanks a lot.

GreyWyvern commented 3 months ago

Can you try this fork, @hgalt ? https://github.com/GreyWyvern/pdfparser/tree/google-docs

Does it solve your problem? I've boiled down a HUGE amount of changes to the small edits in the fork above. I like the fix (if it solves your issue!) because it's simple, but it has the consequence of adding unnecessary tabs in several other files. Extra tabs is definitely a smaller issue than complete lack of spaces (extra whitespace can easily be stripped by the user), so it might be good to send this as-is as a PR.
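
As a rough illustration of how a user could strip the extra tabs mentioned above, here is a minimal sketch that post-processes the string returned by getText(). `normalizeWhitespace` is a hypothetical helper, not part of PdfParser, and it only removes surplus whitespace; it cannot restore spaces that are missing from the output.

```php
<?php

declare(strict_types=1);

/**
 * Collapse runs of tabs and spaces into one space and trim each line.
 * Hypothetical helper for illustration; not part of PdfParser.
 */
function normalizeWhitespace(string $text): string
{
    // Replace any run of spaces and/or tabs with a single space.
    $collapsed = preg_replace('/[ \t]+/', ' ', $text) ?? $text;

    // Remove spaces left at the start or end of each line.
    return preg_replace('/^ +| +$/m', '', $collapsed) ?? $collapsed;
}

$raw = "Very\tgood \t work\t\tguys \nThanks for everything.\t";
$clean = normalizeWhitespace($raw);
```

Line breaks are left untouched, so paragraph structure in the extracted text survives the cleanup.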

lopatin96 commented 3 months ago

@GreyWyvern For me, additional tabs are not a problem (I can then remove them using my code), it is much worse if there are no spaces between words. Please send it as a PR. And thanks for the work done!

LaRaye commented 3 months ago

@GreyWyvern Same here! Difficult to parse text with no whitespace. Extra tabs can be removed. Appreciate your hard work!

hgalt commented 3 months ago

@GreyWyvern I must be doing something wrong, because I cannot load the branch via Composer. I added "greywyvern/pdfparser": "google-docs" under require and get the error: Could not parse version constraint
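
A possible way around that error, as a sketch: Composer expects branch names to carry a `dev-` prefix, and a fork is usually pulled in through a `vcs` repository entry. The fragment below assumes the fork's own composer.json still declares the package name smalot/pdfparser, which has not been verified here.

```json
{
    "repositories": [
        {
            "type": "vcs",
            "url": "https://github.com/GreyWyvern/pdfparser"
        }
    ],
    "require": {
        "smalot/pdfparser": "dev-google-docs"
    }
}
```

If the fork has renamed the package, the key under require would need to match whatever name its composer.json declares.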

hgalt commented 3 months ago

@GreyWyvern got it working without Composer. This fork works for me! Great job, thanks for your effort.

gilney-canaltelecom commented 2 weeks ago

Hi, any news about this?

I'm having the same problem with the following pdf, made using WPS Office:

small_pdf.pdf

I tried the google-docs branch and the output still comes without spaces ;/