File parsed returns unreadable text

edodije commented 5 years ago

Thank you for the awesome PdfParser library. It really helps me with my projects, and I have been using it quite often recently. However, I ran into a problem in my latest project: my PDF file was converted into unreadable text instead of plain text. A friend told me I should do something about the encoding, but I'm not really sure. I would be really glad if anybody could give me a hint about whether I am missing something or whether this is indeed a bug in the library. I've tried parsing the file as a whole and page by page, and neither worked.

Here is my code,

$PdfParser = new \Smalot\PdfParser\Parser();
$pdf = $PdfParser->parseFile($file);
$text = $pdf->getText();
echo $text;

And this is what it returns,

JHGSA IUYSHJG st GUH st GUH HUYGAH st JHGSA st st ttt t ss1 ss2 t 21666 !" #ssst #tt #t #ss1 #ss2 $# $ t%&t '(#tt 2$# t $$sst 2$ 2$'(#t 2$$sst t%&t ' # ttts !"#$ tt $$ss 2$$ss )tst t"+t t$ )tttst ,"1st$'(#t #s%( tt )t $t tst tss1 , )$ %' # ' # t$st HUYGAH IUYSHJG #$#$ %#$# % "t-." s )t$# )t$'(#t $s(%$" t%&t ttt 1st$%$ss tt tt%&t tt%&t ss/%$sst )$$sst )$'(#t %%'0t+#1 2sst t1 )tst )t$$ss tt%&t ttts )$# ,"t%&t ! "#$"#%!& !"#"#%!& !"#"#%!& ! ' ' ( )Y ! Z,-. JHGSA IUYSHJG GUH HUYGAH #sst t ss%tt# )ttt )t, )t )t($ '3st t % # ts%'1 )t*#t /%t$ss ttt%&t & tt(% & t-tts tt )tt 2$%$ss )tss2 )t' # t$st 2$tt%$ss 1st$$ss"tt t )t$$ss"tt t$%$ss t$$sst s$t%t )t%&t )$$ss )$%$ss )$tt%$ss $ t' ss ss/%t ' ss IUYSHJG& ' ss$ %%t1 tss4t tss5st tss5 tss5t tss51 tss52 ' # tst t' # ttts )t' st t$%" tt+t%" st%&t s&tt sss st sst st ! //$01$1% ) 2 (2) 2 () 31!#%%#% !"#"#%!& !"#"#%!& &$4$04%%%% ' ' ! &$4$04%%%% !"#$"#%!& !"#"#%!& !"#"#%!& ! ' ' ( )Y # Z,-.

k00ni commented 4 years ago

Can you upload the PDF here so we can use it to reproduce the error? It must be free to use, because we will use it in the test suite.

taherbth commented 1 year ago

Hello, I was able to parse PDFs using Smalot\PdfParser and would like to say it's awesome work you have done. The only problem I face is that it does not return the exact text when the language is Bengali. Is there any option to get the exact text for Bengali?

k00ni commented 1 year ago

@taherbth How is that related to this issue? If it is not, please open a new issue and provide some more information, such as example code and a PDF, plus details about your setup like the PHP version.

lgArlequin commented 10 months ago

Hi @k00ni, I have the same problem with one of my PDFs. At first I thought it was a multipage issue, but the problem persists even if I extract only one page.

My PHP version is 7.4.29. I attach an example (it is an extracted page since the original file has sensitive information).

The original is apparently generated with PDFlib ([Producer] => PDFlib+PDI 9.1.2p1 (C++ legacy/Win64)). If I edit it with macOS Preview ([Producer] => macOS Version 13.6.1 (Build 22G313) Quartz PDFContext), it parses fine, but I have no way to manually edit the originals produced by the system. For the example, I used the FPDI library ([Producer] => FPDF 1.86), and the result has the same parsing errors.

If you need any other information I can provide it. sample.pdf
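A minimal sketch of how the Producer metadata and the per-page text mentioned above could be printed side by side. It assumes the library's autoloader has already been loaded and that a file named sample.pdf sits in the working directory; the helpers `producerOf()` and `reportPdf()` are illustrative names, not part of pdfparser.

```php
<?php
/**
 * Illustrative helper (not part of pdfparser): pick the Producer entry
 * out of the metadata array, or report that it is missing.
 */
function producerOf(array $details): string
{
    $producer = $details['Producer'] ?? null;

    return is_string($producer) ? $producer : '(no Producer entry)';
}

/**
 * Print the producer and the text of each page separately, so it is easy
 * to see whether every page is affected or only some of them.
 */
function reportPdf(\Smalot\PdfParser\Parser $parser, string $file): void
{
    $pdf = $parser->parseFile($file);

    // Document metadata, including the tool that generated the file.
    echo 'Producer: ', producerOf($pdf->getDetails()), PHP_EOL;

    foreach ($pdf->getPages() as $index => $page) {
        echo '--- page ', $index + 1, ' ---', PHP_EOL;
        echo $page->getText(), PHP_EOL;
    }
}

// Assumption: the autoloader was loaded earlier and sample.pdf exists.
if (class_exists(\Smalot\PdfParser\Parser::class) && is_file('sample.pdf')) {
    reportPdf(new \Smalot\PdfParser\Parser(), 'sample.pdf');
}
```

Running this on the original file and on a copy re-saved by another tool would show whether the garbled output follows the producer or the page content.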

k00ni commented 10 months ago

@lgArlequin Which PDFParser version do you use? Our latest version is 2.8.0-RC1: https://github.com/smalot/pdfparser/releases/tag/v2.8.0-RC1

lgArlequin commented 10 months ago

@k00ni Yes, I switched to that version just in case and I still have the problem. I'm not using Composer; I installed it manually. I don't know if that makes any difference.
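For a manual install like this, a rough sketch of loading the library without Composer is shown below. It assumes the release archive was unpacked into a pdfparser directory next to the script (a hypothetical path) and relies on the alt_autoload.php-dist loader that the project README describes for non-Composer setups; `findPdfParserLoader()` is an illustrative helper, not a library function.

```php
<?php
/**
 * Illustrative helper (not part of pdfparser): return the path of the
 * alternative loader inside $dir, or null when it is not there.
 */
function findPdfParserLoader(string $dir): ?string
{
    $loader = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . 'alt_autoload.php-dist';

    return is_file($loader) ? $loader : null;
}

// Assumption: the release was unpacked into ./pdfparser (hypothetical path).
$loader = findPdfParserLoader(__DIR__ . '/pdfparser');

if ($loader === null) {
    echo 'Loader not found; check where the library was unpacked.', PHP_EOL;
} else {
    require_once $loader;
    // Confirm that the Parser class is actually reachable after loading.
    echo class_exists(\Smalot\PdfParser\Parser::class)
        ? 'Parser class is available.'
        : 'Loader was included, but the Parser class is still missing.';
    echo PHP_EOL;
}
```

This only confirms that the classes load; it does not by itself show whether the manual install affects the extracted text.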

smalot / pdfparser

File parsed returns unreadable text #246