smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

File parsed returns unreadable text #246

Open edodije opened 5 years ago

edodije commented 5 years ago

Thank you for the awesome PdfParser library. It really helps me with my projects, and I have been using it quite often recently. However, I ran into a problem in my latest project: my PDF file is converted into unreadable text instead of plain text. A friend told me I should do something about the encoding, but I'm not sure. I would be really glad if anybody could give me a hint about whether I am missing something or whether this is a bug in the library. I have tried parsing the file as a whole and page by page, and neither worked.

Here is my code,

$PdfParser = new \Smalot\PdfParser\Parser();
$pdf = $PdfParser->parseFile($file);
$text = $pdf->getText();
echo $text;

And this is what it returns,

                                  JHGSA IUYSHJG st  GUH  st  GUH HUYGAH st  JHGSA  st  st  tt t t  ss1 ss2 t   21666        ! "  # ssst   # t t  #t  #ss1  #ss2 $# $     t%&t '(# t t  2$# t   $$ss t 2$   2$'(# t 2$$ss t t%&t ' # tt ts !"#$  tt $$ss   2$$ss ) tst   t" + t  t$   ) t ttst  ," 1st$'(# t #s%(  t t   ) t $t   t st   tss1 ,  )$   %' #  ' # t$st HUYGAH IUYSHJG               #$#$ %#$# %  "t-." s   ) t$# ) t$'(# t $s(%$ " t%&t  tt t 1st$ %$ss  tt  tt%&t  tt%&t ss/%$ss t )$$ss t )$'(# t %%'0t+#1 2 sst t1 ) tst  ) t$$ss  t t%&t  t t ts )$# ,"  t%&t   ! "#$"#%!&   !"#"#%!& !"#"#%!&    ! ' ' ( )Y  !  Z ,-.                                     JHGSA IUYSHJG   GUH HUYGAH  # sst t ss% tt# ) t t t ) t, ) t   ) t ($  '3st  t   % #   ts%'1 ) t*#t  /% t$ss t tt%&t & tt(%  & t-t ts  tt ) tt  2$ %$ss ) tss2 ) t' # t$st 2$ t t%$ss 1st$$ss"tt  t    ) t$$ss"tt  t$ %$ss  t$$ss t s $t% t   ) t%&t )$$ss )$ %$ss )$ t t%$ss  $   t' ss ss/% t  ' ss IUYSHJG&  ' ss$ %%t1  tss4t  tss5st   tss5  tss5t  tss51  tss52 ' # tst  t' # tt ts  ) t' st  t$% "  tt+ t% " st%&t s&t t sss st sst  st ! //$01$1% ) 2 (2 ) 2 () 31!#%%#% !"#"#%!& !"#"#%!& &$4$04%%%%  ' ' ! &$4$04%%%%   !"#$"#%!&   !"#"#%!& !"#"#%!&    ! ' ' ( )Y  #  Z ,-.

k00ni commented 4 years ago

Can you upload the PDF here so we can use it to reproduce the error? It must be free to use, because we will use it in the test suite.

taherbth commented 1 year ago

Hello, I was able to parse a PDF using Smalot\PdfParser, and I would like to say it's awesome work. The only problem I face is that it does not return the exact text when the language is Bengali. Is there any option to get the exact text for Bengali?

k00ni commented 1 year ago

@taherbth How is that related to this issue? If it is not, please open a new issue and provide some more information, such as example code and a PDF, plus details about your setup like the PHP version.

lgArlequin commented 10 months ago

Hi @k00ni, I have the same problem with one of my PDFs. At first I thought it was a multipage issue, but if I extract only one page, the problem persists.

My PHP version is 7.4.29. I have attached an example (it is an extracted page, since the original file contains sensitive information).

The original was apparently generated with PDFlib ([Producer] => PDFlib+PDI 9.1.2p1 (C++ legacy/Win64)). If I edit it with macOS Preview ([Producer] => macOS Version 13.6.1 (Build 22G313) Quartz PDFContext), it parses fine, but I have no way to manually edit the originals produced by the system. For the example, I used the FPDI library ([Producer] => FPDF 1.86), and the resulting file shows the same parsing errors.

If you need any other information, I can provide it. Attachment: sample.pdf
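
The `[Producer] => ...` values quoted above look like the output of `print_r()` on the document details. For anyone trying to reproduce this, here is a minimal sketch of how the metadata and the per-page text could be collected for a report. It assumes a Composer install (the `vendor/autoload.php` path), and both the helper name `collectPdfDiagnostics` and the file name `sample.pdf` are placeholders, not part of the library.

```php
<?php
// Assumption: the library was installed with Composer; adjust this path otherwise.
require_once __DIR__ . '/vendor/autoload.php';

use Smalot\PdfParser\Parser;

/**
 * Hypothetical helper: gathers the document details and the text of each page.
 */
function collectPdfDiagnostics(string $file): array
{
    $parser = new Parser();
    $pdf = $parser->parseFile($file);

    $pages = [];
    foreach ($pdf->getPages() as $index => $page) {
        // Store pages under 1-based numbers so they match a PDF viewer.
        $pages[$index + 1] = $page->getText();
    }

    return [
        'details' => $pdf->getDetails(), // metadata such as Producer, when the PDF declares it
        'pages'   => $pages,
    ];
}

// 'sample.pdf' is a placeholder path.
$report = collectPdfDiagnostics('sample.pdf');
print_r($report['details']);
foreach ($report['pages'] as $number => $text) {
    echo "--- page {$number} ---\n", $text, "\n";
}
```

Comparing this output for the PDFlib original and for the copy saved by Preview should show whether the garbling already appears at page level and which producer wrote each file.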

k00ni commented 10 months ago

@lgArlequin Which PDFParser version do you use? Our latest version is 2.8.0-RC1: https://github.com/smalot/pdfparser/releases/tag/v2.8.0-RC1

lgArlequin commented 10 months ago

@k00ni Yes, I switched to that version just in case, and I still have the problem. I'm not using Composer; I installed the library manually. I don't know whether that changes anything.
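
For reference, a manual install along these lines might look like the sketch below. It relies on the `alt_autoload.php-dist` file that, to my knowledge, ships in the repository root for setups without Composer. The `pdfparser/` directory and `sample.pdf` are assumed locations, so adjust them to the actual layout.

```php
<?php
// Assumption: the release archive was unpacked into ./pdfparser next to this script.
// alt_autoload.php-dist is expected to load the library's classes without Composer.
require_once __DIR__ . '/pdfparser/alt_autoload.php-dist';

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('sample.pdf'); // placeholder path
echo $pdf->getText();
```

If the classes load and the text is still garbled, the install method is probably not the cause.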