Junk characters in text retrieved using the PDF util class method getText

vinsguru / pdf-util

PDF Compare Utility

Junk characters in text retrieved using the PDF util class method getText #9

Open madhur-dumane opened 6 years ago

madhur-dumane commented 6 years ago

I am using the code below to read the whole text of each PDF into a string and then compare the two strings.
String str = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads" + "\\" + prereport + ".pdf");
String str1 = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads" + "\\" + postreport + ".pdf");
System.out.println("Check the text from both PDFs : " + str.equalsIgnoreCase(str1));
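As a side note on the path handling above, the file locations can be built without hand-written backslashes. This is a minimal sketch, not part of pdf-util: `ReportPaths` and `downloadedPdf` are hypothetical names, and it assumes the `user.home` system property points at the same folder as `C:\Users\<user.name>`.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ReportPaths {
    // Hypothetical helper: resolves <user home>/Downloads/<reportName>.pdf.
    // Assumes user.home is the same directory as C:\Users\<user.name>.
    static Path downloadedPdf(String reportName) {
        return Paths.get(System.getProperty("user.home"), "Downloads", reportName + ".pdf");
    }

    public static void main(String[] args) {
        Path pre = downloadedPdf("prereport");
        Path post = downloadedPdf("postreport");
        System.out.println(pre);
        System.out.println(post);
        // Each path could then be handed to pdfutil.getText(pre.toString()).
    }
}
```

Building the path from segments avoids the escaping mistakes that are easy to make with `\\` inside string literals.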

When I retrieve the PDF text into a string, I get characters like the ones below instead of the actual text. jlkqeiv qobka obmloq _v ^``lrkq mêÉé~êÉÇ=Ñçê ^g^o ag _ìííÉêÑáÉäÇ jçåíÜäó=qêÉåÇ=oÉéçêí=ÖÉåÉê~íÉÇ=çå lÅíçÄÉê=NPI=OMNT=~í=PWMR=~ã=EbpqF

vinsguru commented 6 years ago

Can you share the PDFs?

madhur-dumane commented 6 years ago

They contain confidential data, so we cannot share them.

madhur-dumane commented 6 years ago

[image: pdf]

Above is the first page of the PDF. The text retrieved for it looks like this:

Length 18332 Text : jlkqeiv qobka obmloq _v ^``lrkq mêÉé~êÉÇ=Ñçê ^g^o ag _ìííÉêÑáÉäÇ jçåíÜäó=qêÉåÇ=oÉéçêí=ÖÉåÉê~íÉÇ=çå lÅíçÄÉê=NPI=OMNT=~í=PWMR=~ã=EbpqF AJA

The last 3 lines are retrieved correctly, but the "Monthly Trend Report" text comes out as junk characters. I have erased some data from the image. Similarly, the next page has text like "Page3 of 8", which is also retrieved as junk in the same way. Note: I have replaced the generated time with an empty string, that is, after retrieving the text I convert the "Generated----EST" string to an empty string.
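The timestamp replacement described in the note could be sketched as follows. This is only an illustration that assumes the text has been extracted correctly: the class name `TimestampStripper`, the timestamp format, and the two sample strings are assumptions, not taken from the actual reports.

```java
import java.util.regex.Pattern;

class TimestampStripper {
    // Assumed format: "generated on <date> at <time> (EST)".
    // The real report may use a different wording.
    private static final Pattern GENERATED =
            Pattern.compile("generated on .*? \\(EST\\)", Pattern.CASE_INSENSITIVE);

    // Removes the generation timestamp so that it does not affect the comparison.
    static String stripGenerated(String text) {
        return GENERATED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Illustrative strings only; they differ in nothing but the timestamp.
        String pre = "Monthly Trend Report generated on October 13, 2017 at 3:05 am (EST) AJA";
        String post = "Monthly Trend Report generated on October 14, 2017 at 9:41 am (EST) AJA";
        System.out.println(stripGenerated(pre).equalsIgnoreCase(stripGenerated(post))); // prints "true"
    }
}
```

If the timestamp itself is extracted as junk characters, a pattern like this will not match it, so the extraction problem has to be solved first.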

Please help us to resolve this issue.

madhur-dumane commented 6 years ago

Any solution?

vinsguru commented 6 years ago

Unfortunately I am unable to replicate the issue. What OS do you use?

madhur-dumane commented 6 years ago

Windows 7

madhur-dumane commented 6 years ago

Any solution for this?

vinsguru commented 6 years ago

I tried with different PDFs, but I am finding it very difficult to replicate the issue. That's why I could not fix this. pdf-util uses Apache PDFBox internally, so the problem could be in PDFBox as well.

madhur-dumane commented 6 years ago

I am checking whether I can share the PDF with you. But if the issue is in PDFBox, can it be fixed or not?

madhur-dumane commented 6 years ago

Is there any other way to do PDF comparison?

madhur-dumane commented 6 years ago

After more analysis, we found that if the font of the text in the PDF is [PDType1CFont SUBSET+CZGA00T1U00037], the text is not retrieved correctly. The PDF document we are trying to read has some custom fonts embedded in it. Can you check now what can be done?
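One observation that fits the custom-font finding: in the sample quoted earlier in this thread, the plain ASCII part of the junk text appears to be the real text with every character code moved up by 29. That is the kind of output one would expect when a subset font uses its own encoding and the extractor has no usable mapping back to Unicode. The sketch below only demonstrates the pattern on that sample. The class name `ShiftedGlyphDemo` and the constant `SHIFT` are hypothetical, the offset is inferred from this one sample, and the accented characters in the sample do not decode with this simple arithmetic, so this is a diagnostic aid rather than a fix.

```java
class ShiftedGlyphDemo {
    // Offset inferred from the ASCII portion of the sample in this thread;
    // an assumption, not a general rule for other PDFs or fonts.
    static final int SHIFT = 29;

    // Moves every character except a real space back by SHIFT code points.
    static String unshift(String garbled) {
        StringBuilder out = new StringBuilder();
        for (char c : garbled.toCharArray()) {
            out.append(c == ' ' ? c : (char) (c - SHIFT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First words of the sample text reported above.
        System.out.println(unshift("jlkqeiv qobka obmloq _v ^``lrkq"));
        // prints "MONTHLY TREND REPORT BY ACCOUNT"
    }
}
```

A consistent offset like this suggests the bytes are being read correctly and only the character mapping for that one font is off.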

madhur-dumane commented 6 years ago

Can you please check on this?

madhur-dumane commented 6 years ago

Can you please check my latest comment on this issue and let us know if you have any solution.

Thanks in advance,
Madhuri

On 21-Nov-2017 1:00 AM, "Vinoth Selvaraj" notifications@github.com wrote:

I tried with different pdfs. I am finding it very difficult to replicate! That's why I could not fix this. pdf-util internally uses apache pdf-box. The problem could be at pdfbox as well.
