Open madhur-dumane opened 6 years ago
Can you share the PDFs?
They contain confidential data, so we cannot share them.
Above is the first page of the PDF, and the corresponding retrieved text looks like this:
Length 18332 Text : jlkqeiv qobka obmloq _v ^``lrkq mêÉé~êÉÇ=Ñçê ^g^o ag _ìííÉêÑáÉäÇ jçåíÜäó=qêÉåÇ=oÉéçêí=ÖÉåÉê~íÉÇ=çå lÅíçÄÉê=NPI=OMNT=~í=PWMR=~ã=EbpqF AJA
The last 3 lines are retrieved correctly, but "Monthly Trend Report" comes out as junk characters. I have erased some data from the image. Similarly, the next page has text like "Page3 of 8", which is also retrieved as junk in the same way. Note: I have replaced the generated time with a null string, i.e. while retrieving I convert the string "Generated----EST" to null.
Please help us to resolve this issue.
Any solution?
Unfortunately I am unable to replicate the issue. What OS do you use?
Windows 7
Any solution on this?
I tried with different PDFs, but I am finding it very difficult to replicate the issue, which is why I could not fix it. pdf-util internally uses Apache PDFBox, so the problem could be in PDFBox as well.
I am checking whether I can share the PDF with you. But if the issue is in PDFBox, can it be fixed or not?
Is there any other way to do PDF comparison?
After more analysis, we found that if the font of the text in the PDF is [PDType1CFont SUBSET+CZGA00T1U00037], the text is not retrieved correctly. The PDF document we are trying to read has some custom fonts embedded in it. Can you check what can be done now?
Can you please check on this?
Hi
Can you please check my latest comment on this issue and let us know if you have any solution?
Thanks in advance, Madhuri
I am using the code below to get the whole PDF text of each file into a string and then compare the two strings.
String str = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads" + "\\" + prereport + ".pdf");
String str1 = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads" + "\\" + postreport + ".pdf");
System.out.println("Check the text from both PDFs : " + str.equalsIgnoreCase(str1));
When I retrieve the PDF text into a string, I get the following kind of characters in the retrieved string instead of the actual text: jlkqeiv qobka obmloq _v ^``lrkq mêÉé~êÉÇ=Ñçê ^g^o ag _ìííÉêÑáÉäÇ jçåíÜäó=qêÉåÇ=oÉéçêí=ÖÉåÉê~íÉÇ=çå lÅíçÄÉê=NPI=OMNT=~í=PWMR=~ã=EbpqF
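For anyone debugging a comparison like the one above, here is a minimal sketch in plain Java (no pdf-util or PDFBox calls) of two small helpers. It does not fix the font problem. It only builds the Downloads path without hand-written backslashes and reports where two extracted strings first differ, which is more informative than a bare true/false. The class name `PdfTextCompare`, both method names, and the use of the `user.home` property in place of `"C:\\Users\\" + user.name` are assumptions made for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; not part of pdf-util or PDFBox.
class PdfTextCompare {

    // Builds <user.home>/Downloads/<reportName>.pdf using the platform separator.
    // Assumption: user.home points at the same folder as C:\Users\<user.name>.
    static Path downloadedReport(String reportName) {
        return Paths.get(System.getProperty("user.home"), "Downloads", reportName + ".pdf");
    }

    // Folds case the way a case-insensitive comparison typically does.
    private static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Returns the index of the first character that differs (ignoring case),
    // or -1 if the two strings are equal ignoring case.
    static int firstMismatch(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        for (int i = 0; i < limit; i++) {
            if (fold(a.charAt(i)) != fold(b.charAt(i))) {
                return i;
            }
        }
        // One string is a prefix of the other: the mismatch starts where the shorter ends.
        return a.length() == b.length() ? -1 : limit;
    }

    public static void main(String[] args) {
        // Placeholder strings standing in for the two pdfutil.getText(...) results.
        String pre = "Monthly Trend Report";
        String post = "MONTHLY TREND REPORT";
        System.out.println("Path to read: " + downloadedReport("prereport"));
        System.out.println("First mismatch index: " + firstMismatch(pre, post));
    }
}
```

In real use, the two placeholder strings would be replaced by the text returned for each PDF, and the path would be passed as `downloadedReport(prereport).toString()`.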