PDF comparison failing though both PDFs are the same because of non-sequential retrieval of text

madhur-dumane commented 6 years ago

I am using the code below to read the full text of each PDF into a string and then compare the two strings.

    String str = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads" + "\\" + prereport + ".pdf");
    String str1 = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads" + "\\" + postreport + ".pdf");
    System.out.println("Check the text from both PDFs : " + str.equalsIgnoreCase(str1));

Sometimes the text is not retrieved sequentially. For example, suppose the text retrieved from the first PDF is:

    $497.10 0.51 - Investment Cash

while the text retrieved from the second PDF is:

    $497.10 -0.51 Investment Cash

One string contains `0.51 -` and the other contains `-0.51`, so the PDF comparison fails.

[Screenshot: pdfcontent]

Please see the screenshot above for how this looks in both actual PDFs. Ideally the text should be retrieved sequentially and the PDF comparison should succeed. Please help me resolve this issue.
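To see exactly where two extracted strings diverge, a small diagnostic can help. The following is a minimal sketch using only the Java standard library; `TextDiff`, `normalize`, and `firstDifference` are hypothetical names for illustration and are not part of pdf-util. It collapses whitespace and reports the index of the first character that differs, ignoring case in roughly the way `equalsIgnoreCase` does.

```java
// Hypothetical helper for illustration; not part of pdf-util.
final class TextDiff {

    /** Collapses each run of whitespace to a single space and trims both ends. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /**
     * Returns the index of the first character that differs (ignoring case),
     * or -1 if the two strings are equal ignoring case.
     */
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (Character.toLowerCase(a.charAt(i)) != Character.toLowerCase(b.charAt(i))) {
                return i;
            }
        }
        // All shared characters match; the strings differ only if one is longer.
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // The two example strings from this report.
        String first = "$497.10 0.51 - Investment Cash";
        String second = "$497.10 -0.51 Investment Cash";
        int at = firstDifference(normalize(first), normalize(second));
        System.out.println("First difference at index: " + at);
    }
}
```

Printing a few characters around the reported index from each string makes an ordering difference such as `0.51 -` versus `-0.51` easy to spot.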

vinsguru commented 6 years ago

Use version 0.0.3 from Maven.

I have modified the library so that you can supply your own stripper using PDFBox. The sample code below should retrieve the text based on its position.

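    // Create a PDFBox text stripper and have it order text by position on the page
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);

    // Tell pdf-util to use this stripper for text extraction
    pdfutil.useStripper(stripper);

    pdfutil.getText(filePath);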
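To illustrate why position-based ordering makes the two extractions agree, here is a toy sketch. It is not how PDFBox is implemented; the `PositionSortSketch` and `Fragment` names and all coordinates are invented for the example. The same four fragments are stored in a different order in each "PDF", so reading them in stored order gives different strings, while ordering them by position gives the same string for both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Toy model only: not PDFBox's implementation; coordinates are invented.
final class PositionSortSketch {

    /** A piece of text together with the (assumed) page coordinates where it is drawn. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the order given, standing in for the order stored in the file. */
    static String inStreamOrder(List<Fragment> fragments) {
        return fragments.stream().map(f -> f.text).collect(Collectors.joining(" "));
    }

    /** Joins fragments ordered by y first, then by x (left to right within a line). */
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(copy);
    }

    public static void main(String[] args) {
        Fragment amount = new Fragment("$497.10", 50, 100);
        Fragment minus = new Fragment("-", 120, 100);
        Fragment value = new Fragment("0.51", 126, 100);
        Fragment label = new Fragment("Investment Cash", 200, 100);

        // Same fragments at the same positions, stored in a different order.
        List<Fragment> pdfA = Arrays.asList(amount, value, minus, label);
        List<Fragment> pdfB = Arrays.asList(amount, minus, value, label);

        System.out.println(inStreamOrder(pdfA).equals(inStreamOrder(pdfB)));       // stored order differs
        System.out.println(sortedByPosition(pdfA).equals(sortedByPosition(pdfB))); // position order agrees
    }
}
```

A real extractor also decides where to put spaces based on the gaps between glyphs, which this sketch ignores by always joining with a single space.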

madhur-dumane commented 6 years ago

I have not changed the Maven version, but with the above code it is working. Thank you so much for your help.

vinsguru commented 6 years ago

Glad that it worked

vinsguru / pdf-util, issue #8