keensoft / alfresco-simple-ocr

Simple OCR action for Alfresco

OCR not working properly on tiff and jpeg files #31

Closed ayushiagrahari closed 7 years ago

ayushiagrahari commented 7 years ago

I am trying to run OCR on TIFF and JPEG files, but it fails with the errors "Couldn't find trailer dictionary", "Couldn't read xref table", and "exception Failure("Error: pdfinfo could not determine number of pages. Check the pdf input file.\n")". The transformation from JPEG or TIFF to PDF itself works properly, and the resulting PDF file is visible on the Alfresco Share page.

ayushiagrahari commented 7 years ago

Please help me with this.

mikelasla commented 7 years ago

Hi! It looks like a pdfsandwich issue. What OS are you using, and which software versions?

ayushiagrahari commented 7 years ago

Hello, I am using Ubuntu 16.04 and pdfsandwich version 0.1.6.

ayushiagrahari commented 7 years ago

Tesseract 3.04, Leptonica 1.73, Unpaper 1.6

angelborroy-ks commented 7 years ago

Does pdfsandwich work when you run it directly from the command line, without using Alfresco?

ayushiagrahari commented 7 years ago

Since pdfsandwich expects only PDF files, I first convert the TIFF and JPEG files to PDF using the convert command and then run pdfsandwich on the result. But the transform method used in ExtractOCR.java is not able to transform the TIFF and JPEG images into PDF files. So pdfsandwich is actually not working properly on TIFF and JPEG files.
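
For reference, the manual two-step workflow described in this comment could be scripted roughly as follows. This is only a sketch: the function names `build_ocr_commands` and `run_ocr` are made up for illustration, it assumes ImageMagick's `convert` and `pdfsandwich` are both on the PATH, and it is not how ExtractOCR.java works internally.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_commands(image_path):
    """Return the two command lines: image -> PDF, then OCR on that PDF.

    Hypothetical helper for illustration only.
    """
    image = Path(image_path)
    pdf = image.with_suffix(".pdf")  # e.g. scan.tiff -> scan.pdf
    return [
        ["convert", str(image), str(pdf)],  # ImageMagick: image to PDF
        ["pdfsandwich", str(pdf)],          # OCR the intermediate PDF
    ]


def run_ocr(image_path):
    """Run both steps in order, stopping at the first failure."""
    for argv in build_ocr_commands(image_path):
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} is not on PATH")
        subprocess.run(argv, check=True)
```

Calling `run_ocr("scan.tiff")` would first write `scan.pdf` next to the image and then hand that PDF, not the original image, to pdfsandwich.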

angelborroy-ks commented 7 years ago

Yes, you need to use the Alfresco Transformation action (from JPG to PDF) before using the OCR action.
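
The errors quoted at the top of this thread ("Couldn't find trailer dictionary", "Couldn't read xref table") are consistent with a non-PDF file being handed to the OCR step. A quick way to check what a file really is before running OCR is to look at its first bytes, since PDF files begin with the marker `%PDF-`. The helper name `looks_like_pdf` below is made up for this sketch, and the check only inspects the header; it does not validate the rest of the file.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker."""
    with open(path, "rb") as fh:
        return fh.read(5) == b"%PDF-"
```

If this returns False for the document the OCR action is applied to, the file is most likely still an image and needs the transformation step first.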