Trying to export image to PDF,image OK, searchable text NO

r1me / TTesseractOCR4

Object Pascal binding for tesseract-ocr - an optical character recognition engine

MIT License

Trying to export image to PDF,image OK, searchable text NO #3

Closed: ericduarte closed this issue 6 years ago

ericduarte commented 6 years ago

Hello

I tried to export an image to PDF. The PDF is generated, but the text is not searchable.

r1me commented 6 years ago

I assume you've downloaded the language data, provided the path to it, and specified the language in the call to Tesseract.Initialize (the default is English). If the trained data doesn't match the language you want to OCR, the PDF file will be created, but there will be no text to search in it.

Please try to OCR your image with the examples\delphi-console-simple example and post your results (is any text returned in the console?).
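
Before calling Tesseract.Initialize, one quick sanity check is to confirm that the language file is actually where the data path points. The sketch below relies on the usual Tesseract naming convention of `<lang>.traineddata`; `LanguageDataExists` is a hypothetical helper written for illustration and is not part of TTesseractOCR4. The `tessdata` folder and `eng` language are the values used in this thread.

```pascal
program CheckLangData;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not part of TTesseractOCR4): returns True when
// <DataPath>/<Lang>.traineddata exists on disk.
function LanguageDataExists(const DataPath, Lang: string): Boolean;
begin
  Result := FileExists(IncludeTrailingPathDelimiter(DataPath) + Lang + '.traineddata');
end;

begin
  // 'tessdata' and 'eng' mirror the arguments passed to Tesseract.Initialize.
  if LanguageDataExists('tessdata', 'eng') then
    WriteLn('Found eng.traineddata in the tessdata folder.')
  else
    WriteLn('eng.traineddata was not found in the tessdata folder.');
end.
```

This only confirms that the file is present. It cannot tell whether the trained data matches the language of the scanned text, so running the console example on the same image is still the more telling test.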

ericduarte commented 6 years ago

Thanks for your attention,

the examples\delphi-console-simple example works fine but examples\delphi-console-pdfconvert does not.

    if Tesseract.Initialize('tessdata\', 'eng') then
    begin
      inputFileName := 'samples\multi-page.tif';
      outputFileName := 'multi-page.pdf';

      if Tesseract.CreatePDF(inputFileName, outputFileName) then
      begin
        WriteLn('PDF was saved successfully to ' + outputFileName);
        ReadLn;
      end;
    end;

r1me commented 6 years ago

Please attach the input image and the output PDF. PS: Don't copy-paste the example source code; include your actual code (if needed).

ericduarte commented 6 years ago

I'm using Delphi 10 Seattle, and I made a small change in tesseractocr.consts.pas. I added this:

    {$IFDEF VER300}
    type
      PUTF8Char = PAnsiChar;
    {$ENDIF}

and changed this:

    {$IFDEF Use_CPPAN_Binaries}
      libleptonica = {$IFDEF Linux}'libpvt.cppan.demo.danbloomberg.leptonica-1.74.4.so'{$ELSE}'liblept-5.dll'{$ENDIF};
      libtesseract = {$IFDEF Linux}'libpvt.cppan.demo.google.tesseract.libtesseract-master.so'{$ELSE}'libtesseract-4.dll'{$ENDIF};
    {$ELSE}

I compressed the image so that I could attach it:

multi-page.zip multi-page.pdf

r1me commented 6 years ago

I compared the multi-page.pdf that I'm getting with yours, and I can say without doubt that the issue is in Tesseract. OCR export to PDF is still under development in Tesseract; the latest branch even crashes while trying to save to a PDF file. I've made a copy of the build dated 07-08-2017, which seems to create searchable PDF files:

search_pdf

Thanks for finding this issue. I will monitor Tesseract development and update the precompiled binaries on my server once the issue is fixed in Tesseract.

ericduarte commented 6 years ago

Worked

Thanks for your help.