docs: Tesseract use is not "automatic" for pdfs without additional config

edavidaja commented 5 years ago

This may very well be a "works on my machine" situation, but I'm finding that some config is required to get Tika to run OCR as part of the batch process, even after Tesseract is installed.

# contents of tika-config.xml
tika_config_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
'
tika_config <- tempfile(fileext = ".xml")
scanned_pdf <- tempfile(fileext = ".pdf")
zipped_pdf  <- tempfile(fileext = ".zip")

cat(tika_config_xml, file = tika_config)
download.file("https://jeroen.github.io/images/ocrscan.pdf", destfile = scanned_pdf, mode = "wb")
zip(zipfile = zipped_pdf, files = scanned_pdf)

# vanilla
rtika::tika("https://jeroen.github.io/images/testocr.png")
#> [1] "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThis is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n\n"

# ocr, no custom config
rtika::tika(scanned_pdf)
#> [1] "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSample Scanned Image\n\n\nSticky Note from Paperless\nThis is a sample page scanned at 200dpi and converted to PDF. It is not searchable. That is, all you see is the original image of the source document.\n\n\n\tPrint\n\tExit\n\n"

# custom tika config
rtika::tika(scanned_pdf, args = c("-c", tika_config))
#> [1] "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSample Scanned Image\n\n\nNo.\n\n1\n\nTHE SLEREXE COMPANY LIMITED\n\nSAPORS LANE - BOOLE - DORSET - BH 25 8ER\nTELEPHONE BOOLE (945 13) 51617 - TELEX 123456\n\nOur Ref. 350/PJC/EAC 18th January, 1972.\n\nDr. P.N. Cundall,\nMining Surveys Ltd.,\nHolroyd Road,\nReading,\n\nBerks.\n\nDear Pete,\n\nPermit me to introduce you to the facility of facsimile\ntransmission.\n\nIn facsimile a photocell is caused to perform a raster scan over\nthe subject copy. The variations of print density on the document\ncause the photocell to generate an analogous electrical video signal.\nThis signal is used to modulate a carrier, which is transmitted to a\nremote destination over a radio or cable communications link.\n\nAt the remote terminal, demodulation reconstructs the video\nsignal, which is used to modulate the density of print produced by a\nprinting device. This device is scanning in a raster scan synchronised\nwith that at the transmitting terminal. As a result, a facsimile\ncopy of the subject document is produced.\n\nProbably you have uses for this facility in your organisation.\n\nYours sincerely,\n\nThA.\nP.J. CROSS\nGroup Leader - Facsimile Research\n\nRegistered in England: No. 2088\nRogistered Office: 80 Vicara Lane, Ilford. Eseex.\n\n\nSticky Note from Paperless\nThis is a sample page scanned at 200dpi and converted to PDF. It is not searchable. That is, all you see is the original image of the source document.\n\n\n\tPrint\n\tExit\n\n"

Created on 2018-12-14 by the reprex package (v0.2.1)

goodmansasha commented 5 years ago

Thanks for this! I’ll sort this out for the next release.

Oneiricer commented 5 years ago

Hi, I would like to add that I am having the same problem as edavidaja. However, running his tika-config.xml code did not solve it for me. I am new to R and have no Java experience, but I suspect the issue is that I do not have Tika itself installed (I only have your rtika package installed).

What command or syntax do I enter to 'force' rtika to do OCR?

goodmansasha commented 5 years ago

After looking into this, I think OCR currently works only on Linux. I've also tried it on OS X and Windows.
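
To rule out a missing executable on a given platform, a quick look at the search path can help. This is a minimal sketch using only base R. The helper name `check_ocr_prereqs` is made up for illustration, and it assumes Tika locates Tesseract by looking for a `tesseract` executable on the PATH.

```r
# Hypothetical helper (not part of rtika): report whether the external
# programs needed for OCR can be found on the PATH.
check_ocr_prereqs <- function(programs = c("java", "tesseract")) {
  # Sys.which() typically returns "" for a program it cannot find
  paths <- unname(Sys.which(programs))
  data.frame(
    program = programs,
    found = nzchar(paths),
    path = paths,
    stringsAsFactors = FALSE
  )
}

print(check_ocr_prereqs())
```

If `tesseract` shows up as not found, adding its install directory to the PATH before starting R would be the first thing to try.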

@edavidaja I think this config file should be the default. Very large PDF files can apparently cause memory problems with Tesseract, which seems to be why OCR was switched off. However, OCR should be consistently turned on, so I've made PDF OCR the default behavior and also included a config file that can be used instead to turn OCR off. This is in the latest version of rtika on GitHub:

devtools::install_github("ropensci/rtika")
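
For switching between the two behaviours explicitly, one option is to generate the config from the XML posted at the top of this issue with the flag flipped. This is only a sketch: `write_pdf_config` is a hypothetical helper, and whether the config file bundled with rtika has this exact shape, or whether setting the flag to false disables OCR in every rtika version, is an assumption.

```r
# Hypothetical helper (not part of rtika): write a Tika config file that
# sets the PDF parser's extractInlineImages flag, and return its path.
write_pdf_config <- function(extract_inline_images = TRUE) {
  flag <- if (isTRUE(extract_inline_images)) "true" else "false"
  xml <- c(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<properties>',
    '  <parsers>',
    '    <parser class="org.apache.tika.parser.DefaultParser"/>',
    '    <parser class="org.apache.tika.parser.pdf.PDFParser">',
    '      <params>',
    sprintf('        <param name="extractInlineImages" type="bool">%s</param>', flag),
    '      </params>',
    '    </parser>',
    '  </parsers>',
    '</properties>'
  )
  path <- tempfile(fileext = ".xml")
  writeLines(xml, path)
  path
}

cfg_images_on  <- write_pdf_config(TRUE)
cfg_images_off <- write_pdf_config(FALSE)

# Usage as in the original report (needs rtika, Java and a PDF on disk):
# rtika::tika(scanned_pdf, args = c("-c", cfg_images_off))
```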

docs: Tesseract use is not "automatic" for pdfs without additional config #10