Request script for uploading files OCRed locally using Tesseract

It would be useful to have a script that can upload locally available files to Wikisource. These could be:

files OCRed using do_ocr.py (Google Drive) and then post-processed or modified locally
files OCRed using tesseract-ocr

This should also work for a subset of pages, with each uploaded page keeping the page number it has in the original source file.
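A rough sketch of how the page subset could be handled, so that page numbers always refer to the original source file. Everything here is an assumption for illustration: the function names, the page spec syntax ("3-5,9"), the local file naming pattern (page_0004.txt) and the "Page:<index name>/<page number>" title form are not taken from the existing scripts.

```python
def parse_page_spec(spec):
    """Expand a spec such as "3-5,9" into a sorted list of unique page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid page range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def plan_uploads(spec, index_name, text_pattern="page_{:04d}.txt"):
    """Return (page number, local text file, wiki page title) for each page.

    The page number is never renumbered: page 9 of the source file stays
    page 9 on the wiki, even if only a few pages are uploaded.
    text_pattern and the title form are hypothetical and should be adapted.
    """
    return [
        (n, text_pattern.format(n), f"Page:{index_name}/{n}")
        for n in parse_page_spec(spec)
    ]
```

For example, plan_uploads("2,4", "Book.pdf") would plan uploads for pages 2 and 4 only, reading page_0002.txt and page_0004.txt and leaving every other page untouched.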

The 4.0.0alpha version of Tesseract (not yet officially released) has improved OCR for Indic languages: Hindi OCR is quite good and Sanskrit is OK. There is also a Devanagari traineddata that covers all languages written in Devanagari script, plus English.

The Tesseract output file name can follow any pattern. I usually give it the same name as the input file, but with a .txt extension. Tesseract does not accept PDF as input, so a PDF needs to be converted to TIFF images first.

  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
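Based on the usage above, a script could build the tesseract call for each page image roughly like this. This is only a sketch: the function names and the default language "hin+eng" are my assumptions, and it expects the PDF to have been split into TIFF images already. Tesseract appends .txt to the output base itself, so the output base is just the image name without its extension.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, lang="hin+eng", psm=None, tessdata_dir=None):
    """Return (argv list, expected text file path) for one page image."""
    image = Path(image)
    # Output base is the image path without its extension;
    # tesseract adds ".txt" to this base on its own.
    output_base = image.with_suffix("")
    cmd = ["tesseract", str(image), str(output_base)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    cmd += ["-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd, Path(str(output_base) + ".txt")


def run_ocr(image, **options):
    """Run tesseract on one image and return the path of the text file."""
    cmd, text_path = build_tesseract_command(image, **options)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return text_path
```

For example, run_ocr("book_0007.tif", psm=6) would call tesseract with -l hin+eng --psm 6 and return book_0007.txt, which can then be corrected locally before upload.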

tshrinivasan / OCR4wikisource #97