tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Request script for uploading files OCRed locally using Tesseract #97

Closed Shreeshrii closed 6 years ago

Shreeshrii commented 6 years ago

It would be useful to have a script that can upload locally available files to Wikisource. These could be:

The script should also support uploading only a subset of pages, and the uploaded pages should keep their page numbers from the original source file.
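
To illustrate the page-subset idea, here is a minimal Python sketch. The function names `parse_page_spec` and `wikisource_page_title`, the "3-5,9" range syntax, and the `Page:<file>/<n>` title scheme are all assumptions made for illustration (the page namespace is localized on some Wikisource sites); none of this is existing OCR4wikisource code.

```python
def parse_page_spec(spec, total_pages):
    """Expand a spec such as "3-5,9" into a sorted list of original page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
        else:
            start = end = int(part)
        # Reject ranges that fall outside the source file or are reversed.
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def wikisource_page_title(index_name, page_number):
    # Hypothetical title scheme: the number is the page's position in the
    # original source file, not its position within the selected subset.
    return f"Page:{index_name}/{page_number}"
```

With this approach, selecting "3-5,9" from a 10-page file yields titles ending in /3, /4, /5 and /9, so an upload of a subset does not renumber the pages.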

The 4.0.0alpha version of Tesseract (not yet officially released) has improved OCR for Indic languages. Hindi OCR is quite good, and Sanskrit is OK. There is also a Devanagari traineddata file that covers all languages written in Devanagari script, plus English.

The Tesseract output file name can follow any pattern. I usually give it the same name as the input file, but with a .txt extension. Tesseract does not accept PDF as input, so a PDF needs to be converted to TIFF first.

  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
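
As a rough sketch of how a script might assemble this command line for one page image, the Python below builds the argument list from the options shown in the usage text. The helper name `build_tesseract_command`, its parameters, and the example file name are hypothetical. It relies on Tesseract appending `.txt` to the output base, so passing the input name without its extension gives an output file named after the input.

```python
from pathlib import Path


def build_tesseract_command(image_path, lang="hin", psm=None, oem=None,
                            tessdata_dir=None):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    image = Path(image_path)
    # Tesseract adds ".txt" itself, so the output base is the input
    # path with its extension removed.
    output_base = image.with_suffix("")
    cmd = ["tesseract", str(image), str(output_base)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    cmd += ["-l", lang]  # languages can be combined, e.g. "hin+eng"
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    return cmd
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)`, assuming the `tesseract` binary and the requested traineddata are installed.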

Shreeshrii commented 6 years ago

Recent feedback is that Google Drive OCR works better than Tesseract for Indic languages.

Closing this issue.