tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Prohibit do_ocr.py when ONLY .txt files remain in the root folder #62

Open ravidreams opened 8 years ago

ravidreams commented 8 years ago

Prohibit do_ocr.py from running when ONLY .txt files remain and no .upload or .log files are available in the root folder. We had one instance where a user ran do_ocr.py again after his connection was lost midway. He had already been uploading OCRed pages to WS and had some text files remaining, so he ended up overwriting already existing pages. Strangely, Google returned gibberish this time: https://ta.wikisource.org/w/index.php?title=Page%3A%E0%AE%A4%E0%AE%A9%E0%AE%BF%E0%AE%B5%E0%AF%80%E0%AE%9F%E0%AF%81.pdf%2F79&type=revision&diff=97935&oldid=96015

The user should instead be prompted to run mediawiki_uploader.py after making sure that these pages are missing from the WS index page.
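A minimal sketch of the proposed guard, assuming the check only needs to look at file extensions in the root folder. The function name `should_block_ocr` and the message text are hypothetical and not part of the existing scripts. It treats "only .txt files remain" as: at least one .txt file is present and no .upload or .log file is.

```python
from pathlib import Path


def should_block_ocr(folder="."):
    """Return True when .txt files exist but no .upload or .log files do.

    Hypothetical helper: the name and the exact rule are assumptions
    based on the behaviour requested in this issue.
    """
    suffixes = {p.suffix.lower() for p in Path(folder).iterdir() if p.is_file()}
    has_txt = ".txt" in suffixes
    has_markers = bool(suffixes & {".upload", ".log"})
    return has_txt and not has_markers


if __name__ == "__main__":
    if should_block_ocr("."):
        # Point the user at the uploader instead of redoing the OCR.
        print("Only .txt files found. Check which pages are missing on "
              "Wikisource, then run: python mediawiki_uploader.py")
        raise SystemExit(1)
```

Such a check could be called at the start of do_ocr.py so the script exits before any page is sent for OCR again.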

Shreeshrii commented 7 years ago

I had the same problem just now. After do_ocr.py succeeded, it started uploading files to Wikisource. That process failed after 316 pages. I ran do_ocr.py again, thinking it would resume from where it had stopped. Instead it started the OCR from page 1 again.

What is the recommended workflow in such cases?

tshrinivasan commented 7 years ago

@Shreeshrii just run as

python mediawiki_uploader.py

This will do the upload work only.

Will fix the issue detailed by @ravidreams soon.

Shreeshrii commented 7 years ago

@tshrinivasan Thanks!

Will the upload work if I create the OCRed files locally on my PC using Tesseract?

tshrinivasan commented 7 years ago

Currently no.

But I can provide that as a separate script.

Raise a new issue with your detailed requirements, along with Tesseract's output filename patterns.

Does Tesseract support the sa language?
