tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Handling Google Drive upload error #38

Open jayantanth opened 8 years ago

jayantanth commented 8 years ago

Today I am processing a 723-page book, and the same two pages get stuck every time.

=========ERROR===========

INFO:main:Missing page_00099.txt
INFO:main:page_00099.pdf should be reuploaded
INFO:main:Missing page_00267.txt
INFO:main:page_00267.pdf should be reuploaded
INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

Then I tried uploading the file manually. The error occurs in Google Drive itself when it converts page_00099.pdf to text.

[Image: screenshot from 2016-02-07 15:59:45]

So the remaining issue is that, without completing this job, I cannot run mediawiki_upload.py, because no "text_for_page" files are available. Every time it gets stuck at page 99 and the message appears: "Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files". I know this is not directly an issue with your script; it is a Google Drive issue.

tshrinivasan commented 8 years ago

Can you try manually uploading the missing PDF files and getting the text? Save the text under the matching name (for example, page_00099.txt) in the same folder.

Then run do_ocr.py to check that there are no errors.

jayantanth commented 8 years ago

I tried it manually, but Google Drive itself returns an error, as shown in the image above.

tshrinivasan commented 8 years ago

In the same folder, run the following commands:

touch page_00099.txt
touch page_00099.upload

This makes do_ocr.py skip that page when it checks for missing files.

Do the same for every missing file, so that each one has an empty .txt and .upload file.
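For a book with many failed pages, the same workaround could be scripted instead of typing touch for each page. The following is only a minimal sketch and is not part of OCR4wikisource: it assumes the files follow the page_NNNNN.pdf / .txt / .upload naming seen in the log above, and the function name create_placeholders is hypothetical.

```python
from pathlib import Path


def create_placeholders(folder="."):
    """Create empty .txt and .upload files for each page PDF that has no .txt.

    Assumes the page_NNNNN.pdf naming shown in the log output above.
    Returns the names of the PDFs that received placeholders.
    """
    created = []
    for pdf in sorted(Path(folder).glob("page_*.pdf")):
        txt = pdf.with_suffix(".txt")
        if txt.exists():
            # This page already has text, so leave it alone.
            continue
        # Same effect as: touch page_NNNNN.txt; touch page_NNNNN.upload
        txt.touch()
        pdf.with_suffix(".upload").touch()
        created.append(pdf.name)
    return created


if __name__ == "__main__":
    for name in create_placeholders():
        print("Created empty placeholders for", name)
```

Run it from the folder that holds the split PDF pages. The placeholder text files are empty, so those pages will have no OCR text and would need to be typed or corrected by hand later.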