tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

ddjvu: [1-15114] IFFByteStream not ready for reading chunk. #55

jayantanth closed this issue 8 years ago

jayantanth commented 8 years ago

Installed version 1.50 with bash ./setup.sh, then ran do_ocr.py. Output:

nasirkhan@nasir:~/wiki/OCR4wikisource$ python do_ocr.py
INFO:main:Running do_ocr.py 1.50
INFO:root:Operating System = "Ubuntu 14.04.3 LTS"

INFO:main:URL = https://bn.wikisource.org/wiki/image:OCR-test-1.djvu
INFO:main:Columns = 1
INFO:main:Wiki Username = nasirkhan
INFO:main:Wiki Password = Not logging the password
INFO:main:Wiki Source Language Code = bn
INFO:main:Keep Temp folder in Google Drive = yes
INFO:main:Original URL = https://bn.wikisource.org/wiki/image:OCR-test-1.djvu
INFO:main:File Name = image:OCR-test-1.djvu
INFO:main:File Type = djvu
INFO:main:Created Temp folder OCR-image:OCR-test-1.djvu-temp-2016-02-13-17-58-41

Downloading the file image:OCR-test-1.djvu

INFO:main:Downloading the file image:OCR-test-1.djvu
INFO:urllib3.connectionpool:Starting new HTTPS connection (1): bn.wikisource.org
[################################] 11/11 - 00:00:00
INFO:main:Download Completed
INFO:main:Found a djvu file. Converting to PDF file.

ddjvu: [1-15114] IFFByteStream not ready for reading chunk.
ddjvu: [1-15114] IFFByteStream not ready for reading chunk.
ddjvu: Cannot decode document.
INFO:main:Running ddjvu --format=pdf "image:OCR-test-1.djvu" "image:OCR-test-1".pdf
INFO:main:Aligining the Pages of PDF file.

INFO:main:Running mutool poster -x 1 "image:OCR-test-1.pdf" currentfile.pdf
error: cannot open image:OCR-test-1.pdf
error: cannot load document 'image:OCR-test-1.pdf'
uncaught exception: cannot load document 'image:OCR-test-1.pdf'
INFO:main:Spliting the PDF into single pages.

Error: Unable to find file.
Error: Failed to open PDF file: currentfile.pdf
Done. Input errors, so no output created.
INFO:main:Running pdftk currentfile.pdf burst
INFO:main:Joining the PDF files ...

INFO:main: Creating a folder in Google Drive to upload files. Folder Name : OCR-image:OCR-test-1.djvu-temp-2016-02-13-17-58-41

INFO:main:Running gdmkdir.py "OCR-image:OCR-test-1.djvu-temp-2016-02-13-17-58-41" | tee folder_in_google_drive.log
id: 0Bzu8oam42f2mY3hONEV3RzFyTGc
drive view: https://drive.google.com/drive/folders/0Bzu8oam42f2mY3hONEV3RzFyTGc
folder view: https://docs.google.com/folderview?id=0Bzu8oam42f2mY3hONEV3RzFyTGc&usp=drivesdk
INFO:main:Split the text files to sync with the original images
INFO:main:Joining text files based on Column No
INFO:main: Moving all temp files to OCR-image:OCR-test-1.djvu-temp-2016-02-13-17-58-41

INFO:main:Running mv folder_.log currentfile.pdf docdata.txt pg.pdf page* txt* "OCR-image:OCR-test-1.djvu-temp-2016-02-13-17-58-41"
mv: cannot stat ‘currentfile.pdf’: No such file or directory
mv: cannot stat ‘docdata.txt’: No such file or directory
mv: cannot stat ‘pg.pdf’: No such file or directory
mv: cannot stat ‘page_’: No such file or directory
mv: cannot stat ‘txt’: No such file or directory
INFO:main:Merged all OCRed files to all_text_for_image:OCR-test-1.djvu.txt
INFO:main:Making a copy of all text files to text-for-image:OCR-test-1.djvu
INFO:main:Running cp .txt text-for-image:OCR-test-1.djvu
INFO:main:

Done. Check the text files start with text_forpage
INFO:main:

The PDF files and result text files are equval. Now running the mediawiki_uploader.py script

INFO:main:Running do_ocr.py 1.50
INFO:root:Operating system = "Ubuntu 14.04.3 LTS"

INFO:main:URL = https://bn.wikisource.org/wiki/image:OCR-test-1.djvu
INFO:main:Columns = 1
INFO:main:Wiki Username = nasirkhan
INFO:main:Wiki Password = Not logging the password
INFO:main:Wiki Source Language Code = bn
INFO:main:Edit Summary = testing...
INFO:main:File Name = image:OCR-test-1.djvu
INFO:main:File Type = djvu
INFO:main:Original URL = https://bn.wikisource.org/wiki/image:OCR-test-1.djvu
INFO:main:Wiki URL = https://bn.wikisource.org/w/api.php
INFO:root:Login Status = True
INFO:root:

Logged in to https://bn.wikisource.org
INFO:root:Checking for bot access rights
INFO:root:The user nasirkhan does not have bot access
INFO:root: Done. Uploaded all text files to wiki source

mv: cannot stat ‘upload-*’: No such file or directory
nasirkhan@nasir:~/wiki/OCR4wikisource$

http://paste.ubuntu.com/15035323/

nasirkhan commented 8 years ago

There was an issue with how file_url was configured. I updated the config and the script is now working fine.
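The log above shows ddjvu failing to decode the file right after a download that was reported as successful. That is consistent with the downloaded file not being a DjVu document at all (for example, an HTML wiki page saved under a .djvu name). A minimal sketch of a pre-check that could catch this before conversion, assuming DjVu files begin with the bytes AT&TFORM; the function name looks_like_djvu is hypothetical and not part of do_ocr.py:

```python
# Assumption: a DjVu file starts with the magic bytes "AT&T" followed by "FORM".
DJVU_MAGIC = b"AT&TFORM"


def looks_like_djvu(path):
    """Return True if the file at `path` starts with the DjVu magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(DJVU_MAGIC)) == DJVU_MAGIC
```

Calling this on the downloaded file before running ddjvu would let a script stop early with a clear message instead of continuing through the later mutool and pdftk steps.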

tshrinivasan commented 8 years ago

Check the URL.

Always use the direct URL of the file, not the URL of its wiki page.

It should start with upload.commons. etc.
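The advice above can be sketched as a small check run before the download. This is only an illustration under stated assumptions: that direct file URLs are served from a host whose name begins with "upload." (such as upload.wikimedia.org) and end in a file extension, while wiki page URLs like the one in the log do not. The function name and the extension list are hypothetical, not part of OCR4wikisource.

```python
from urllib.parse import urlparse

# Assumption: only these file types are relevant here.
FILE_EXTENSIONS = (".djvu", ".pdf")


def looks_like_direct_file_url(url):
    """Heuristically tell a direct file URL from a wiki page URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Assumption: direct files live on an "upload.*" host.
    on_upload_host = host.startswith("upload.")
    has_file_extension = parsed.path.lower().endswith(FILE_EXTENSIONS)
    return on_upload_host and has_file_extension
```

With this heuristic, the URL from the log (https://bn.wikisource.org/wiki/image:OCR-test-1.djvu) is rejected because its host is the wiki itself, whereas a URL on an upload.* host ending in .djvu is accepted.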