tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

mutool-error #87

Closed tha-uzhavan closed 6 years ago

tha-uzhavan commented 8 years ago

In version 1.54, a mutool error appears, so currentfile.pdf cannot be created. Kindly refer to the screenshot on Commons.

https://upload.wikimedia.org/wikipedia/commons/3/31/Wikisource-bn-OCR4wikisource-error-mutool-poster-x-1.png

The tested file is https://bn.wikisource.org/wiki/%E0%A6%A8%E0%A6%BF%E0%A6%B0%E0%A7%8D%E0%A6%98%E0%A6%A3%E0%A7%8D%E0%A6%9F:%E0%A6%9C%E0%A7%80%E0%A6%AC%E0%A6%A8%E0%A6%BE%E0%A6%A8%E0%A6%A8%E0%A7%8D%E0%A6%A6_%E0%A6%B8%E0%A6%AE%E0%A6%97%E0%A7%8D%E0%A6%B0_%28%E0%A6%9A%E0%A6%A4%E0%A7%81%E0%A6%B0%E0%A7%8D%E0%A6%A5_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1%29.pdf

It can be downloaded from the following link: https://upload.wikimedia.org/wikisource/bn/b/bf/%E0%A6%9C%E0%A7%80%E0%A6%AC%E0%A6%A8%E0%A6%BE%E0%A6%A8%E0%A6%A8%E0%A7%8D%E0%A6%A6_%E0%A6%B8%E0%A6%AE%E0%A6%97%E0%A7%8D%E0%A6%B0_%28%E0%A6%9A%E0%A6%A4%E0%A7%81%E0%A6%B0%E0%A7%8D%E0%A6%A5_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1%29.pdf

tshrinivasan commented 8 years ago

Share the url of the pdf file here.

Let me try this.

tha-uzhavan commented 8 years ago

https://bn.wikisource.org/wiki/%E0%A6%A8%E0%A6%BF%E0%A6%B0%E0%A7%8D%E0%A6%98%E0%A6%A3%E0%A7%8D%E0%A6%9F:%E0%A6%9C%E0%A7%80%E0%A6%AC%E0%A6%A8%E0%A6%BE%E0%A6%A8%E0%A6%A8%E0%A7%8D%E0%A6%A6_%E0%A6%B8%E0%A6%AE%E0%A6%97%E0%A7%8D%E0%A6%B0_%28%E0%A6%AA%E0%A7%8D%E0%A6%B0%E0%A6%A5%E0%A6%AE_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1%29.pdf

https://bn.wikisource.org/wiki/%E0%A6%A8%E0%A6%BF%E0%A6%B0%E0%A7%8D%E0%A6%98%E0%A6%A3%E0%A7%8D%E0%A6%9F:%E0%A6%9C%E0%A7%80%E0%A6%AC%E0%A6%A8%E0%A6%BE%E0%A6%A8%E0%A6%A8%E0%A7%8D%E0%A6%A6_%E0%A6%B8%E0%A6%AE%E0%A6%97%E0%A7%8D%E0%A6%B0_%28%E0%A6%A4%E0%A7%83%E0%A6%A4%E0%A7%80%E0%A6%AF%E0%A6%BC_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1%29.pdf

tha-uzhavan commented 8 years ago

A DjVu file does not cause this issue. See নির্ঘণ্ট:প্রবাসী (পঞ্চম ভাগ).djvu

jayantanth commented 8 years ago

jayanta@jayanta-Inspiron-3541:~/OCR$ python do_ocr.py
INFO:main:Running do_ocr.py 1.54
INFO:root:Operating System = "Ubuntu 14.04.4 LTS"

INFO:main:URL = https://upload.wikimedia.org/wikisource/bn/c/c9/%E0%A6%9C%E0%A7%80%E0%A6%AC%E0%A6%A8%E0%A6%BE%E0%A6%A8%E0%A6%A8%E0%A7%8D%E0%A6%A6_%E0%A6%B8%E0%A6%AE%E0%A6%97%E0%A7%8D%E0%A6%B0_%28%E0%A6%AA%E0%A7%8D%E0%A6%B0%E0%A6%A5%E0%A6%AE_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1%29.pdf INFO:main:Columns = 1 INFO:main:Wiki Username = JoyBot INFO:main:Wiki Password = Not logging the password INFO:main:Wiki Source Language Code = bn INFO:main:Keep Temp folder in Google Drive = no INFO:main:Original URL = https://upload.wikimedia.org/wikisource/bn/c/c9/জীবনানন্দ_সমগ্র_(প্রথম_খণ্ড).pdf INFO:main:File Name = জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf INFO:main:File Type = pdf INFO:main:Created Temp folder OCR-জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf-temp-2016-05-23-22-31-29 INFO:root:জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf Already Exists. Skipping the download. INFO:main:Aligining the Pages of PDF file.

INFO:main:Running mutool poster -x 1 "জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf" currentfile.pdf
error: expected object number
error: expected object number
error: Repair failed already - not trying again
error: cannot parse object (1402 0 R)
error: cannot load object (1402 0 R) into cache
uncaught exception: cannot load object (1402 0 R) into cache
INFO:main:Spliting the PDF into single pages.

Error: Unable to find file.
Error: Failed to open PDF file: currentfile.pdf
Done. Input errors, so no output created.
INFO:main:Running pdftk currentfile.pdf burst
INFO:main:Joining the PDF files ...

INFO:main: Creating a folder in Google Drive to upload files. Folder Name : OCR-জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf-temp-2016-05-23-22-31-29

INFO:main:Running gdmkdir.py "OCR-জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf-temp-2016-05-23-22-31-29" | tee folder_in_google_drive.log id: 0B1OpcVV-_vRSX1Ewc29kWUdSQ0E drive view: https://drive.google.com/drive/folders/0B1OpcVV-_vRSX1Ewc29kWUdSQ0E folder view: https://docs.google.com/folderview?id=0B1OpcVV-_vRSX1Ewc29kWUdSQ0E&usp=drivesdk INFO:main:Split the text files to sync with the original images INFO:main:Joining text files based on Column No INFO:main: Moving all temp files to OCR-জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf-temp-2016-05-23-22-31-29

INFO:main:Running mv folder_.log currentfile.pdf docdata.txt pg.pdf page* txt* "OCR-জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf-temp-2016-05-23-22-31-29"
mv: cannot stat ‘docdata.txt’: No such file or directory
mv: cannot stat ‘pg.pdf’: No such file or directory
mv: cannot stat ‘page_’: No such file or directory
mv: cannot stat ‘txt’: No such file or directory
INFO:main:Merged all OCRed files to all_text_for_জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf.txt
INFO:main:Making a copy of all text files to text-for-জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf
INFO:main:Running cp .txt text-for-জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf
sh: 1: Syntax error: "(" unexpected
INFO:main: Deleting the Temp folder in Google Drive OCR-জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf-temp-2016-05-23-22-31-29

INFO:main:Running gdrm.py 0B1OpcVV-_vRSX1Ewc29kWUdSQ0E INFO:main:

Done. Check the text files start with text_forpage INFO:main:

The PDF files and result text files are equval. Now running the mediawiki_uploader.py script

INFO:main:Running do_ocr.py 1.54 INFO:root:Operating system = "Ubuntu 14.04.4 LTS"

INFO:main:URL = https://upload.wikimedia.org/wikisource/bn/c/c9/%E0%A6%9C%E0%A7%80%E0%A6%AC%E0%A6%A8%E0%A6%BE%E0%A6%A8%E0%A6%A8%E0%A7%8D%E0%A6%A6_%E0%A6%B8%E0%A6%AE%E0%A6%97%E0%A7%8D%E0%A6%B0_%28%E0%A6%AA%E0%A7%8D%E0%A6%B0%E0%A6%A5%E0%A6%AE_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1%29.pdf INFO:main:Columns = 1 INFO:main:Wiki Username = JoyBot INFO:main:Wiki Password = Not logging the password INFO:main:Wiki Source Language Code = bn INFO:main:Edit Summary = OCRed INFO:main:File Name = জীবনানন্দসমগ্র(প্রথম_খণ্ড).pdf INFO:main:File Type = pdf INFO:main:Original URL = https://upload.wikimedia.org/wikisource/bn/c/c9/জীবনানন্দ_সমগ্র_(প্রথম_খণ্ড).pdf INFO:main:Wiki URL = https://bn.wikisource.org/w/api.php INFO:root:Login Status = True INFO:root:

Logged in to https://bn.wikisource.org INFO:root:Checking for bot access rights INFO:root:The user JoyBot has bot access. INFO:root: Done. Uploaded all text files to wiki source

sh: 1: Syntax error: "(" unexpected
sh: 1: Syntax error: "(" unexpected
sh: 1: Syntax error: "(" unexpected

jayantanth commented 8 years ago

Tried with a DjVu file:

jayanta@jayanta-Inspiron-3541:~/OCR$ python do_ocr.py
INFO:main:Running do_ocr.py 1.54
INFO:root:Operating System = "Ubuntu 14.04.4 LTS"

INFO:main:URL = https://upload.wikimedia.org/wikisource/bn/e/e3/2.djvu INFO:main:Columns = 1 INFO:main:Wiki Username = JoyBot INFO:main:Wiki Password = Not logging the password INFO:main:Wiki Source Language Code = as INFO:main:Keep Temp folder in Google Drive = no INFO:main:Original URL = https://upload.wikimedia.org/wikisource/bn/e/e3/2.djvu INFO:main:File Name = 2.djvu INFO:main:File Type = djvu INFO:main:Created Temp folder OCR-2.djvu-temp-2016-05-23-23-08-26 INFO:root:2.djvu Already Exists. Skipping the download. INFO:root:Found PDF version. Skipping DJVU to PDF conversion INFO:main:Aligining the Pages of PDF file.

INFO:main:Running mutool poster -x 1 "2.pdf" currentfile.pdf
error: cannot recognize version marker
warning: trying to repair broken xref
error: cannot tell in file
error: cannot open document
error: cannot load document '2.pdf'
uncaught exception: cannot load document '2.pdf'
INFO:main:Spliting the PDF into single pages.

Error: Unable to find file.
Error: Failed to open PDF file: currentfile.pdf
Done. Input errors, so no output created.
INFO:main:Running pdftk currentfile.pdf burst
INFO:main:Joining the PDF files ...

INFO:main: Creating a folder in Google Drive to upload files. Folder Name : OCR-2.djvu-temp-2016-05-23-23-08-26

INFO:main:Running gdmkdir.py "OCR-2.djvu-temp-2016-05-23-23-08-26" | tee folder_in_google_drive.log id: 0B1OpcVV-_vRSWnJ2NnpjMUZ2X1U drive view: https://drive.google.com/drive/folders/0B1OpcVV-_vRSWnJ2NnpjMUZ2X1U folder view: https://docs.google.com/folderview?id=0B1OpcVV-_vRSWnJ2NnpjMUZ2X1U&usp=drivesdk INFO:main:Split the text files to sync with the original images INFO:main:Joining text files based on Column No INFO:main: Moving all temp files to OCR-2.djvu-temp-2016-05-23-23-08-26

INFO:main:Running mv folder_.log currentfile.pdf docdata.txt pg.pdf page* txt* "OCR-2.djvu-temp-2016-05-23-23-08-26"
mv: cannot stat ‘currentfile.pdf’: No such file or directory
mv: cannot stat ‘docdata.txt’: No such file or directory
mv: cannot stat ‘pg.pdf’: No such file or directory
mv: cannot stat ‘page_’: No such file or directory
mv: cannot stat ‘txt’: No such file or directory
INFO:main:Merged all OCRed files to all_text_for_2.djvu.txt
INFO:main:Making a copy of all text files to text-for-2.djvu
INFO:main:Running cp .txt text-for-2.djvu
INFO:main: Deleting the Temp folder in Google Drive OCR-2.djvu-temp-2016-05-23-23-08-26

INFO:main:Running gdrm.py 0B1OpcVV-_vRSWnJ2NnpjMUZ2X1U INFO:main:

Done. Check the text files start with text_forpage INFO:main:

The PDF files and result text files are equval. Now running the mediawiki_uploader.py script

INFO:main:Running do_ocr.py 1.54 INFO:root:Operating system = "Ubuntu 14.04.4 LTS"

INFO:main:URL = https://upload.wikimedia.org/wikisource/bn/e/e3/2.djvu INFO:main:Columns = 1 INFO:main:Wiki Username = JoyBot INFO:main:Wiki Password = Not logging the password INFO:main:Wiki Source Language Code = as INFO:main:Edit Summary = OCRed INFO:main:File Name = 2.djvu INFO:main:File Type = djvu INFO:main:Original URL = https://upload.wikimedia.org/wikisource/bn/e/e3/2.djvu INFO:main:Wiki URL = https://as.wikisource.org/w/api.php INFO:root:Login Status = True INFO:root:

Logged in to https://as.wikisource.org INFO:root:Checking for bot access rights INFO:root:The user JoyBot does not have bot access INFO:root: Done. Uploaded all text files to wiki source

mv: cannot stat ‘upload-*’: No such file or directory
jayanta@jayanta-Inspiron-3541:~/OCR$

jayantanth commented 8 years ago

The same error occurs with both PDF and DjVu files; it needs to be fixed urgently. I also tried with a different file name.

jayantanth commented 8 years ago

OK, I have fixed this issue. The problem was not in the script but in the PDF file itself. The PDF gave the error "There was a problem reading this document (109)", so mutool probably cannot open the file either.

After extracting all pages to PNG, I created the PDF again from the PNG files. The new file works fine.
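Since the logs above show the run continuing after mutool failed, a guard around that step would surface a broken PDF earlier. This is a minimal sketch, not the code in do_ocr.py: the function names are hypothetical, and it assumes (not confirmed in this thread) that a failed `mutool poster` either exits with a non-zero status or leaves no output file. Passing the command as a list also keeps the shell from parsing file names that contain parentheses.

```python
import os
import subprocess


def poster_command(src_pdf, dst_pdf="currentfile.pdf"):
    # Same mutool invocation that appears in the log, as an argument list
    # so that no shell ever interprets "(" or ")" in the file name.
    return ["mutool", "poster", "-x", "1", src_pdf, dst_pdf]


def step_succeeded(returncode, dst_pdf):
    # Treat the step as successful only if the exit status is zero
    # and a non-empty output file actually exists.
    return (
        returncode == 0
        and os.path.isfile(dst_pdf)
        and os.path.getsize(dst_pdf) > 0
    )


def align_pages(src_pdf, dst_pdf="currentfile.pdf"):
    # Hypothetical wrapper: stop the pipeline instead of carrying on
    # with a missing currentfile.pdf.
    result = subprocess.run(poster_command(src_pdf, dst_pdf))
    if not step_succeeded(result.returncode, dst_pdf):
        raise RuntimeError(
            "mutool could not process %r; the PDF may be damaged" % src_pdf
        )
    return dst_pdf
```

With a check like this, a damaged input would stop the run at the alignment step rather than at the later pdftk and mv steps.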

tha-uzhavan commented 8 years ago

jayan! Which commands are you using to convert?

jayantanth commented 8 years ago

@tha-uzhavan I used the convert command of ImageMagick.

Example: convert -density 300 MyImage.pdf MyImage.png

Then convert the images back to a PDF: convert MyallImage.png Mynew.pdf
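The two steps above could be scripted roughly as follows. This is only a sketch under assumptions: the function names, the `page` prefix and the `%04d` numbering pattern are illustrative choices, not part of the original commands, and newer ImageMagick installations may expose the tool as `magick` instead of `convert`.

```python
import glob
import subprocess


def rasterize_command(src_pdf, prefix="page", density=300):
    # One PNG per page; the zero-padded counter keeps the pages in order
    # when the file names are sorted alphabetically.
    return ["convert", "-density", str(density), src_pdf, prefix + "-%04d.png"]


def rebuild_command(png_files, out_pdf):
    # Join the page images, in sorted order, into a single new PDF.
    return ["convert"] + sorted(png_files) + [out_pdf]


def rebuild_pdf(src_pdf, out_pdf, prefix="page", density=300):
    # Step 1: render every page of the problematic PDF to PNG.
    subprocess.run(rasterize_command(src_pdf, prefix, density), check=True)
    pages = glob.glob(prefix + "-*.png")
    if not pages:
        raise RuntimeError("no page images were produced from %r" % src_pdf)
    # Step 2: build a fresh PDF from the rendered pages.
    subprocess.run(rebuild_command(pages, out_pdf), check=True)
    return out_pdf
```

For example, `rebuild_pdf("MyImage.pdf", "Mynew.pdf")` would run both conversions in sequence. The rebuilt PDF contains only page images, so any text layer in the original is lost, which should not matter for a file that is about to be OCRed.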

tshrinivasan commented 6 years ago

@tha-uzhavan do you still have this issue?

Shall we close this issue?