tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

File not downloaded completely, but script starts the next steps #61

Open jayantanth opened 8 years ago

jayantanth commented 8 years ago

jayanta@jayanta-Inspiron-3541:~/OCR2$ python do_ocr.py
INFO:main:Running do_ocr.py 1.50
INFO:root:Operating System = "Ubuntu 14.04.3 LTS"

INFO:main:URL = https://upload.wikimedia.org/wikipedia/commons/e/ea/%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%8D%E0%A6%AC%E0%A6%95%E0%A7%8B%E0%A6%B7_%E0%A6%B7%E0%A6%B7%E0%A7%8D%E0%A6%A0_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1.djvu
INFO:main:Columns = 1
INFO:main:Wiki Username = JoyBot
INFO:main:Wiki Password = Not logging the password
INFO:main:Wiki Source Language Code = bn
INFO:main:Keep Temp folder in Google Drive = yes
INFO:main:Original URL = https://upload.wikimedia.org/wikipedia/commons/e/ea/বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:File Name = বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:File Type = djvu
INFO:main:Created Temp folder OCR-বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu-temp-2016-02-15-18-20-40

Downloading the file বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu

INFO:main:Downloading the file বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): upload.wikimedia.org
/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/util/ssl_.py:315: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
SNIMissingWarning
/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/util/ssl_.py:120: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
[################################] 39935/82683 - 00:19:22
INFO:main:Download Completed
INFO:main:Found a djvu file. Converting to PDF file.

ddjvu: Unexpected End Of File.
ddjvu: Unexpected End Of File.
ddjvu: Cannot decode page 378.
INFO:main:Running ddjvu --format=pdf "বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu" "বিশ্বকোষ_ষষ্ঠ_খণ্ড".pdf
INFO:main:Aligining the Pages of PDF file.

INFO:main:Running mutool poster -x 1 "বিশ্বকোষ_ষষ্ঠ_খণ্ড.pdf" currentfile.pdf
error: cannot open বিশ্বকোষ_ষষ্ঠ_খণ্ড.pdf
error: cannot load document 'বিশ্বকোষ_ষষ্ঠ_খণ্ড.pdf'
uncaught exception: cannot load document 'বিশ্বকোষ_ষষ্ঠ_খণ্ড.pdf'
INFO:main:Spliting the PDF into single pages.

Error: Unable to find file.
Error: Failed to open PDF file: currentfile.pdf
Done. Input errors, so no output created.
INFO:main:Running pdftk currentfile.pdf burst
INFO:main:Joining the PDF files ...

INFO:main: Creating a folder in Google Drive to upload files. Folder Name : OCR-বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu-temp-2016-02-15-18-20-40

INFO:main:Running gdmkdir.py "OCR-বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu-temp-2016-02-15-18-20-40" | tee folder_in_google_drive.log
id: 0B1OpcVV-_vRSX19LWm8tM1FGd0E
drive view: https://drive.google.com/drive/folders/0B1OpcVV-_vRSX19LWm8tM1FGd0E
folder view: https://docs.google.com/folderview?id=0B1OpcVV-_vRSX19LWm8tM1FGd0E&usp=drivesdk
INFO:main:Split the text files to sync with the original images
INFO:main:Joining text files based on Column No
INFO:main: Moving all temp files to OCR-বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu-temp-2016-02-15-18-20-40

INFO:main:Running mv folder_.log currentfile.pdf docdata.txt pg.pdf page* txt* "OCR-বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu-temp-2016-02-15-18-20-40"
mv: cannot stat ‘currentfile.pdf’: No such file or directory
mv: cannot stat ‘docdata.txt’: No such file or directory
mv: cannot stat ‘pg.pdf’: No such file or directory
mv: cannot stat ‘page_’: No such file or directory
mv: cannot stat ‘txt’: No such file or directory
INFO:main:Merged all OCRed files to all_text_for_বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu.txt
INFO:main:Making a copy of all text files to text-for-বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:Running cp .txt text-for-বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:

Done. Check the text files start with text_forpage INFO:main:

The PDF files and result text files are equval. Now running the mediawiki_uploader.py script

INFO:main:Running do_ocr.py 1.50
INFO:root:Operating system = "Ubuntu 14.04.3 LTS"

INFO:main:URL = https://upload.wikimedia.org/wikipedia/commons/e/ea/%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%8D%E0%A6%AC%E0%A6%95%E0%A7%8B%E0%A6%B7_%E0%A6%B7%E0%A6%B7%E0%A7%8D%E0%A6%A0_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1.djvu
INFO:main:Columns = 1
INFO:main:Wiki Username = JoyBot
INFO:main:Wiki Password = Not logging the password
INFO:main:Wiki Source Language Code = bn
INFO:main:Edit Summary = OCRed
INFO:main:File Name = বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:File Type = djvu
INFO:main:Original URL = https://upload.wikimedia.org/wikipedia/commons/e/ea/বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:Wiki URL = https://bn.wikisource.org/w/api.php
INFO:root:Login Status = True
INFO:root:

Logged in to https://bn.wikisource.org
INFO:root:Checking for bot access rights
INFO:root:The user JoyBot has bot access.
INFO:root: Done. Uploaded all text files to wiki source

mv: cannot stat ‘upload-*’: No such file or directory

jayantanth commented 8 years ago

First, look at the download step. The progress bar stops at 39935 of 82683, yet the script reports that the download completed:

[################################] 39935/82683 - 00:19:22
INFO:main:Download Completed
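One possible guard against this, sketched below, is to compare the size of the file on disk with the size the server announced before logging that the download completed. This is only a sketch, not the code in do_ocr.py: the helper name `check_download` is hypothetical, and it assumes the server sends a `Content-Length` header and that the response body is not transparently compressed (otherwise the two numbers would not be comparable).

```python
import os


def check_download(path, content_length):
    """Raise IOError if the file at `path` is not `content_length` bytes long.

    `content_length` is the value of the HTTP Content-Length header
    (a string or an int). Hypothetical helper, not part of do_ocr.py.
    """
    expected = int(content_length)
    actual = os.path.getsize(path)
    if actual != expected:
        raise IOError(
            "Incomplete download: got %d of %d bytes" % (actual, expected)
        )
    return actual
```

Called right after the download loop, a check like this would stop the run with a clear error instead of handing a truncated .djvu file to ddjvu.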

ravidreams commented 8 years ago

@jayantanth Please attach the log file when reporting issues. The full error message need not be copy-pasted. Thanks.

tshrinivasan commented 8 years ago

I get a similar issue with this file too.

shrinivasan@shrinivasan-laptop:~/dev/wiki/wiki2ocr-testing/test8$ python do_ocr.py
INFO:main:Running do_ocr.py 1.53
INFO:root:Operating System = "Ubuntu 15.04"

INFO:main:URL = https://upload.wikimedia.org/wikipedia/commons/e/ea/%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%8D%E0%A6%AC%E0%A6%95%E0%A7%8B%E0%A6%B7_%E0%A6%B7%E0%A6%B7%E0%A7%8D%E0%A6%A0_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1.djvu
INFO:main:Columns = 1
INFO:main:Wiki Username = Tshrinivasan
INFO:main:Wiki Password = Not logging the password
INFO:main:Wiki Source Language Code = bn
INFO:main:Keep Temp folder in Google Drive = no
INFO:main:Original URL = https://upload.wikimedia.org/wikipedia/commons/e/ea/বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:File Name = বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:File Type = djvu
INFO:main:Created Temp folder OCR-বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu-temp-2016-02-23-06-54-20

Downloading the file বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu

INFO:main:Downloading the file বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): upload.wikimedia.org
[################################] 80539/82683 - 00:11:54
Traceback (most recent call last):
  File "do_ocr.py", line 138, in <module>
    for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length/1024) + 1):
  File "/usr/local/lib/python2.7/dist-packages/clint/textui/progress.py", line 115, in bar
    for i, item in enumerate(it):
  File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 653, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/response.py", line 256, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/response.py", line 186, in read
    data = self._fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 611, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/contrib/pyopenssl.py", line 188, in recv
    data = self.connection.recv(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 995, in recv
    self._raise_ssl_error(self._ssl, result)
  File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 851, in _raise_ssl_error
    raise ZeroReturnError()
OpenSSL.SSL.ZeroReturnError

python do_ocr.py
INFO:main:Running do_ocr.py 1.53
INFO:root:Operating System = "Ubuntu 15.04"

INFO:main:URL = https://upload.wikimedia.org/wikipedia/commons/e/ea/%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%8D%E0%A6%AC%E0%A6%95%E0%A7%8B%E0%A6%B7_%E0%A6%B7%E0%A6%B7%E0%A7%8D%E0%A6%A0_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1.djvu
INFO:main:Columns = 1
INFO:main:Wiki Username = Tshrinivasan
INFO:main:Wiki Password = Not logging the password
INFO:main:Wiki Source Language Code = bn
INFO:main:Keep Temp folder in Google Drive = no
INFO:main:Original URL = https://upload.wikimedia.org/wikipedia/commons/e/ea/বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:File Name = বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:main:File Type = djvu
INFO:main:Created Temp folder OCR-বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu-temp-2016-02-23-07-12-25

Downloading the file বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu

INFO:main:Downloading the file বিশ্বকোষ_ষষ্ঠ_খণ্ড.djvu
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): upload.wikimedia.org
[################################] 68425/82683 - 00:12:10
Traceback (most recent call last):
  File "do_ocr.py", line 138, in <module>
    for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length/1024) + 1):
  File "/usr/local/lib/python2.7/dist-packages/clint/textui/progress.py", line 115, in bar
    for i, item in enumerate(it):
  File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 653, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/response.py", line 256, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/response.py", line 186, in read
    data = self._fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 611, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/contrib/pyopenssl.py", line 188, in recv
    data = self.connection.recv(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 995, in recv
    self._raise_ssl_error(self._ssl, result)
  File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 851, in _raise_ssl_error
    raise ZeroReturnError()
OpenSSL.SSL.ZeroReturnError
shrinivasan@shrinivasan-laptop:~/dev/wiki/wiki2ocr-testing/test8$
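In both runs above, the connection appears to have been closed before the whole file arrived, and the exception ended the script. A small retry wrapper is one way to make the download step tolerate such drops. The sketch below is an illustration under stated assumptions: `download_with_retries` and its `fetch` argument are hypothetical names, with `fetch` standing for a zero-argument callable that performs one full download attempt and raises when the connection is lost.

```python
import time


def download_with_retries(fetch, attempts=3, delay_seconds=5,
                          retry_on=(Exception,)):
    """Call fetch() until it succeeds or the attempts are used up.

    `fetch` is a hypothetical zero-argument callable that performs one
    complete download attempt. `retry_on` is a tuple of exception types
    that should trigger another attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as error:
            last_error = error
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: re-raise the most recent error.
    raise last_error
```

In the script, `retry_on` could be narrowed to the errors actually seen here (for example `OpenSSL.SSL.ZeroReturnError`) so that unrelated bugs are not retried silently.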

ravidreams commented 8 years ago

We hit this issue with a Tamil file. Only 1 MB of the 14 MB file was downloaded. The file could not be opened locally; the system reported it as a damaged PDF. Even so, the tool split it into 800+ pages and completed the OCR :) The output text is completely unrelated to the book, which actually has 223 pages.

We ran it again and it worked: https://ta.wikisource.org/s/1314

Noting it here so that the tool can be made more robust if we move to the cloud.
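On robustness: in the first log in this thread, the ddjvu, mutool, pdftk and mv steps all reported errors, yet the script carried on to the upload step. A minimal sketch of a fail-fast step runner follows. It assumes the external tools signal failure with a non-zero exit status, which this thread does not confirm for each tool, so the sketch also checks that the expected output file exists. `run_step` is a hypothetical helper, not a function in do_ocr.py.

```python
import os
import subprocess


def run_step(cmd, expected_output=None):
    """Run one pipeline command and stop the pipeline if it fails.

    `cmd` is an argument list such as ["ddjvu", "--format=pdf", src, dst].
    subprocess.check_call raises CalledProcessError on a non-zero exit
    status. If `expected_output` is given, the file must exist afterwards.
    """
    subprocess.check_call(cmd)
    if expected_output is not None and not os.path.isfile(expected_output):
        raise RuntimeError(
            "Step %r finished but did not create %s" % (cmd, expected_output)
        )
```

With each step wrapped this way, a truncated download that makes the DjVu-to-PDF conversion fail would end the run there, rather than producing empty or unrelated OCR output.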