tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128) #19

Closed jayantanth closed 8 years ago

jayantanth commented 8 years ago

mediawiki_uploader_2016-01-05-08-19-35_log.txt mediawiki_uploader.py

Error Log

jayanta@jayanta-Inspiron-3541:~$ cd OCR
jayanta@jayanta-Inspiron-3541:~/OCR$ python mediawiki_uploader.py
INFO:main:Running mediawiki_uploader.py Version 1.31
INFO:main:URL = https://upload.wikimedia.org/wikisource/bn/2/2f/Testocrbengali.pdf
INFO:main:Columns = 1
INFO:main:Wiki Username = jayantanth
INFO:main:Wiki Password = Not logging the password
INFO:main:Wiki Source Language Code = bn
INFO:main:File Name = Testocrbengali.pdf
INFO:main:File Type = pdf
INFO:main:Original URL = https://upload.wikimedia.org/wikisource/bn/2/2f/Testocrbengali.pdf
INFO:main:Wiki URL = https://bn.wikisource.org/w/api.php
INFO:root:Login Status = True
INFO:root:

Logged in to https://bn.wikisource.org
Traceback (most recent call last):
  File "mediawiki_uploader.py", line 170, in <module>
    pagename = filename + "/" + str(convert_to_indic(wikisource_language_code, pageno))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

mediawiki_uploader_2016-01-05-00-09-30_log.txt

tshrinivasan commented 8 years ago

This is caused by the issues with do_ocr.py not running.

Closing this now.

Reopen, if you still get any issue on mediawiki_uploader.py

jayantanth commented 8 years ago

I still get the same error.

ravidreams commented 8 years ago

mediawiki_uploader_2016-01-05-12-19-54_log.txt

do_OCR.py ran successfully.

Got the same error while running mediawiki_uploader.py

Log attached.

ravidreams commented 8 years ago

Facing the same issue on the or wiki too.

Just wondering if this is a problem caused by the Indian numerals in the page number URLs?
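For anyone hitting this later, here is a minimal sketch of how localized numerals could trigger this exact message. It assumes the script ran under Python 2, where joining a unicode file name with a UTF-8 byte string forces an implicit ASCII decode. `to_bengali_digits` and `build_pagename` are hypothetical stand-ins, not the real `convert_to_indic` or the fix that was actually applied in the repository.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for convert_to_indic(); the real function's
# behaviour is an assumption made for illustration only.
BENGALI_DIGITS = u"০১২৩৪৫৬৭৮৯"


def to_bengali_digits(number):
    """Return the number in Bengali digits as UTF-8 encoded bytes."""
    text = u"".join(BENGALI_DIGITS[int(d)] for d in str(number))
    return text.encode("utf-8")


def build_pagename(filename, pageno):
    """Join the file name and the localized page number as text."""
    # Decode explicitly as UTF-8 instead of relying on an implicit
    # ASCII decode when text and bytes are mixed.
    return filename + u"/" + to_bengali_digits(pageno).decode("utf-8")


encoded = to_bengali_digits(1)

# Decoding the bytes as ASCII mimics what Python 2 does implicitly when a
# unicode string is concatenated with a non-ASCII byte string.
try:
    encoded.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

pagename = build_pagename(u"Testocrbengali.pdf", 12)
print(message)
print(pagename)
```

Bengali digits encoded as UTF-8 start with the byte 0xe0, which would match "byte 0xe0 in position 0" in the traceback. Keeping everything as text and encoding only at the output boundary avoids the implicit decode.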

ravidreams commented 8 years ago

Just confirming that I tested on both bn and or. It works now.