goncalopp / simple-ocr-opencv

A simple python OCR engine using opencv
GNU Affero General Public License v3.0
525 stars 175 forks source link

Merge RedFantom/python3 into master for Python 3 compatibility #23

Closed RedFantom closed 7 years ago

RedFantom commented 7 years ago

Using the 2to3 tool, manual changes and the fix for the int division that @goncalopp found and mentioned in #14, the changes in this branch make the code compatible with Python 3.

The tests pass both on Linux (Kubuntu 17.04 64-bit) and Windows (Windows 8.1 64-bit) with both Python 2.7 and Python 3.5 for me.

RedFantom commented 7 years ago

It appears Travis-CI has problems with the Python 3 environments. Should we remove those from the file? I can't find any issue reports about it in the travis-ci repo.

goncalopp commented 7 years ago

About Travis: found this issue https://github.com/travis-ci/travis-ci/issues/4260

Can you try with python 3.2 only? It's not ideal, but definitely better than not having CI for python3

RedFantom commented 7 years ago

OK, so the Travis-CI Linux distribution doesn't use the manylinux python-opencv wheel, unfortunately. I'll try compiling the stuff again.

RedFantom commented 7 years ago

OK, the branch is quite a bit of mess. I will clean up and do a push --force. Just so you know!

RedFantom commented 7 years ago

So, the tests now fail on both my own machine, and my Kubuntu VM, and Travis-CI, all with this error:

  File "/home/travis/build/RedFantom/simple-ocr-opencv/tests/test_ocr.py", line 32, in test_ocr_unicode
    self._test_ocr(ImageFile('unicode1'), ImageFile('unicode1'))
  File "/home/travis/build/RedFantom/simple-ocr-opencv/files.py", line 61, in __init__
    self.ground.read()
  File "/home/travis/build/RedFantom/simple-ocr-opencv/files.py", line 38, in read
    self.classes, self.segments = read_boxfile(self.path)
  File "/home/travis/build/RedFantom/simple-ocr-opencv/tesseract_utils.py", line 15, in read_boxfile
    return classes_to_numpy(classes), segments_to_numpy(segments)
  File "/home/travis/build/RedFantom/simple-ocr-opencv/classification.py", line 20, in classes_to_numpy
    int_classes = array.array("I", "".join(classes).encode('utf-32')[4:])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 0: ordinal not in range(128)

I will try to find the cause, but it probably has something to do with the recent changes to master, because they weren't there before.

Update: Oh, of course, it's actually quite obvious: the new tests with the unicode characters are throwing the errors.

RedFantom commented 7 years ago

OK, so the changes from chr to unichr did not affect the raised errors, unfortunately.

Also, I have a question, @goncalopp: In Python 3, all strings are unicode. Doesn't chr do exactly the same as the old unichr and isn't str the same call as unicode was in Python 2?

goncalopp commented 7 years ago

Oh, of course, it's actually quite obvious: the new tests with the unicode characters are throwing the errors.

Yes - I added the tests because I knew your code would brake them ;) I think I annotated all the places that need to be changed in the review

goncalopp commented 7 years ago

Also, I have a question, @goncalopp: In Python 3, all strings are unicode. Doesn't chr do exactly the same as the old unichr and isn't str the same call as unicode was in Python 2?

Yes, yes and yes. It's just that the code still needs to work in python2 :) At least if you want to merge it into master In python2, chr and str need to be in the ascii range. You can use six to have python2/3 cross-compatibility

RedFantom commented 7 years ago

@goncalopp Ah, of course.

Passing an encoding argument fixed the UnicodeDecodeError, but now there another error thrown with the unicode test on Windows:

  File "C:\Users\RedFantom\Source Code\simple-ocr-opencv\tests\test_ocr.py", line 32, in test_ocr_unicode
    self._test_ocr(ImageFile('unicode1'), ImageFile('unicode1'))
  File "C:\Users\RedFantom\Source Code\simple-ocr-opencv\tests\test_ocr.py", line 25, in _test_ocr
    self.assertEqual(chars, reconstruct_chars(ground_truth))
AssertionError: '\u16a0\u16c7ðþ\u03b7\u03b3\u03bb\u03c3\u03b1\u043b\u043b\u044c\u0433\u044f\u10d5\u10d4\u10de\u10d8\u10e1\u0baf
çæ\u0250\u025c\u053f\u0580\u0576\u0561\u0574\u0631\u0649\u0d91\u0dba\u03b3\u0db1\u0e94\u0e8d\u6211\u4e0b\u800c\u8eab\u1546\u152
d\u0259\u0153\u20ac' != '\u16a0\u16c7ðþ\u03b7\u03b3\u03bb\u03c3\u03b1\u0434\u043b\u044c\u0433\u044f\u10d5\u10d4\u10de\u10d8\u10
e1\u0bafçæ\u0250\u025c\u053f\u0580\u0576\u0561\u0574\u0631\u0649\u0d91\u0dba\u0db1\u0e94\u0e8d\u6211\u4e0b\u800c\u8eab\u1546\u1
52d\u0259\u0153\u20ac'
- \u16a0\u16c7ðþ\u03b7\u03b3\u03bb\u03c3\u03b1\u043b\u043b\u044c\u0433\u044f\u10d5\u10d4\u10de\u10d8\u10e1\u0bafçæ\u0250\u025c\
u053f\u0580\u0576\u0561\u0574\u0631\u0649\u0d91\u0dba\u03b3\u0db1\u0e94\u0e8d\u6211\u4e0b\u800c\u8eab\u1546\u152d\u0259\u0153\u
20ac
goncalopp commented 7 years ago

The same code on py3+cv3 on windows and linux has different results? That's a bummer... Here's the text I got from your message:

ᚠᛇðþηγλσαлльгяვეპისயçæɐɜԿրնամرىඑයγනດຍ我下而身ᕆᔭəœ€ ᚠᛇðþηγλσαдльгяვეპისயçæɐɜԿրնամرىඑයනດຍ我下而身ᕆᔭəœ€ It looks like it's misclassifying л for д, and adding a y ? I can kind of understand the first, but the second is a bit weird

Try running example.py with unicode1 as both the train and test files - it might give you some insight. Tweaking parameters there might be helpful.

RedFantom commented 7 years ago

Perhaps it has to do with this part of the box file?

ն 71 77 19 25 0
ա 124 78 19 23 0
մ 8 83 16 24 0
ر 44 83 25 19 0
ى 89 87 20 18 0
එ 107 91 16 17 0
ය 143 83 20 18 0

It swapped the characters and 0's for two characters, it seems. At least, this is what it looks like in PyCharm. In Notepad++, it is completely messed up, and in Atom too it shows like in PyCharm. In Atom I can't edit the lines with those characters on it, and with PyCharm I can only delete them. Would this be a problem with the fact that I have a Dutch Windows installation that doesn't support these characters at all or something?

goncalopp commented 7 years ago

It swapped the characters and 0's for two characters, it seems. At least, this is what it looks like in PyCharm.

As far as I can tell (with a hex editor and python), the box file is correct. The characters appear swapped because some editors (including github's) detect those two letters (correctly) as being arabic, and reverse the (visualization of the) entire line from right to left.

Here's how those 2 lines look in python:

u'\u0631 44 83 25 19 0',
u'\u0649 89 87 20 18 0',

So I don't think this is related (unless your python is giving you different results)

RedFantom commented 7 years ago

@goncalopp So, the problem is occurring on AppVeyor too. It is probably a decode error, not a recognition error. The error does not occur when running the master branch tests on my machine, so it does have something to do with the six.text_type against the Python 2 str or six.chr compared to unichr in Python 2.

Update: Using past.types.str and past.builtins.unichr also don't make the code work on Windows.

Update: Or perhaps the errors are completely unrelated to the branch changes after all. In that case, in my opinion, this branch should be merged regardlessly if you're happy with the result, and we should take a look at the errors in another branch.

goncalopp commented 7 years ago

@RedFantom

Or perhaps the errors are completely unrelated to the branch changes after all. In that case, in my opinion, this branch should be merged regardlessly if you're happy with the result, and we should take a look at the errors in another branch.

Yes, since this passes on py2/3 cv2/3 on linux, I'm happy merging it in :) I left you a couple of comments on minor things

RedFantom commented 7 years ago

@goncalopp As I am away from my computer at the moment ( :( ) , I can't implement the changes myself at the moment, but I do agree with all of them. If you want, you can make them, or I will later this week.

I am sorry there have to be so many changes, I'm learning a lot about developing a library from you!

RedFantom commented 7 years ago

@goncalopp The branch fails to complete the tests because of the from six.moves import input import. Without it, it does just fine.

goncalopp commented 7 years ago

This is looking good, thank you!

Actually, I'm sorry for being so picky! But I'm glad that you're learning things in the process :) The code is improving a lot thanks to you

The reason input mocking isn't working is that you're importing it directly into the module (grounding). mock.path works by overriding attribute access, so it only works when the attribute (moves.input) happens after the patching. I've added small changes on 0608d72b67b215ceec95c21085e462b57da61228. Can you please take a look and, if you agree, cherry-pick it?

RedFantom commented 7 years ago

@goncalopp That sounds logical! I have cherry-picked your commit, it looks good and it doesn't make any changes to the test results. Thank you for checking that out!

Being picky is a good thing when it comes to libraries: the code must be readable and as consistent as possible to allow other users to modify the code without having to work through a lot of pad practices!