ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0
3.41k stars 590 forks source link

Python3 compatibility #319

Open kba opened 5 years ago

kba commented 5 years ago

This is another go to make most of the engine compatible with python3, because ocropy is the last python2 holdout in the open source ecosphere. Even with development effectively ceased, it's still widely used and should at least be forward-compatible with python 3.4+.

Basically, we don't want to keep supporting Python2 in our stack just for ocropy :-)

Uses six as a compatibility library to handle things like unicode/byte strings, urlopen, pickle.

Updates the CircleCI configuration to the current format and tests across 2.7 and 3.4-3.7.

Since the test suite isn't that intuitive, I cannot guarantee I didn't break stuff.

I'm somewhat baffled that the default model (pickled with python2) would work for the python3 variant of rpred. I suspect an error in the setup, but see e.g the output in https://circleci.com/gh/OCR-D/ocropy/38, ctrl-f for # loading object ./models/en-default.pyrnn.gz. I remember @mittagessen predicting this to fail because of some of the old-style classes used but it doesn't...

This also updates scipy, numpy, matplotlib and introduces imageio to work against some of the DeprecationWarnings that have turned into Errors in more recent versions of the libraries.

Some hacky type conversions (like converting boolean image arrays to float32) need further scrutiny.

Feedback is appreciated.

@tmbdev @zuphilip @syedsaqibbukhari @QuLogic @amitdo @mittagessen @wrznr @finkf

kba commented 5 years ago

Published this PR as a fork to pypi to test integrations. If anyone's interested, it can be installed with

pip install ocrd-fork-ocropy==1.4.0a3
mittagessen commented 5 years ago

I only ran a cursory test but it seems to work.

I'm fairly certain that unpickling old-style classes on python 3 didn't work at some point in the past but it eminently does now. You might want to change all the class definitions in lstm.py to new style as python 3 classes auto-inherit from object now.

amitdo commented 5 years ago

I'm fairly certain that unpickling old-style classes on python 3 didn't work at some point in the past but it eminently does now

https://bugs.python.org/issue5180

You might want to change all the class definitions in lstm.py to new style as python 3 classes auto-inherit from object now.

https://portingguide.readthedocs.io/en/latest/classes.html

mittagessen commented 5 years ago

https://portingguide.readthedocs.io/en/latest/classes.html

Sorry, I might've been a bit unclear. Old-style and new-style definitions for these classes are equivalent on py3 and updating these to the now common class foo(object) syntax is little more than a nice hint that the code has been touched in the last ten years. I doubt anybody has ever done multiple inheritance on these classes, so it shouldn't break anything on py2.7

zuphilip commented 5 years ago

Hey @kba this looks very interesting! I agree that it would be good to make sure that ocropy also runs on python3. It seems that it is now easier to achieve python2 and python3 compatibility with libraries like six. Let me ask some general questions before looking deeper at the code:

  1. Is it okay to first look into this PR also it will create merge conflicts with possibly every other PR? (I think: yes)
  2. Should we be concerned about @tmbdev that he could not want this PR? (I don't think so.)
  3. Is the pip command still the same after your last commits?
  4. Did you see any difference in your tests for the new code or under Python 3?
  5. Is this PR ready to review or are you still working on it (e.g. hacky type conversions)?
kba commented 5 years ago

Is it okay to first look into this PR also it will create merge conflicts with possibly every other PR? (I think: yes)

The conflicts will be substantial but trivial since structurally nothing should change, just syntax and some API calls in a lot of places.

Should we be concerned about @tmbdev that he could not want this PR? (I don't think so.)

Since this makes no essential changes except future-proofing the code, I wouldn't think so.

Is the pip command still the same after your last commits?

No, just updated it to c773dd2

pip install ocrd-fork-ocropy==1.4.0a3

Did you see any difference in your tests for the new code or under Python 3?

I did not, it works surprisingly (suspiciously :D) well across versions. I've slightly extended run-test-ci but the codebase being what it is, there is no guarantee nothing did break. If something broke through my changes, it broke for py2 and py3 since the log files from the test suite look the same to me.

Is this PR ready to review or are you still working on it (e.g. hacky type conversions)?

It is ready for review and I'm esp. happy for tips on how to do proper type conversions with numpy and PIL :) If you have input on what should be tested to avoid breaking stuff, also appreciated.

mittagessen commented 5 years ago

happy for tips on how to do proper type conversions with numpy and PIL

The "proper" way for PIL -> np.array conversion is to explicitly set the image mode to 'L' or 'RGB' to ensure getting a uint8_t array with channel depth 1/3 returned. For the other way around the nested switches in array2pil can probably be replaced by a simple Image.fromarray(ar) as it automatically determines image mode from shape and dtype of the array.