jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

Support Python 3 #39

Open madalu opened 4 years ago

madalu commented 4 years ago

Thanks so much for this excellent software! I have been using it for years to run OCR on scans and it has never failed me.

Would it be possible to add Python 3 support? Unfortunately, Python 2 development has been officially frozen and Python 2 will no longer receive updates: https://www.python.org/doc/sunset-python-2/

blaueente commented 4 years ago

This issue has become worse, as Ubuntu 20.04 specifically does not offer pip for python anymore, so even "manual" non-package installation is becoming very difficult.

stweil commented 3 years ago

@jwilk, you added the "wontfix" label. Would you accept pull requests which replace Python2 by Python3 support?

bastien-roucaries commented 3 years ago

> @jwilk, you added the "wontfix" label. Would you accept pull requests which replace Python2 by Python3 support?

@stweil Go ahead. I will include this patch for Debian and Ubuntu if it is not accepted upstream.

jsbien commented 3 years ago

Great!

bastien-roucaries commented 3 years ago

@jsbien @stweil I can take a pull request here if that is more convenient, but I lack the time to write the patch myself.

bastien-roucaries commented 3 years ago

Fixed by pull request

bastien-roucaries commented 3 years ago

@jsbien @stweil @jwilk @madalu Could you test and review?

jsbien commented 3 years ago

A quick test is OK, thanks. BTW, please update the doc/dependencies file.

Dominic-Mayers commented 3 years ago

@bastien-roucaries In my case, I could not extract the hOCR from a DjVu file using djvu2hocr. It complained that the argument to write was bytes instead of str. Note that the encode method converts a string to bytes in the given encoding. I had to make the following modifications to `lib/cli/djvu2hocr.py`:

At line 331, replace `sys.stdout.write(hocr_header.encode('UTF-8'))` with `sys.stdout.write(hocr_header)`

At line 345, replace `sys.stdout.write(hocr_footer.encode('UTF-8'))` with `sys.stdout.write(hocr_footer)`

At line 277, replace `tree.write(sys.stdout)` with `tree.write(sys.stdout.buffer)`
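
The changes above follow from how Python 3 separates text and bytes on standard streams: the text layer (`sys.stdout`) accepts only `str`, while the underlying binary layer (`sys.stdout.buffer`) accepts only `bytes`. A minimal sketch of that distinction, using an in-memory stand-in for stdout. The `hocr_header` value here is a made-up placeholder, not the real header from ocrodjvu:

```python
import io

# Stand-in for sys.stdout: a text layer wrapped around a binary buffer.
binary_buffer = io.BytesIO()
fake_stdout = io.TextIOWrapper(binary_buffer, encoding='UTF-8')

hocr_header = '<html>'  # placeholder text, not ocrodjvu's actual header

# Text layer: str is accepted, bytes raises TypeError
# (the same kind of error as in the report above).
fake_stdout.write(hocr_header)
try:
    fake_stdout.write(hocr_header.encode('UTF-8'))
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

# Binary layer: flush pending text first, then write bytes directly.
# Code that produces bytes needs this layer, i.e. sys.stdout.buffer.
fake_stdout.flush()
fake_stdout.buffer.write('</html>'.encode('UTF-8'))

result = binary_buffer.getvalue()
```

Flushing the text layer before touching `.buffer` keeps the output in order when text and binary writes are mixed.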

faridcher commented 3 years ago

@bastien-roucaries Yes, @Dominic-Mayers's changes are needed to work around an error. With them it works fine on my Debian machine. This is the error I got without them:

> ~/src/py/ocrodjvu$ djvu2hocr ~/99tech.djvu 
Converting /home/farid/fin/stock/books/murphy/99tech.djvu:
Traceback (most recent call last):
  File "/usr/local/bin/djvu2hocr", line 26, in <module>
    cli.main(sys.argv)
  File "/usr/local/share/ocrodjvu/lib/cli/djvu2hocr.py", line 331, in main
    sys.stdout.write(hocr_header.encode('UTF-8'))
TypeError: write() argument must be str, not bytes

jsbien commented 3 years ago

I confirm.

FYI, I tried to convert the resulting hOCR with hocr2djvused and got "lib.errors.MalformedHocr: malformed hOCR document: page without bounding box information". I understand this is not related to the Python version.

rmast commented 2 years ago

I saw four forks with a Python 3 conversion and merged the successful parts.

The remaining issues are string/bytes issues with the optional ocrad and gocr engines. I guess something has to be done with TextIOWrapper in common.py to adapt the output of tesseract.py, cuneiform.py, ocrad.py and gocr.py.

You can see the remaining issues with

make test

or more specifically:

nosetests tests.ocrodjvu.test_integration:test_ocr

rmast commented 2 years ago

I made the tests for gocr and ocrad work as well. For the gocr output I used BytesIO instead of StringIO. All tests run fine now, and I updated the coverage. As far as I'm concerned anyone could try the python3 branch in my fork.
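
The BytesIO/StringIO distinction mentioned above can be illustrated with a small sketch. This is not the code from the fork: `engine_output` is a hypothetical stand-in for the raw bytes an OCR engine writes to its output pipe, and the wrapping shown is only one possible way to hand such bytes to text-oriented parsing code.

```python
import io

# Hypothetical raw output of an OCR engine; in Python 3 a subprocess pipe
# yields bytes unless it is explicitly opened in text mode.
engine_output = 'Grüße\nsecond line\n'.encode('UTF-8')

# io.StringIO only holds str, so passing bytes to it raises TypeError.
try:
    io.StringIO(engine_output)
    stringio_accepts_bytes = True
except TypeError:
    stringio_accepts_bytes = False

# io.BytesIO holds the bytes as-is ...
binary_stream = io.BytesIO(engine_output)

# ... and io.TextIOWrapper decodes them on read, so line-based parsing
# written for text keeps working.
text_stream = io.TextIOWrapper(binary_stream, encoding='UTF-8')
lines = [line.rstrip('\n') for line in text_stream]
```

Whether to decode at the boundary like this or keep bytes throughout depends on what the downstream parser expects.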

rmast commented 2 years ago

We should probably try to get it working on Python 3.10 as well: https://github.com/jwilk/python-djvulibre/issues/13

FriedrichFroebel commented 2 years ago

In the @rmast fork, neither GitHub Actions nor some of the upstream changes have been integrated yet, and issues are disabled (which is the default for forks). With the planned upstream retirement (#46) of both ocrodjvu and python-djvulibre, it seems like this partly active fork might remain the best choice.

I still regularly use both didjvu and ocrodjvu (although only on Manjaro Linux with Python 2, which causes more and more pain when updating packages), so I thought about actually using the fork. It would probably still need some work to incorporate the upstream changes, as well as some modifications to make it compatible with the latest Python versions (as I already did for didjvu, although that Python 3 fork might become obsolete as well in the far future, because gamera4 relies on the deprecated distutils package). While I can imagine at least maintaining a basic version of ocrodjvu as well, I am not familiar with most of the underlying stuff at the moment.

rmast commented 2 years ago

> Inside the @rmast fork, GitHub Actions have not been integrated yet, as well as some upstream changes, and issues are disabled (which is the default for forks). With the planned upstream retirement (#46) from both ocrodjvu and python-djvulibre, it seems like this partly active fork might remain the best choice.

I forked some of these repositories to protect them against a maintainer taking them offline. I don't know if I will have enough time to be the main maintainer of those forks. I have no experience with GitHub Actions. We could try to do it together. I spent my last summer holiday improving the MRC compression of ocrmypdf using the DjVu tricks from these jwilk repos, using not only tesseract but also easyocr for segmentation details, to move text parts to the foreground. Unfortunately, my first proof of concept was delayed by struggles with Cython and memory management during custom Otsu histograms, so my holiday was over before the POC was live.

By the way, didjvu is not mentioned in the jwilk-retirement-message.

> I still regularly use both didjvu and ocrodjvu (although only on Manjaro Linux with Python 2, which causes more and more pain when updating packages), so I thought about actually using the fork. It would probably need some work to incorporate the upstream changes nevertheless, as well as some modifications to make it compatible with the latest Python versions (as I already did for didjvu, although this Python 3 fork might become obsolete as well in the far future, due to gamera4 relying on the deprecated distutils package). While I can imagine to at least maintain a basic version of ocrodjvu as well, I am not familiar with most of the underlying stuff at the moment.

I'm not an active user of any of these tools; I'm only melancholic about losing useful, well-thought-out algorithms for MRC compression. As with didjvu, where you did most of the migration, I can try to fix things that might confuse you, but be prepared to drop lots of functionality you don't use yourself, unless someone else claims to still be using it. I tend to work on getting comparable open-source functionality into PDF MRC compression. PDF is what I use when I scan a document and share it with my peers.

Gamera4 still has minimal maintenance, so distutils might still be taken care of.

I was able to revive a functional pip-installer for python 2.7 as the main pip-download doesn't support 2.7 anymore.

rmast commented 2 years ago

I read your issues in the Gamera-4 repo. There are more issues, and support might be dropped there as well, mostly because Python is a moving target, just as with these jwilk repos. We might inspect the dependencies of didjvu on Gamera-4. As far as I have looked until now, the main dependency is on the djvu binarizer. Do you actively use any of the other binarizers? My effort this summer was bringing yet another binarizer to life, based on Otsu thresholding of easyocr segments.

FriedrichFroebel commented 2 years ago

> I forked some stuff to protect them for a maintainer taking them offline. I don't know if I will have enough time to be the main maintainer of those forks. I have no experience with Github Actions. We could try to do it together.

There is GitHub Actions support in this upstream repository now, so this might be limited to mostly copy-and-paste, although some changes are required (see my didjvu fork for an example). I might have a look at it and might decide to "modernize" the code as I did for didjvu, if I find enough time to do so.

> By the way, didjvu is not mentioned in the jwilk-retirement-message.

I am aware of that.

> I'm not an active user of any of these tools, only melancholic about losing useful algorithms thought out before for MRC-compression. As with didjvu, where you did most of the migration, I can try to fix things that might confuse you, but be prepared to drop lots of functionality you don't use yourself, unless someone else claims to still be using it.

I use both didjvu and ocrodjvu on a regular basis at the moment - and I might keep maintaining at least the bits which I actually use as far as I am able to. While I am rather familiar with Python development, the actual DJVU and image processing stuff is something I only have a rough understanding of.

> Gamera4 still has minimal maintenance, so distutils might still be taken care of.

If you look at the corresponding issue there, the future is not really clear. I just started fixing some deprecated stuff to test Python 3.11 compatibility, but especially with distutils, the migration path for some functionality is not clear even to the upstream developers.

> We might inspect the dependencies of didjvu on Gamera-4. As far as my attention has been concerned until now the main dependency is on the djvu-binarizer. Do you actively use any of those other binarizers?

I just looked through the code of didjvu: It seems like the only important imports are from gamera.plugins.threshold and gamera.plugins.binarization, while I only use the default djvu_threshold implementation in my cases. But the didjvu stuff is out of scope here anyway.

FriedrichFroebel commented 2 years ago

I just did my first real test with the aggregated Python 3 port. Apart from the fact that the requirements.txt file is missing the regex and future modules, my tests worked without any issues.

rmast commented 2 years ago

Nice.

We did those upgrade activities to make those repos survive the deprecation of python 2.7, and I'm glad they do.

rmast commented 2 years ago

The main reason for deprecating repos such as the jwilk and Gamera ones seems to be that Python is a moving target, which is too tedious to follow.

I wonder whether converting them to a more solid language would be able to preserve them.

It's probably too much of an effort right now, but there are AI solutions for translating Python to C++ or Java nowadays:

https://morioh.com/p/81aa0e33b28a (Convert Python code to Java & C++ with AI Code Translator by Facebook - Morioh)

FriedrichFroebel commented 2 years ago

This is still a matter of taste and of the actual code base. If the code has been modernized, there should not be any real issues for plain Python code. The biggest problems mostly arise from Python 2 code which has been made compatible with Python 3, but never actually modernized. In my experience, Python 3 tends to be rather stable, except that its C/C++ APIs might change (as we see for gamera-4). For this reason, maintaining Python code should mostly be easy enough.
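
To illustrate the difference between "made compatible" and "modernized", here is a small sketch. It is not taken from ocrodjvu or didjvu; the helper names `to_text_compat` and `to_text` are hypothetical and only show the general pattern.

```python
import sys

# Compatibility style often left behind by a mechanical 2-to-3 port:
# version checks and aliases that only matter on Python 2.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only defined on Python 2)


def to_text_compat(value, encoding='UTF-8'):
    """Return text, decoding bytes if needed (2/3-compatible style)."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


# Modernized equivalent once Python 2 support is dropped: no aliases,
# no version checks, and type hints document the accepted inputs.
def to_text(value: 'str | bytes', encoding: str = 'UTF-8') -> str:
    """Return text, decoding bytes if needed (Python 3 only)."""
    return value if isinstance(value, str) else value.decode(encoding)
```

Both helpers behave the same on Python 3; the second simply has less dead code to maintain.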