jwilk / didjvu

DjVu encoder with foreground/background separation
https://jwilk.net/software/didjvu
GNU General Public License v2.0

Please move to Python 3 #13

Open markuschaaf opened 5 years ago

markuschaaf commented 5 years ago

Python 2 will be EOL end of 2019. Distributions will stop shipping it. https://pythonclock.org/

jwilk commented 5 years ago

I don't have any plans to port didjvu to Python 3. Python 2 is a fine language and the motions to remove it from distros are ill-advised.

FriedrichFroebel commented 5 years ago

For me it seems like porting didjvu to Python 3 or making it compatible with Python 2 and Python 3 should be relatively easy once gamera supports Python 3 as well (see https://github.com/hsnr-gamera/gamera/issues/19).

blaueente commented 4 years ago

@jwilk wrote:

I don't have any plans to port didjvu to Python 3. Python 2 is a fine language and the motions to remove it from distros are ill-advised.

Python 2 is not maintained anymore regarding security. This means distros do not have a choice.

mara004 commented 4 years ago

Gamera developer @cdalitz says that the main branch has already been ported completely to Python 3 (https://github.com/hsnr-gamera/gamera-4), however it is marked as 'experimental' in the description and it doesn't seem to have an official release yet. @jwilk Could you please consider porting didjvu to Python 3 anyway? Python 2 is rarely used nowadays, and, as @blaueente pointed out, all major distributions are about to remove it or have done so already because they have to regarding security. I barely know of any other reasonably popular program which is still maintained and deliberately keeps using Python 2 ...

cdalitz commented 4 years ago

Concerning the python 3 port of Gamera (gamera-4), this is indeed finished. It is nevertheless still marked as "experimental", because it is not extensively tested. As I no longer use Gamera myself in any of the projects that I currently work on, I do not have the opportunity to test and fix it. Thus, if someone finds any bugs, patches for fixing them are highly welcome.

mara004 commented 4 years ago

@cdalitz Okay, thanks for clarifying!

jsbien commented 3 years ago

I understand a virtual environment for Python 2 can be created on e.g. stable Debian and the program run inside it. I would appreciate foolproof instructions on how to actually do it.

mara004 commented 3 years ago

@jsbien Python 2 is still available as official Debian package up to sid, so you probably don't have to worry about Python 2 for (at least) the next 5 years if you're on Debian. I'm not sure why they decided to keep Python 2 so long, though - an unmaintained programming language interpreter is a rather big security risk after all.

In general I think it might be better just not to use the djvu format anymore. The vast majority of djvu software is unmaintained, and outside the linux/bsd scope there are very few programs left that can open djvu at all. You can also achieve good compression ratios with PDF, which is a much more compatible format.

jsbien commented 3 years ago

@mara004 As for DjVu: djview4 and djvulibre are very well maintained, and new software is being created, e.g. https://github.com/trufanov-nok/minidjvu-mod/. For me the compression ratio is the least important feature of DjVu; it has a lot of other advantages, which are demonstrated by our tools such as https://github.com/jsbien/djview4shapes and https://bitbucket.org/mrudolf/djview-poliqarp. Their use is demonstrated e.g. by https://github.com/jsbien/iLindeCSV and https://github.com/jsbien/Zaborowski-index4djview.

mara004 commented 3 years ago

I won't deny there is still some active djvu software, but it seems most of it is rather intended for research than for practical use. Development of djvulibre has been slowing down a lot, and the djvu format is barely used compared to PDF or TIFF. Since most macOS, Windows or mobile users won't be able to open djvu, it is also very unsuitable for sharing.

cdalitz commented 3 years ago

At least Gamera has been ported to Python 3 (use the Gamera 4 version). If you encounter any problems with Gamera under Python 3, please consider filing a bug report there. This should thus not be an obstacle to porting didjvu to Python 3, I think.

jsbien commented 3 years ago

I made some experiments with Gamera 4 and encountered no problems. Bastien Roucariès, who already ported ocrodjvu to Python 3, suggested "shotgun porting" of didjvu:

Use the testsuite, and the automatic conversion tool from python porting. Fix every bug that show during test suite and voila. It take me two your to fix the previous package.

Anybody willing to try this approach?

FriedrichFroebel commented 3 years ago

I just had a look at porting didjvu to Python 3, with the following issues arising:

rmast commented 3 years ago

@FriedrichFroebel wrote:

I just had a look at porting didjvu to Python 3

I don't see your fork?

rmast commented 3 years ago

@mara004 I agree PDF is much more common, and I guess if you put the MRC-djvu result of didjvu through DjVuToy to translate it to PDF it will not be much bigger with JP2000 instead of the FG44 IW44 image masked by the JBIG2 in a similar way as is done in a multilayer DjVu.

FriedrichFroebel commented 3 years ago

@rmast I did this on an old clone of this repository back then for testing and realized the aforementioned porting issues, so I did not upload these changes to GitHub. The incomplete/partially broken Python 3 port is now available in my fork.

rmast commented 3 years ago

Thanks! I have too little Python-experience to do the full port myself, but I can focus on details.

I compiled Gamera-4 yesterday and ran 2to3 on didjvu, but I already got stuck on some arguments, which seems to be a quite standard porting issue, however I don’t know a site to look up all porting-errors and corresponding fixes. Do you know how to operate libcst to do most of the work?

rmast commented 3 years ago

The fork of @FriedrichFroebel just does the job in python3.8 on Mint 20.2 when I run didjvu encode, after compiling and installing Gamera-4 without wx. https://github.com/hsnr-gamera/gamera-4

I don't know how to call it to reproduce the issues that @FriedrichFroebel thought were still there?

Edit: I found it: run make test

It only gives a test-issue with tests.test_gamera.test_to_pil_rgb.test_color. So the output has to be judged to be able to point to the right repo to solve it.

rmast commented 3 years ago

@FriedrichFroebel

This shows the way to see the ycbcr-jpeg.tiff contains a given colorspace:

exiftool -S -PhotometricInterpretation didjvu/tests/data/ycbcr-jpeg.tiff
PhotometricInterpretation: YCbCr

Both In.jpg and Out.jpg appear not to have any colorspace information:

od -A x -t x1z -v out.jpg

gives no APP* or whatever segment markers popping up in the right column as described here: https://en.wikipedia.org/wiki/JPEG_File_Interchange_Format

So also no Adobe APP14 marker which could distinguish between RGB and YCbCr. https://stackoverflow.com/questions/50798014/determining-color-space-for-jpeg/50861048

However, the default color space for JPEG is YCbCr.
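The marker check described above can also be done in Python instead of reading `od` output by eye. What follows is only a minimal sketch under stated assumptions, not code from didjvu or Gamera: `list_jpeg_segments` and `describe_color_hints` are hypothetical helper names, the walker only looks at header segments before the start-of-scan marker, and it assumes every segment it meets carries a length field.

```python
import struct


def list_jpeg_segments(data):
    """Return (marker, payload) pairs for the header segments of a JPEG stream.

    Hypothetical helper: walks segments from just after SOI up to SOS.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before the real marker, skip it
            pos += 1
            continue
        if marker == 0xDA:  # SOS: compressed scan data follows, stop here
            break
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((marker, data[pos + 4:pos + 2 + length]))
        pos += 2 + length
    return segments


def describe_color_hints(data):
    """List the segments that carry a hint about the color space."""
    hints = []
    for marker, payload in list_jpeg_segments(data):
        if marker == 0xE0 and payload.startswith(b"JFIF\x00"):
            hints.append("JFIF APP0")
        elif marker == 0xEE and payload.startswith(b"Adobe"):
            # Layout: "Adobe", version (2), flags0 (2), flags1 (2), transform (1)
            transform = payload[11] if len(payload) > 11 else None
            hints.append("Adobe APP14, transform=%r" % transform)
    return hints
```

Calling `describe_color_hints(open("out.jpg", "rb").read())` on a file without APP0 or APP14 segments should give an empty list, which would match the `od` observation. In the Adobe segment, a transform value of 0 is commonly read as RGB and 1 as YCbCr.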

The documentation of to_pil says it only supports RGB and Grayscale. So putting in a YCbCr image probably already leads to an undefined situation.

The tested code seems to try to replace some Gamera-bugs, or try to speed up to_pil with a custom to_pil_rgb.

They might have a history in the commits that tells more about what happened and why they're introduced in the first place.

FriedrichFroebel commented 3 years ago

Glad to see that the port is working, as I have only used the tests before (after 2to3 conversion and some manual fixes). While I have no clean solution for the aforementioned issues (https://github.com/jwilk/didjvu/issues/13#issuecomment-918048197), I do not feel like a PR makes sense - besides the fact that Python 3 support does not seem to be considered useful by upstream.

I am clearly not an expert on the colorspace stuff, so there is not much I can say about it. The commit history for the Gamera support does not seem to tell us much about it as well: https://github.com/jwilk/didjvu/commit/2337b8fe7429ba99206d1e971c76a9f2f3686f48, https://github.com/jwilk/didjvu/commit/fdd6bf9e069a012eb446dba14c228e414cb44213.

rmast commented 3 years ago

@FriedrichFroebel As I read the commits you pointed at, it might be just an optimization step that made the assumption of RGB necessary, while most real-life images are usually YCbCr. Reverting exactly those two commits drops the assumption of RGB, and even the failing test that comes with it.

Edit: unfortunately the program then fails because the input picture is not RGB.

The only thing that should be thoroughly tested then is behavior with source images of different color spaces, however usually images in scanned input will behave consistently, so if the colorspace fails someone will know at first try.

A PR is not necessary at the moment, as the Ubuntu 18 trick of getting the old, dropped python-gamera package to work on Ubuntu 20 with Python 2.7 is still valid.

As soon as a valid python-gamera package is not reachable that way anymore because some dependencies of the Ubuntu 18 package get upgraded @jwilk will have to decide how to keep the didjvu usable. That might be with the introduction of Ubuntu 22.04 LTS next year, which might even raise the bar further on the supported Python-version, and deprecate 2to3.

The package maintainer of Debian has abandoned python-gamera, as has its maintainer.

gamera-4 might get out of Alpha at some moment, that would be the moment to put effort in the upstream again, and probably even put effort in getting gamera-4 back in the debian packages.

I committed some python3-changes to my fork of the python3 branch as well, for getting the 'bundle' function to work properly.

I also made another branch for supporting minidjvu-mod with the -2 parameter to call when --pages-per-dict > 1.

However, even with minidjvu-mod in place I see only a small reduction of the size. The resulting djvu-filesize is still way bigger than I would expect from DjVuSolo 3.1. When I scan a letter with a colored logo, an autograph and some colored text on the bottom there is mostly lots of blur on the background-picture, but it takes way too much space in the djvu.

I studied DjVuSolo 3.1, it behaves differently with different content, optimizing away layers that practically don't contain useful information, but use an FGBZ instead of a FG44. I saw blur on the background picture behind the JB2 foreground-mask. The official DjVu uses cheap to compress content behind the foreground mask as it will not be shown.

rmast commented 3 years ago

I just witnessed a case where the colorspace issue appeared with a posterized 8 color .png as input in the Python3.8 version, so the issue isn't only appearing in the test. I'll have to further investigate how to solve it and watch whether also the failing test will get solved with a solution.

Here a suggestion to use OpenCV for the conversion: https://stackoverflow.com/questions/62293077/why-is-pils-image-fromarray-distorting-my-image-color https://note.nkmk.me/en/python-opencv-bgr-rgb-cvtcolor/ https://www.ccoderun.ca/programming/doxygen/opencv/group__imgproc__color__conversions.html

But before conversion you should know what colorspace is used in the image. This hint is probably the direction to look in: https://stackoverflow.com/questions/50641637/identify-colour-space-of-any-image-if-icc-profile-is-empty-pil

rmast commented 3 years ago

The issue with the tested image is filed at Gamera-4: https://github.com/hsnr-gamera/gamera-4/issues/35

The issue with the posterized/palettized image can be solved by allowing mode P for PNG.
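Whether a PNG is palettized (what Pillow reports as mode `P`) can be read straight from the color type byte in its IHDR chunk, without decoding the image. This is a minimal standard-library sketch; `png_color_type` is a hypothetical helper, not something didjvu or Gamera provides, and it does not verify the chunk CRC.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Color type values as defined for the IHDR chunk.
COLOR_TYPES = {
    0: "grayscale",
    2: "RGB",
    3: "palette",
    4: "grayscale+alpha",
    6: "RGB+alpha",
}


def png_color_type(data):
    """Return a readable name for the color type stored in a PNG's IHDR chunk."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    # First chunk header: 4-byte length, 4-byte type; IHDR data is 13 bytes.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR")
    # IHDR starts with width, height, bit depth, color type.
    _width, _height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return COLOR_TYPES.get(color_type, "unknown")
```

An image for which this returns `"palette"` is the kind that triggers the issue above. With Pillow, such an image can be expanded with `Image.convert("RGB")` before further processing; whether the fork does exactly that is not stated in this thread.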

rmast commented 3 years ago

@FriedrichFroebel, please take my Gamera-4 patch on the master branch of my fork to solve the to_pil_rgb issue: https://github.com/rmast/gamera-4/commit/2d9877a084cdac45c5b555aa6574eead4747eb67

I didn't see anything wrong with the color conversion, only copy routines that were ported in an overly elaborate and therefore buggy way. They were probably only meant to work around the string-to-bytes issue and possibly a memory leak, but the commit messages involved didn't state the exact reason for those changes.

When I run make test on didjvu with my new version of the python3-branch I run into issues with test_xmp.py, which attempts to use a deprecated way to del an imported module. As you are more experienced with Python, could you have a look?

FriedrichFroebel commented 3 years ago

@rmast Are you sure the module problem is really related to your Gamera change and not to any of the three XMP backends (if I remember correctly, I did not install all of them for testing)? Which backend is this about? Do you have a specific error message I can use to have a look at it?

rmast commented 3 years ago

Friedrich,

I don’t expect the issue to be related to Gamera at all. I guess solving the errors of some tests makes some other tests appear that were previously not visible. If you run make test with my python3-branch within a minute I expect the errors I meant rolling over your screen.

FriedrichFroebel commented 3 years ago

@rmast I have been running each of the test files directly on its own, so I would not expect to see any change on it with your patch (with the same OS and Python as in your case). For this reason I asked which XMP backend libraries you have installed, as I used only one as far as I remember (probably python-xmp-toolkit if I am not mistaken; pyexiv2 does not work on Python 3 anyway if I recall correctly), where no problems arose for the XMP tests. So the backend and the specific traceback should help here until I am able to have another look at the code in the next days if I am able to reproduce your issue.

rmast commented 3 years ago

I work on Mint 20.2. I just did apt-get upgrade to make it easy to follow. These are the specs of my VMware (virtual) x64 machine: [image]

Lots of details of apt and pip install: systeem.txt

I installed python-xmp-toolkit, it wasn't installed, but it didn't result in any difference.

This is the exact error at the end of make test:

Failure: NameError (name 'name' is not defined) ... ERROR

======================================================================
ERROR: Failure: NameError (name 'name' is not defined)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "/usr/lib/python3/dist-packages/nose/loader.py", line 416, in loadTestsFromName
    module = self.importer.importFromPath(
  File "/usr/lib/python3/dist-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/usr/lib/python3/dist-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/usr/lib/python3.8/imp.py", line 234, in load_module
    return load_source(name, filename, file)
  File "/usr/lib/python3.8/imp.py", line 171, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 702, in _load
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/robert/didjvu/tests/test_xmp.py", line 59, in <module>
    del name
NameError: name 'name' is not defined

----------------------------------------------------------------------
Ran 122 tests in 8.271s

FAILED (errors=1)
make: *** [Makefile:53: test] Fout 1

This is the programtext in that file:

xmp_backends = [
    import_backend(name)
    for name in [
        'gexiv2',
        'libxmp',
        'pyexiv2',
    ]
]
del name  # this is line 59.
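The NameError comes from a scoping change: in Python 2 the loop variable of a list comprehension leaked into the enclosing scope, so `del name` was needed to clean it up, while in Python 3 the comprehension has its own scope and `name` never exists at module level. The sketch below reproduces this in isolation. `name.upper()` is only a stand-in for the real `import_backend(name)` call, `run_as_module` is a hypothetical helper, and the loop variant is just one possible fix (simply dropping `del name` would also work), not necessarily what the fork does.

```python
# Module-level source in the style of tests/test_xmp.py.
PY2_STYLE_SOURCE = """
xmp_backends = [name.upper() for name in ['gexiv2', 'libxmp', 'pyexiv2']]
del name
"""

# A plain loop binds `name` in the module scope on Python 2 and 3 alike,
# so the trailing `del name` is valid on both.
LOOP_SOURCE = """
xmp_backends = []
for name in ['gexiv2', 'libxmp', 'pyexiv2']:
    xmp_backends.append(name.upper())
del name
"""


def run_as_module(source):
    """Execute source like a module body; return the NameError text or None."""
    namespace = {}
    try:
        exec(source, namespace)
    except NameError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(run_as_module(PY2_STYLE_SOURCE))  # the NameError message on Python 3
    print(run_as_module(LOOP_SOURCE))       # None: nothing was raised
```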

If I try to run the single test I get:

robert@robert-virtual-machine:~/didjvu/tests$ python3 test_xmp.py

Traceback (most recent call last):
  File "test_xmp.py", line 22, in <module>
    from .tools import (
ImportError: attempted relative import with no known parent package

Edit: I had no libxmp-dev installed via apt. That makes the python2.7 version completely skip the libxmp tests. Does your version just skip the XMP tests as well? Installing libxmp-dev doesn't change the error.

What is XMP? How much of it should be retained in the new, upgraded Python 3 version?

rmast commented 3 years ago

@FriedrichFroebel I forgot to tag you in above message with all details.

FriedrichFroebel commented 3 years ago

@rmast Seems like I never actually ran the XMP tests beforehand - now I could actually reproduce your issue about the undefined variable name, as well as some further scope issues. These should be fixed now. For the Gamera issue, I have not yet compiled your patched version and therefore not tested it (I might do this if your PR gets merged).

From the docs of the python-xmp-toolkit module:

Python XMP Toolkit is a library for working with XMP metadata, as well as reading/writing XMP metadata stored in many different file formats.

Python XMP Toolkit is wrapping Exempi (using ctypes), a C/C++ XMP library based on Adobe XMP Toolkit, ensuring that future updates to the XMP standard are easily incorporated into the library with a minimum amount of work.

Wikipedia has some more information: https://en.wikipedia.org/wiki/Extensible_Metadata_Platform

I am not sure whether all three backends should be kept. With my latest changes, python-xmp-toolkit works fine, but py3exiv2 (the Python 3 port of pyexiv2) fails and would need some additional work. If anyone wants to fix this, feel free to submit a PR to my fork. (I did not yet test the gexiv2 backend, so no idea if it works out of the box. It might be worth to set up GitHub Actions here to simplify such tests.)

By the way: Running only one test module can be done with a modified version of the implementation of the make test command: python didjvu --test --verbose tests/test_xmp.py.

rmast commented 3 years ago

@FriedrichFroebel Yes! All tests run fine now on your python3 branch when I set the default Python to 3.8 and uninstall all XMP packages via apt. I've only issued a PR to your python3 branch for 3 write lines in djvu_support.py that need an .encode() for the bundle flow. I'm curious if there is anyone that would bother about pyexiv2. Would it withhold Jakub from doing the upgrade to keep enough attention to this repo?
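The `.encode()` fix follows a common Python 3 porting pattern: a file opened in binary mode accepted `str` in Python 2 but only accepts `bytes` in Python 3. The sketch below shows the pattern in isolation and assumes nothing about the actual code in djvu_support.py; `write_page_list` and `str_write_is_rejected` are hypothetical helpers, and UTF-8 is an assumed encoding.

```python
import tempfile


def write_page_list(paths):
    """Write one path per line to a binary temp file and return what was written."""
    # NamedTemporaryFile opens in binary mode ('w+b') by default.
    with tempfile.NamedTemporaryFile() as tmp:
        for path in paths:
            tmp.write(path.encode("utf-8"))  # str -> bytes at the I/O boundary
            tmp.write(b"\n")
        tmp.flush()
        tmp.seek(0)
        return tmp.read()


def str_write_is_rejected():
    """Return True if writing str to a binary file raises TypeError (Python 3)."""
    with tempfile.NamedTemporaryFile() as tmp:
        try:
            tmp.write("page1.djvu\n")
        except TypeError:
            return True
    return False
```

Encoding only at the point where text meets a binary stream keeps the rest of the code working with ordinary `str` values.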

rmast commented 3 years ago

My PR at hsnr-gamera/gamera-4 has just been merged into master!

rmast commented 3 years ago

@FriedrichFroebel I was looking for code test coverage, but see there is some code coverage statistic in the source tree: tests/coverage

I bet this shows the code that has no test coverage. So all those lines have to be inspected to see whether they need upgrading, for example for the issue of writing bytes instead of strings.

rmast commented 3 years ago

Yes! The lines your new test covers don't show up in the code coverage anymore; however, the bytes issue also shows up in the standard coverage package. I don't know if I solved it right, but private/update-coverage now runs:

diff coverage statistics.txt
diff private update-coverage.txt

"~/.local/lib/python3.8/site-packages/coverage/summary.py" line 30:

self.outfile.write(line.rstrip().encode())
self.outfile.write(b"\n")

FriedrichFroebel commented 3 years ago

@rmast It actually is much simpler for Python3-only code: Just use report_stream = plugin.stream = io.StringIO() instead of the current BytesIO(). I have fixed this in the fork.
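The difference between the two fixes can be shown with a small stand-in for a report writer. This is only a sketch: `render_report` is a hypothetical function that writes text lines the way a summary reporter might, not the coverage package's actual code.

```python
import io


def render_report(lines, stream):
    """Write stripped text lines to stream and return everything written so far."""
    for line in lines:
        stream.write(line.rstrip())  # writes str, so the stream must be text
        stream.write("\n")
    return stream.getvalue()


if __name__ == "__main__":
    # A text stream accepts the str lines as they are.
    print(render_report(["Name    Stmts", "TOTAL     122"], io.StringIO()))

    # A bytes stream rejects them on Python 3 unless every write is encoded.
    try:
        render_report(["Name    Stmts"], io.BytesIO())
    except TypeError as exc:
        print("BytesIO rejected str:", exc)
```

Switching the sink to `io.StringIO()` leaves the writer untouched, whereas keeping `BytesIO()` means adding `.encode()` to every write call.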

rmast commented 2 years ago

@mara004 wrote

You can also achieve good compression ratios with PDF and lossless JBIG2 encoding, for example. The PDF format has the clear advantage of much better compatibility.

Ever seen this project? https://github.com/internetarchive/archive-pdf-tools Unfortunately it doesn't work without provided hocr-file, with an open issue: https://github.com/internetarchive/archive-pdf-tools/issues/11

mara004 commented 2 years ago

@rmast

Ever seen this project? https://github.com/internetarchive/archive-pdf-tools

I didn't know this yet, but it's highly interesting. I wonder whether the author of OCRmyPDF knows about archive-pdf-tools.

rmast commented 2 years ago

I doubt it. I want to investigate how good it is; it probably only supports the documented happy flow. It chokes with complex Python errors when some of those parameters are left out.

mara004 commented 2 years ago

I doubt it. I want to investigate how good it is; it probably only supports the documented happy flow. It chokes with complex Python errors when some of those parameters are left out.

You mean the project claims a reliability it does not offer?

rmast commented 2 years ago

I’ve not seen any reliability claim for general use. Only its name, the Internet Archive, does promise some professional quality for the happy flow:

“While the code is already being used internally to create PDFs at the Internet Archive, the code still needs more documentation and cleaning up, so don't expect this to be super well documented just yet.”

rmast commented 2 years ago

@mara004: I've found an issue with the very first run that succeeded on a multipage scanned PDF. The background contains fuzz from the partial pixels, just as with djvumake. C44 performs better. The suggestion at the bottom of the repo https://github.com/internetarchive/archive-pdf-tools#examining-the-results would probably only be handy with a manual review step in a workflow, as done in gscan2pdf and scantailor.

rmast commented 2 years ago

We should probably try to get it working on Python 3.10 as well: https://github.com/jwilk/python-djvulibre/issues/13

FriedrichFroebel commented 2 years ago

@rmast I am currently on Python 3.8.10 due to my distro, so no way to directly check it (leaving GitHub Actions aside). But it seems like didjvu uses subprocess calls in the corresponding djvulibre wrapper (https://github.com/jwilk/didjvu/blob/master/lib/djvu_support.py) instead of the native wrapper, so it should work in theory.

rmast commented 2 years ago

This instruction reveals Python3.10.2 at the moment: https://computingforgeeks.com/how-to-install-python-on-ubuntu-linux-system/

rmast commented 2 years ago

This instruction allows switching between default Python-versions: https://stackoverflow.com/questions/43062608/how-to-update-alternatives-to-python-3-without-breaking-apt

rmast commented 2 years ago

This non-LTS Ubuntu distro 21.04 has 3.10 in the package manager: https://packages.ubuntu.com/hirsute/python3.10-distutils

rmast commented 2 years ago

I fixed a Python3.10 issue in Gamera-4: https://github.com/hsnr-gamera/gamera-4/pull/39 and a Python3.9 issue in didjvu: https://github.com/rmast/didjvu/tree/python3.9

With Python3.9 there still are some new gi-import-warnings with test_xmp

rmast commented 2 years ago

I've now seen all tests run Ok in Ubuntu 22.04 with these extra packages and my python3.9 branch.

sudo apt install python3-pip gir1.2-gexiv2-0.10 libexempi-dev libboost-python-dev libexiv2-dev libpng-dev libtiff-dev djvulibre-bin exiv2 python3-pil

pip install py3exiv2
pip install python-xmp-toolkit
pip install nose

Friedrich sees room for improvement of my GExiv2-fix. But I think we're near a viable Python3.10 version for the coming Ubuntu 22.04

andyjpb commented 13 hours ago

Hi, I'm interested in running didjvu on Debian again.

I have upgraded my last machine from buster and the packages are no longer available, presumably because of no python2.7 support.

https://github.com/FriedrichFroebel/didjvu/issues/5#issuecomment-1044544118 Says that Gamera 4 is now officially released.

Does anyone have any patches I can try to give the python3 port a test?

I'm happy to build it myself but I would appreciate some pointers around which dependencies I need and whether I need to build those myself as well (it looks like I'll need to at least build Gamera 4).

Thanks for any pointers you can give me.

jsbien commented 12 hours ago

https://github.com/FriedrichFroebel/didjvu This repository is a port of the original repository to Python 3.