Open markuschaaf opened 5 years ago
I don't have any plans to port didjvu to Python 3. Python 2 is a fine language and the motions to remove it from distros are ill-advised.
For me it seems like porting didjvu to Python 3 or making it compatible with Python 2 and Python 3 should be relatively easy once gamera supports Python 3 as well (see https://github.com/hsnr-gamera/gamera/issues/19).
I don't have any plans to port didjvu to Python 3. Python 2 is a fine language and the motions to remove it from distros are ill-advised.
Python 2 is not maintained anymore regarding security. This means distros do not have a choice.
Gamera developer @cdalitz says that the main branch has already been ported completely to Python 3 (https://github.com/hsnr-gamera/gamera-4), however it is marked as 'experimental' in the description and it doesn't seem to have an official release yet. @jwilk Could you please consider porting didjvu to Python 3 anyway? Python 2 is rarely used nowadays, and, as @blaueente pointed out, all major distributions are about to remove it or have done so already because they have to regarding security. I barely know of any other reasonably popular program which is still maintained and deliberately keeps using Python 2 ...
Concerning the python 3 port of Gamera (gamera-4), this is indeed finished. It is nevertheless still marked as "experimental", because it is not extensively tested. As I no longer use Gamera myself in any of the projects that I currently work on, I do not have the opportunity to test and fix it. Thus, if someone finds any bugs, patches for fixing them are highly welcome.
@cdalitz Okay, thanks for clarifying!
I understand a virtual environment for Python 2 can be created on e.g. stable Debian and the program run inside it. I would appreciate foolproof instructions on how to actually do it.
@jsbien Python 2 is still available as official Debian package up to sid, so you probably don't have to worry about Python 2 for (at least) the next 5 years if you're on Debian. I'm not sure why they decided to keep Python 2 so long, though - an unmaintained programming language interpreter is a rather big security risk after all.
In general I think it might be better just not to use the djvu format anymore. The vast majority of djvu software is unmaintained, and outside the linux/bsd scope there are very few programs left that can open djvu at all. You can also achieve good compression ratios with PDF, which is a much more compatible format.
@mara004 As for DjVu: djview4 and djvulibre are very well maintained, and new software is being created, e.g. https://github.com/trufanov-nok/minidjvu-mod/. For me the compression ratio is the least important feature of DjVu; it has a lot of other advantages, which are demonstrated by our tools such as https://github.com/jsbien/djview4shapes and https://bitbucket.org/mrudolf/djview-poliqarp. Their use is demonstrated e.g. by https://github.com/jsbien/iLindeCSV and https://github.com/jsbien/Zaborowski-index4djview.
I won't deny there is still some active djvu software, but it seems most of it is rather intended for research than for practical use. Development of djvulibre has been slowing down a lot, and the djvu format is barely used compared to PDF or TIFF. Since most macOS, Windows or mobile users won't be able to open djvu, it is also very unsuitable for sharing.
At least Gamera has been ported to Python 3 (use the Gamera 4 version). If you encounter any problems with Gamera under Python 3, please consider filing a bug report there. This should thus not be an obstacle to porting didjvu to Python 3, I think.
I made some experiments with Gamera 4 and encountered no problems. Bastien Roucariès, who already ported ocrodjvu to Python 3, suggested "shotgun porting" of didjvu:
Use the testsuite, and the automatic conversion tool from python porting. Fix every bug that show during test suite and voila. It take me two your to fix the previous package.
Anybody willing to try this approach?
I just had a look at porting didjvu to Python 3, with the following issues arising:
didjvu.tests.test_utils.test_enhance_import can be fixed as done for ocrodjvu, but this will not really enhance the error message any more.
didjvu.lib.cli.ArgumentParser.parse_args does not work as before and seems to require additional handling for the missing fg_bg_defaults attribute if no parameters are set.
didjvu.tests.test_gamera.test_to_pil_rgb.test_color fails. There is something wrong with the output, although I do not know whether this is an issue in gamera-4 or didjvu: https://user-images.githubusercontent.com/7279752/133065608-e7a6089b-af24-46b6-8cce-6d3bf60bc5eb.png (standalone version being compatible to Python 3: ycbcr-jpeg.py)
@FriedrichFroebel wrote:
I just had a look at porting didjvu to Python 3
I don't see your fork?
@mara004 I agree PDF is much more common. I guess that if you put the MRC DjVu result of didjvu through DjVuToy to translate it to PDF, it will not be much bigger: it would use JPEG 2000 instead of the FG44 IW44 image, masked by JBIG2 in a similar way as is done in a multilayer DjVu.
@rmast I did this on an old clone of this repository back then for testing and realized the aforementioned porting issues, so I did not upload these changes to GitHub. The incomplete/partially broken Python 3 port is now available in my fork.
Thanks! I have too little Python-experience to do the full port myself, but I can focus on details.
I compiled Gamera-4 yesterday and ran 2to3 on didjvu, but I already got stuck on some arguments, which seems to be a quite standard porting issue, however I don’t know a site to look up all porting-errors and corresponding fixes. Do you know how to operate libcst to do most of the work?
The fork of @FriedrichFroebel just does the job in python3.8 on Mint 20.2 when I run didjvu encode, after compiling and installing Gamera-4 without wx. https://github.com/hsnr-gamera/gamera-4
I don't know how to call it to reproduce the issues that @FriedrichFroebel thought were still there?
Edit: I found it: run make test
It only reports one failing test, tests.test_gamera.test_to_pil_rgb.test_color. So the output has to be examined to decide which repo the fix belongs in.
@FriedrichFroebel
This shows the way to see the ycbcr-jpeg.tiff contains a given colorspace:
exiftool -S -PhotometricInterpretation didjvu/tests/data/ycbcr-jpeg.tiff
PhotometricInterpretation: YCbCr
Both In.jpg and Out.jpg appear not to have any Colorspace information:
od -A x -t x1z -v out.jpg
gives no APP* or whatever segment markers popping up in the right column as described here: https://en.wikipedia.org/wiki/JPEG_File_Interchange_Format
So also no Adobe APP14 marker which could distinguish between RGB and YCbCr. https://stackoverflow.com/questions/50798014/determining-color-space-for-jpeg/50861048
However, the default color space for JPEG is YCbCr.
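The manual od inspection above can also be scripted. The following is only a minimal sketch: the function name jpeg_app_segments is made up for illustration, and the bytes it is run on here are a synthetic header, not the real In.jpg or Out.jpg. It walks the header segments of a JPEG stream and lists the APPn markers it finds, so a missing JFIF (APP0) or Adobe (APP14) segment shows up as an empty list.

```python
import struct


def jpeg_app_segments(data):
    """List (marker name, identifier) for every APPn segment before the scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker, not a JPEG stream")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:          # fill byte in front of a marker
            pos += 1
            continue
        if marker in (0xDA, 0xD9):  # SOS or EOI: the header segments end here
            break
        # The two-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if 0xE0 <= marker <= 0xEF:  # APP0 .. APP15
            payload = data[pos + 4:pos + 2 + length]
            identifier = payload.split(b"\x00", 1)[0].decode("ascii", "replace")
            segments.append(("APP%d" % (marker - 0xE0), identifier))
        pos += 2 + length
    return segments


def _segment(marker, payload):
    """Build one synthetic segment (helper for the demonstration only)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


# Synthetic header with a JFIF and an Adobe segment, followed by a dummy table.
synthetic = (
    b"\xff\xd8"
    + _segment(0xE0, b"JFIF\x00" + bytes(9))
    + _segment(0xEE, b"Adobe" + bytes([0, 100, 0, 0, 0, 0, 1]))
    + _segment(0xDB, bytes(65))
    + b"\xff\xda\x00\x02"
)
markers = jpeg_app_segments(synthetic)

# A header without any APPn segment, like the files described above.
bare = jpeg_app_segments(b"\xff\xd8" + _segment(0xDB, bytes(65)) + b"\xff\xda\x00\x02")
```

For a real file, the call would be jpeg_app_segments(open('out.jpg', 'rb').read()). An empty result matches the observation that neither a JFIF nor an Adobe segment is present.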
The documentation of to_pil says it only supports RGB and Grayscale. So putting in a YCbCr image probably already leads to an undefined situation.
The tested code seems to try to work around some Gamera bugs, or to speed up to_pil with a custom to_pil_rgb.
The commit history might tell more about what happened and why these changes were introduced in the first place.
Glad to see that the port is working, as I have only used the tests before (after 2to3 conversion and some manual fixes). While I have no clean solution for the aforementioned issues (https://github.com/jwilk/didjvu/issues/13#issuecomment-918048197), I do not feel like a PR makes sense - besides the fact that Python 3 support does not seem to be considered useful by upstream.
I am clearly not an expert on the colorspace stuff, so there is not much I can say about it. The commit history for the Gamera support does not seem to tell us much about it as well: https://github.com/jwilk/didjvu/commit/2337b8fe7429ba99206d1e971c76a9f2f3686f48, https://github.com/jwilk/didjvu/commit/fdd6bf9e069a012eb446dba14c228e414cb44213.
@FriedrichFroebel As I read the commits you pointed at, it might be just an optimization step that made the assumption of RGB necessary, while most real-life images are usually YCbCr. Reverting exactly those two commits drops the assumption of RGB, and with it even the failing test.
Edit: unfortunately the program then fails because the input picture is not RGB.
The only thing that should be thoroughly tested then is behavior with source images of different color spaces, however usually images in scanned input will behave consistently, so if the colorspace fails someone will know at first try.
A PR is not necessary at the moment, as the Ubuntu 18 trick of getting the old dropped python-gamera package to work on Ubuntu 20 with Python 2.7 is still valid.
As soon as a valid python-gamera package is not reachable that way anymore because some dependencies of the Ubuntu 18 package get upgraded, @jwilk will have to decide how to keep didjvu usable. That might be with the introduction of Ubuntu 22.04 LTS next year, which might even raise the bar further on the supported Python version, and deprecate 2to3.
The package maintainer of Debian has abandoned python-gamera, as has its maintainer.
gamera-4 might get out of Alpha at some moment, that would be the moment to put effort in the upstream again, and probably even put effort in getting gamera-4 back in the debian packages.
I committed some python3-changes to my fork of the python3 branch as well, for getting the 'bundle' function to work properly.
I also made another branch for supporting minidjvu-mod with the -2 parameter to call when --pages-per-dict > 1.
However, even with minidjvu-mod in place I see only a small reduction of the size. The resulting djvu-filesize is still way bigger than I would expect from DjVuSolo 3.1. When I scan a letter with a colored logo, an autograph and some colored text on the bottom there is mostly lots of blur on the background-picture, but it takes way too much space in the djvu.
I studied DjVuSolo 3.1. It behaves differently with different content: it optimizes away layers that contain practically no useful information and uses an FGBZ instead of an FG44. I saw blur on the background picture behind the JB2 foreground mask. The official DjVu uses cheap-to-compress content behind the foreground mask, as it will not be shown.
I just witnessed a case where the colorspace issue appeared with a posterized 8 color .png as input in the Python3.8 version, so the issue isn't only appearing in the test. I'll have to further investigate how to solve it and watch whether also the failing test will get solved with a solution.
Here a suggestion to use OpenCV for the conversion: https://stackoverflow.com/questions/62293077/why-is-pils-image-fromarray-distorting-my-image-color https://note.nkmk.me/en/python-opencv-bgr-rgb-cvtcolor/ https://www.ccoderun.ca/programming/doxygen/opencv/group__imgproc__color__conversions.html
But before conversion you should know what colorspace is used in the image. This hint is probably the direction to look in: https://stackoverflow.com/questions/50641637/identify-colour-space-of-any-image-if-icc-profile-is-empty-pil
The issue with the tested image is filed at Gamera-4: https://github.com/hsnr-gamera/gamera-4/issues/35
The issue with the posterized/palettized image can be solved by allowing mode P for PNG.
@FriedrichFroebel, please take my Gamera-4 patch on the master branch of my fork to solve the to_pil_rgb issue: https://github.com/rmast/gamera-4/commit/2d9877a084cdac45c5b555aa6574eead4747eb67 I didn't see anything wrong with the color conversion, only copy routines that were translated in too fancy a way and were therefore buggy. They were probably only meant to work around the string-to-bytes issue and probably also a memory-leak issue, but the commit messages involved didn't give the exact reason for those changes.
When I run make test on didjvu with my new version of the python3-branch I run into issues with test_xmp.py, which attempts to use a deprecated way to del an imported module. As you are more experienced with Python, could you have a look?
@rmast Are you sure the module problem is really related to your Gamera change and not to any of the three XMP backends (if I remember correctly, I did not install all of them for testing)? Which backend is this about? Do you have a specific error message I can use to have a look at it?
Friedrich,
I don’t expect the issue to be related to Gamera at all. I guess solving the errors of some tests makes some other tests appear that were previously not visible. If you run make test with my python3-branch within a minute I expect the errors I meant rolling over your screen.
@rmast I have been running each of the test files directly on its own, so I would not expect to see any change on it with your patch (with the same OS and Python as in your case). For this reason I asked which XMP backend libraries you have installed, as I used only one as far as I remember (probably python-xmp-toolkit if I am not mistaken; pyexiv2 does not work on Python 3 anyway if I recall correctly), where no problems arose for the XMP tests. So the backend and the specific traceback should help here until I am able to have another look at the code in the next days if I am able to reproduce your issue.
I work on Mint 20.2. I just did apt-get upgrade to make it easy to follow. These are the specs of my VMWare (virtual) x64 machine:
Lots of details of apt and pip install: systeem.txt
I installed python-xmp-toolkit, it wasn't installed, but it didn't result in any difference.
This is the exact error at the end of make test:
Failure: NameError (name 'name' is not defined) ... ERROR
======================================================================
ERROR: Failure: NameError (name 'name' is not defined)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "/usr/lib/python3/dist-packages/nose/loader.py", line 416, in loadTestsFromName
    module = self.importer.importFromPath(
  File "/usr/lib/python3/dist-packages/nose/importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/usr/lib/python3/dist-packages/nose/importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/usr/lib/python3.8/imp.py", line 234, in load_module
    return load_source(name, filename, file)
  File "/usr/lib/python3.8/imp.py", line 171, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 702, in _load
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/robert/didjvu/tests/test_xmp.py", line 59, in <module>
    del name
NameError: name 'name' is not defined
----------------------------------------------------------------------
Ran 122 tests in 8.271s
FAILED (errors=1)
make: *** [Makefile:53: test] Fout 1
This is the programtext in that file:
xmp_backends = [
    import_backend(name)
    for name in [
        'gexiv2',
        'libxmp',
        'pyexiv2',
    ]
]
del name  # this is line 59.
If I try to run the single test I get:
robert@robert-virtual-machine:~/didjvu/tests$ python3 test_xmp.py
Traceback (most recent call last):
  File "test_xmp.py", line 22, in <module>
    from .tools import (
ImportError: attempted relative import with no known parent package
Edit: I had no libxmp-dev installed via apt. That makes the Python 2.7 version completely skip the libxmp tests. Does your version just skip the XMP tests as well? Installing libxmp-dev doesn't change the error.
What is XMP? How much of it should remain in the new upgraded Python 3 version?
@FriedrichFroebel I forgot to tag you in above message with all details.
@rmast Seems like I never actually ran the XMP tests beforehand - now I could actually reproduce your issue about the undefined variable name, as well as some further scope issues. These should be fixed now. For the Gamera issue, I have not yet compiled your patched version and therefore not tested it (I might do this if your PR gets merged.)
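For readers wondering where the NameError comes from: in Python 3 a list comprehension has its own scope, so its loop variable is no longer bound in the surrounding module after the comprehension finishes, and the Python 2 idiom of deleting it afterwards fails. A small sketch of the behaviour follows. The import_backend function here is a hypothetical stand-in, not the helper from tests/test_xmp.py, and the actual fix in the fork may look different.

```python
def import_backend(name):
    # Hypothetical stand-in for the real helper in tests/test_xmp.py.
    return 'backend:' + name


xmp_backends = [
    import_backend(name)
    for name in [
        'gexiv2',
        'libxmp',
        'pyexiv2',
    ]
]

# Python 2 leaked 'name' into the enclosing scope, so 'del name' cleaned it up.
# Python 3 keeps it inside the comprehension, so there is nothing to delete.
try:
    del name
    leaked = True
except NameError:
    leaked = False
```

Under Python 3 the cleanup line can simply be dropped, since the variable never escapes the comprehension.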
From the docs of the python-xmp-toolkit module:
Python XMP Toolkit is a library for working with XMP metadata, as well as reading/writing XMP metadata stored in many different file formats.
Python XMP Toolkit is wrapping Exempi (using ctypes), a C/C++ XMP library based on Adobe XMP Toolkit, ensuring that future updates to the XMP standard are easily incorporated into the library with a minimum amount of work.
Wikipedia has some more information: https://en.wikipedia.org/wiki/Extensible_Metadata_Platform
I am not sure whether all three backends should be kept. With my latest changes, python-xmp-toolkit works fine, but py3exiv2 (the Python 3 port of pyexiv2) fails and would need some additional work. If anyone wants to fix this, feel free to submit a PR to my fork. (I did not yet test the gexiv2 backend, so no idea if it works out of the box. It might be worth setting up GitHub Actions here to simplify such tests.)
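Since the test results depend on which of the three backends happens to be installed, a quick check of the environment can save some back-and-forth. This is only a sketch: the function name and the mapping are illustrative, and the import names are assumptions (GExiv2 reached through PyGObject's gi package, python-xmp-toolkit as libxmp, py3exiv2 as pyexiv2).

```python
import importlib.util

# Assumed top-level import names for the three XMP backends.
BACKEND_MODULES = {
    'gexiv2': 'gi',
    'libxmp': 'libxmp',
    'pyexiv2': 'pyexiv2',
}


def installed_backends(modules=BACKEND_MODULES):
    """Map each backend label to True if its module can be found, without importing it."""
    return {
        backend: importlib.util.find_spec(module) is not None
        for backend, module in modules.items()
    }


status = installed_backends()
for backend, present in sorted(status.items()):
    print(backend, 'installed' if present else 'missing')
```

Note that finding gi only shows that PyGObject is present. Whether the GExiv2 typelib is installed still has to be checked by actually importing it.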
By the way: Running only one test module can be done with a modified version of the implementation of the make test command: python didjvu --test --verbose tests/test_xmp.py.
@FriedrichFroebel Yes! All tests run fine now on your python3 branch when I set the default Python to 3.8 and uninstall all XMP-related packages with apt. I've only issued a PR to your python3 branch for 3 write lines in djvu_support.py that need an .encode() for the bundle flow. I'm curious whether there is anyone who would care about pyexiv2. Would that keep Jakub from doing the upgrade?
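The .encode() change reflects a general Python 3 rule: a file opened in binary mode only accepts bytes, whereas Python 2 accepted plain strings. The sketch below shows the pattern with a temporary file. It does not reproduce the code in djvu_support.py; the write_line helper and the sample text are made up for illustration.

```python
import tempfile


def write_line(stream, text):
    """Write one line of text to a binary stream, encoding it explicitly."""
    stream.write(text.encode('utf-8'))
    stream.write(b'\n')


str_write_error = None
# TemporaryFile opens in binary mode ('w+b') by default.
with tempfile.TemporaryFile() as stream:
    try:
        stream.write('example line\n')  # a str: fine in Python 2, rejected in Python 3
    except TypeError as exc:
        str_write_error = exc
    write_line(stream, 'example line')
    stream.seek(0)
    written = stream.read()
```

The alternative is to open the file in text mode with an explicit encoding, which avoids the .encode() calls but changes the type every other writer to that file has to use.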
My PR at hsnr-gamera/gamera-4 has just been merged into master!
@FriedrichFroebel I was looking for code test coverage, but see there is some code coverage statistic in the source tree: tests/coverage
I bet this shows the code that has no test coverage. So all those lines have to be inspected for changes they may need, for example the issue of writing bytes instead of strings.
Yes! The lines your new test covers don't show up in the code coverage anymore; however, the bytes issue also shows up in the standard coverage package. I don't know if I solved it right, but private/update-coverage now runs:
diff coverage statistics.txt
diff private update-coverage.txt
"~/.local/lib/python3.8/site-packages/coverage/summary.py" line 30:
self.outfile.write(line.rstrip().encode())
self.outfile.write(b"\n")
@rmast It actually is much simpler for Python3-only code: Just use report_stream = plugin.stream = io.StringIO() instead of the current BytesIO(). I have fixed this in the fork.
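The reason this works without touching the coverage package: a reporter that writes str lines needs a text stream, and io.StringIO is exactly that, while io.BytesIO rejects str. A short sketch with a made-up report line (no coverage plugin involved):

```python
import io

# Text stream: accepts the str lines a reporter writes.
report_stream = io.StringIO()
print('example/module.py  12  3  75%', file=report_stream)
report_text = report_stream.getvalue()

# Binary stream: the same write fails under Python 3.
binary_stream = io.BytesIO()
try:
    binary_stream.write('example/module.py  12  3  75%\n')
    bytes_rejected_str = False
except TypeError:
    bytes_rejected_str = True
```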
@mara004 wrote
You can also achieve good compression ratios with PDF and lossless JBIG2 encoding, for example. The PDF format has the clear advantage of much better compatibility.
Ever seen this project? https://github.com/internetarchive/archive-pdf-tools Unfortunately it doesn't work without provided hocr-file, with an open issue: https://github.com/internetarchive/archive-pdf-tools/issues/11
@rmast
Ever seen this project? https://github.com/internetarchive/archive-pdf-tools
I didn't know this yet, but it's highly interesting. I wonder whether the author of OCRmyPDF knows about archive-pdf-tools.
I doubt it. I want to investigate how good it is; it probably only supports the written happy flow. It chokes with complex Python errors when some of those parameters are left out.
I doubt it. I want to investigate how good it is; it probably only supports the written happy flow. It chokes with complex Python errors when some of those parameters are left out.
You mean the project claims a reliability it does not offer?
I’ve not seen any reliability claim for general use. Only its name, the Internet Archive, promises some professional quality for the happy flow:
“While the code is already being used internally to create PDFs at the Internet Archive, the code still needs more documentation and cleaning up, so don't expect this to be super well documented just yet.”
@mara004: I've found an issue just with the first run that succeeded with a multipage scanned PDF. The background contains fuzz from the partial pixels just as djvumake. C44 performs better. The suggestion at the bottom of the repo https://github.com/internetarchive/archive-pdf-tools#examining-the-results would probably only be handy with a manual review in a workflow, as done in gscan2pdf and scantailor.
We should probably try to get it working on Python 3.10 as well: https://github.com/jwilk/python-djvulibre/issues/13
@rmast I am currently on Python 3.8.10 due to my distro, so no way to directly check it (leaving GitHub Actions aside). But it seems like didjvu uses subprocess calls in the corresponding djvulibre wrapper (https://github.com/jwilk/didjvu/blob/master/lib/djvu_support.py) instead of the native wrapper, so it should work in theory.
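To illustrate why the subprocess approach is less sensitive to the Python version than a compiled binding: the wrapper only starts an external program and reads its output, so nothing has to be rebuilt for a new interpreter. The sketch below is not the code from lib/djvu_support.py. The run_tool helper is hypothetical, and the Python interpreter itself stands in for a DjVuLibre command-line tool so that the example runs anywhere.

```python
import subprocess
import sys


def run_tool(argv):
    """Run an external command and return its standard output as text."""
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Pipes deliver bytes in Python 3, so decode explicitly.
    return result.stdout.decode('utf-8', 'replace')


# Stand-in command; a real wrapper would pass the path of a DjVuLibre binary.
tool_output = run_tool([sys.executable, '-c', 'print("ok")'])
```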
This instruction reveals Python3.10.2 at the moment: https://computingforgeeks.com/how-to-install-python-on-ubuntu-linux-system/
This instruction allows switching between default Python-versions: https://stackoverflow.com/questions/43062608/how-to-update-alternatives-to-python-3-without-breaking-apt
This non-LTS Ubuntu distro 21.04 has 3.10 in the package manager: https://packages.ubuntu.com/hirsute/python3.10-distutils
I fixed a Python3.10 issue in Gamera-4: https://github.com/hsnr-gamera/gamera-4/pull/39 and a Python3.9 issue in didjvu: https://github.com/rmast/didjvu/tree/python3.9
With Python3.9 there still are some new gi-import-warnings with test_xmp
I've now seen all tests run Ok in Ubuntu 22.04 with these extra packages and my python3.9 branch.
sudo apt install python3-pip gir1.2-gexiv2-0.10 libexempi-dev libboost-python-dev libexiv2-dev libpng-dev libtiff-dev djvulibre-bin exiv2 python3-pil
pip install py3exiv2
pip install python-xmp-toolkit
pip install nose
Friedrich sees room for improvement of my GExiv2-fix. But I think we're near a viable Python3.10 version for the coming Ubuntu 22.04
Hi, I'm interested in running didjvu on Debian again.
I have upgraded my last machine from buster and the packages are no longer available, presumably because of no python2.7 support.
https://github.com/FriedrichFroebel/didjvu/issues/5#issuecomment-1044544118 says that Gamera 4 is now officially released.
Does anyone have any patches I can try to give the python3 port a test?
I'm happy to build it myself, but I would appreciate some pointers on which dependencies I need and whether I need to build those myself as well (it looks like I'll need to at least build Gamera 4).
Thanks for any pointers you can give me.
https://github.com/FriedrichFroebel/didjvu This repository is a port of the original repository to Python 3.
Python 2 will be EOL end of 2019. Distributions will stop shipping it. https://pythonclock.org/