internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Windows port #22

Closed MerlijnWajer closed 2 years ago

MerlijnWajer commented 3 years ago

The only thing we need to fix for the Windows port to be initially functional is a pattern I unfortunately use a lot: open a file, remove it, but keep the fd around. Windows doesn't allow this. See, for example, os.remove(tiff_in) in insert_images_mrc.
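A minimal sketch of a variant of that pattern that also works on Windows, assuming the caller only needs the file's bytes: close the handle first, then remove the file. The helper name `read_then_remove` is hypothetical and is not taken from archive-pdf-tools.

```python
import os


def read_then_remove(path):
    """Read a file fully, close it, and only then delete it.

    Hypothetical helper, not part of archive-pdf-tools. On POSIX a file can
    be unlinked while a descriptor is still open; on Windows os.remove()
    raises an exception (typically PermissionError) for a file that is
    still in use, so the handle has to be closed before the removal.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    # The with-block has closed the handle, so removal works on both platforms.
    os.remove(path)
    return data
```

The trade-off is that the whole file is held in memory; where that is too costly, the file can instead be kept on disk and removed once the last consumer has closed it.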

MerlijnWajer commented 3 years ago

There is a kakadu release for Windows:

We also need to solve this as long as we have no jbig2enc built for Windows:

fusefib commented 2 years ago

Looking forward to this project working on Windows.

Isn't the binary here (jbig2.exe) sufficient? https://github.com/2m/image-to-jbig2-pdf

MerlijnWajer commented 2 years ago

This is the project that it relies on: https://github.com/agl/jbig2enc

I haven't tried building it on Windows yet because it mentioned Visual Studio, and I don't have that lying around. Perhaps the automake build could also work. Ideally someone would do something like the jpegoptim-windows guy did: provide a way to build the binary on GitHub Actions or some other CI.

MerlijnWajer commented 2 years ago

I'm currently working on refactoring the code to make Windows mostly work (even with ccitt instead of jbig2 the results are pretty good). If you have a way to get a working jbig2enc binary for Windows, I'm happy to work with you to make sure this works too.

MerlijnWajer commented 2 years ago

As of commit f072c89b40a061d55d272d6f128c1caec644925f the latest github artifact should work on Windows. I'll issue a new release today or tomorrow. I tried this:

wine python.exe Scripts/recode_pdf --from-imagestack 'data/sim_english-illustrated-magazine_1884-12_2_15_jp2/*' --hocr-file data/sim_english-illustrated-magazine_1884-12_2_15_hocr.html --scandata data/sim_english-illustrated-magazine_1884-12_2_15_scandata.xml --dpi 400 -m 2 -t 10 --mask-compression ccitt --denoise fast -v -o /tmp/out.pdf

(I didn't find a jbig2enc yet, so I'm using ccitt. The default JPEG2000 implementation is Pillow. Using kakadu can definitely speed up the process some more, but Pillow seems like a sane default.)

fusefib commented 2 years ago

Apologies for posting in a closed issue thread if that's out of the norm. I agree that a binary that can be automatically built would be ideal, but doing the following on Linux:

sudo apt-get install automake
sudo apt install libtool
sudo apt install libleptonica-dev
sudo apt install zlib1g-dev
git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make

gets me a jbig2 binary:

abc@host:~/Desktop/jbig2enc$ jbig2 
No filename given

Usage: jbig2 [options] <input filenames...>
Options:
  -b <basename>: output file root name when using symbol coding
  -d --duplicate-line-removal: use TPGD in generic region coder
  -p --pdf: produce PDF ready data
  -s --symbol-mode: use text region, not generic coder
  -t <threshold>: set classification threshold for symbol coder (def: 0.85)
  -T <bw threshold>: set 1 bpp threshold (def: 188)
  -r --refine: use refinement (requires -s: lossless)
  -O <outfile>: dump thresholded image as PNG
  -2: upsample 2x before thresholding
  -4: upsample 4x before thresholding
  -S: remove images from mixed input and save separately
  -j --jpeg-output: write images from mixed input as JPEG
  -a --auto-thresh: use automatic thresholding in symbol encoder
  --no-hash: disables use of hash function for automatic thresholding
  -V --version: version info
  -v: be verbose

jbig2.exe from the above-linked github seems to be a slightly older binary, but still working nonetheless.

C:\out>jbig2
No filename given

Usage: jbig2 [options] <input filenames...>
Options:
  -b <basename>: output file root name when using symbol coding
  -d --duplicate-line-removal: use TPGD in generic region coder
  -p --pdf: produce PDF ready data
  -s --symbol-mode: use text region, not generic coder
  -t <threshold>: set classification threshold for symbol coder (def: 0.85)
  -T <bw threshold>: set 1 bpp threshold (def: 188)
  -r --refine: use refinement (requires -s: lossless)
  -O <outfile>: dump thresholded image as PNG
  -2: upsample 2x before thresholding
  -4: upsample 4x before thresholding
  -S: remove images from mixed input and save separately
  -j --jpeg-output: write images from mixed input as JPEG
  -v: be verbose

I don't remember the exact provenance (VirusTotal clears it), but here is an up-to-date binary (jbig2enc 0.28).

I can also try to compile it myself with VS later if that's needed. I'm still confused about "no jbig2enc built," though.

MerlijnWajer commented 2 years ago

I think the binary is called jbig2 (and not jbig2enc), so that makes sense. I should perhaps have written "no jbig2enc build" instead of "built". What I mean is a clear and simple way for users to get a jbig2.exe file that they can trust. The point is mostly that I'm not psyched about having a link in the instructions to some online mega.co.nz or other download host that contains a jbig2.exe file. I think the file you linked on mega.co.nz might already just work if you set --mask-compression jbig2 and ensure it is in the PATH, but it's not something I'd like to link to in the README.
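A small sketch of how a wrapper script might check whether jbig2 is in the PATH before choosing a value for --mask-compression. The function name `pick_mask_compression` is hypothetical, and falling back to ccitt simply mirrors what was done earlier in this thread; this is not how recode_pdf itself decides.

```python
import shutil


def pick_mask_compression(which=shutil.which):
    """Return "jbig2" if a jbig2 executable is found on PATH, else "ccitt".

    Hypothetical helper. shutil.which() honours PATHEXT on Windows, so a
    file named jbig2.exe is found when asking for "jbig2". The `which`
    parameter exists only so the lookup can be replaced in tests.
    """
    return "jbig2" if which("jbig2") else "ccitt"


if __name__ == "__main__":
    print("--mask-compression", pick_mask_compression())
```

Note that finding the executable does not prove it can start: a dynamically linked jbig2.exe can still fail at launch if a DLL it needs is missing.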

Having a way + some instructions on how to build it on Windows would be great. Then I can either build it myself or link to a known-good source (like your build). I have also built the jbig2 binary without problems on Linux, but I don't have a Windows system, I just test with wine.

MerlijnWajer commented 2 years ago

Thanks for the info/instructions btw, we can either use this issue or create a new one to sort out the jbig2.exe situation - either way is fine by me.

fusefib commented 2 years ago

I found the source for the above binary, I think: https://github.com/anotatta/jbig2enc/releases/tag/0.29 Perhaps that can be better trusted to an extent.

MerlijnWajer commented 2 years ago

This executable requires various libraries, though, like liblept-5.dll (leptonica), libgcc_s_seh-1.dll and libstdc++-6.dll, so those would also have to be packaged. Ideally the binary would be statically compiled.

MerlijnWajer commented 2 years ago

Or we'd have to document getting leptonica and MinGW set up.

MerlijnWajer commented 2 years ago

(Sorry, I'm not really a Windows user, so I am not sure what the usual sensible approach would be).

I'm also looking at utilising pyinstaller to make a standalone .exe file with everything contained.

fusefib commented 2 years ago

Ugh, you're right. I had a PATH environment variable set to 'C:\Program Files\Tesseract-OCR' so I overlooked the needed DLLs. Yes, ideally it should be statically compiled.

fusefib commented 2 years ago

Some further info: anotatta's binary hosted on GitHub is 32-bit and requires a 32-bit version of libstdc++-6.dll, which also appears to be the version included in the Tesseract-OCR installation (source).
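For anyone who wants to check whether a given jbig2.exe or DLL is 32-bit or 64-bit without extra tools, here is a rough sketch that reads the Machine field from the PE header. The function name `pe_machine` is made up for this example; the offsets (the PE header offset stored at 0x3C, and the machine values 0x014C for x86 and 0x8664 for x86-64) come from the PE/COFF format.

```python
import struct

# Machine values defined by the PE/COFF format.
MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x86-64)"}


def pe_machine(data):
    """Return a label for the Machine field of a PE image, given its bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE file")
    # The 4-byte little-endian offset of the PE header lives at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 2-byte Machine field follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, hex(machine))
```

Usage would be something like `pe_machine(open("jbig2.exe", "rb").read())`, repeated for each DLL to spot a 32-bit/64-bit mismatch.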

Running ldd gives this result:

W:\jbig2enc-32bit>ldd jbig2.exe
        ntdll.dll => /c/Windows/SYSTEM32/ntdll.dll (0x7ffe8b210000)
        KERNEL32.DLL => /c/Windows/System32/KERNEL32.DLL (0x7ffe89130000)
        KERNELBASE.dll => /c/Windows/System32/KERNELBASE.dll (0x7ffe87340000)
        msvcrt.dll => /c/Windows/System32/msvcrt.dll (0x7ffe89090000)
        WS2_32.dll => /c/Windows/System32/WS2_32.dll (0x7ffe88880000)
        RPCRT4.dll => /c/Windows/System32/RPCRT4.dll (0x7ffe885a0000)
        liblept-5.dll => /c/Program Files/Tesseract-OCR/liblept-5.dll (0x71040000)
        libgcc_s_seh-1.dll => /c/Program Files/Tesseract-OCR/libgcc_s_seh-1.dll (0x61440000)
        GDI32.dll => /c/Windows/System32/GDI32.dll (0x7ffe8b1b0000)
        gdi32full.dll => /c/Windows/System32/gdi32full.dll (0x7ffe875e0000)
        libstdc++-6.dll => /c/Program Files/Tesseract-OCR/libstdc++-6.dll (0x1050000)
        msvcp_win.dll => /c/Windows/System32/msvcp_win.dll (0x7ffe880c0000)
        ucrtbase.dll => /c/Windows/System32/ucrtbase.dll (0x7ffe881b0000)
        USER32.dll => /c/Windows/System32/USER32.dll (0x7ffe88c20000)
        libwinpthread-1.dll => /c/Program Files/Tesseract-OCR/libwinpthread-1.dll (0x64940000)
        win32u.dll => /c/Windows/System32/win32u.dll (0x7ffe87320000)
        libjpeg-8.dll => /c/Program Files/Tesseract-OCR/libjpeg-8.dll (0x6b800000)
        libgif-7.dll => /c/Program Files/Tesseract-OCR/libgif-7.dll (0x65880000)
        libopenjp2.dll => /c/Program Files/Tesseract-OCR/libopenjp2.dll (0x70b40000)
        libpng16-16.dll => /c/Program Files/Tesseract-OCR/libpng16-16.dll (0x68b40000)
        libtiff-5.dll => /c/Program Files/Tesseract-OCR/libtiff-5.dll (0x68ec0000)
        zlib1.dll => /c/Program Files/Intel/WiFi/bin/zlib1.dll (0x73480000)
        libwebp-7.dll => /c/Program Files/Tesseract-OCR/libwebp-7.dll (0x61940000)
        libjbig-2.dll => /c/Program Files/Tesseract-OCR/libjbig-2.dll (0x64900000)
        VCRUNTIME140.dll => /c/Windows/SYSTEM32/VCRUNTIME140.dll (0x7ffe74b30000)
        liblzma-5.dll => /c/Program Files/Tesseract-OCR/liblzma-5.dll (0x63cc0000)
        libstdc++-6.dll => /c/Program Files/Tesseract-OCR/libstdc++-6.dll (0x1050000)

The SYSTEM32 dlls are a non-problem. zlib1.dll is also in the Tesseract-OCR installation folder.
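To make a listing like the one above easier to read, here is a sketch that filters ldd output down to the DLLs resolved outside the Windows directory and records which folder each came from. The function name `non_system_dlls` is invented for this example, and it assumes the MSYS2-style `name => /c/path (address)` line format shown in this thread.

```python
def non_system_dlls(ldd_output):
    """Map each DLL resolved outside /c/Windows to the folder it came from."""
    result = {}
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue
        name, _, rest = line.partition("=>")
        # Drop the trailing load address, e.g. " (0x71040000)".
        path = rest.rsplit("(", 1)[0].strip()
        if path.lower().startswith("/c/windows/"):
            continue  # system DLLs are not a packaging concern
        result[name.strip()] = path.rsplit("/", 1)[0]
    return result
```

On the listing above this would also show that zlib1.dll is being picked up from the Intel WiFi folder rather than from Tesseract-OCR, presumably because of PATH ordering on that machine.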

Redoing the 32-bit build is giving me a bit of trouble, but I managed to compile a 64-bit .exe with MSYS2 (MinGW x64) after applying this patch. It requires a 64-bit version of libstdc++-6.dll.

W:\jbig2enc-64bit>ldd jbig2.exe
        ntdll.dll => /c/Windows/SYSTEM32/ntdll.dll (0x7ffe8b210000)
        KERNEL32.DLL => /c/Windows/System32/KERNEL32.DLL (0x7ffe89130000)
        KERNELBASE.dll => /c/Windows/System32/KERNELBASE.dll (0x7ffe87340000)
        ucrtbase.dll => /c/Windows/System32/ucrtbase.dll (0x7ffe881b0000)
        WS2_32.dll => /c/Windows/System32/WS2_32.dll (0x7ffe88880000)
        RPCRT4.dll => /c/Windows/System32/RPCRT4.dll (0x7ffe885a0000)
        libstdc++-6.dll => /w/jbig2enc-64bit/libstdc++-6.dll (0x7ffe55580000)
        libgcc_s_seh-1.dll => /c/Program Files/Tesseract-OCR/libgcc_s_seh-1.dll (0x61440000)
        liblept-5.dll => /c/Program Files/Tesseract-OCR/liblept-5.dll (0x71040000)
        msvcrt.dll => /c/Windows/System32/msvcrt.dll (0x7ffe89090000)
        GDI32.dll => /c/Windows/System32/GDI32.dll (0x7ffe8b1b0000)
        gdi32full.dll => /c/Windows/System32/gdi32full.dll (0x7ffe875e0000)
        msvcp_win.dll => /c/Windows/System32/msvcp_win.dll (0x7ffe880c0000)
        USER32.dll => /c/Windows/System32/USER32.dll (0x7ffe88c20000)
        libwinpthread-1.dll => /c/Program Files/Tesseract-OCR/libwinpthread-1.dll (0x64940000)
        win32u.dll => /c/Windows/System32/win32u.dll (0x7ffe87320000)
        libgif-7.dll => /c/Program Files/Tesseract-OCR/libgif-7.dll (0x65880000)
        libjpeg-8.dll => /c/Program Files/Tesseract-OCR/libjpeg-8.dll (0x6b800000)
        libpng16-16.dll => /c/Program Files/Tesseract-OCR/libpng16-16.dll (0x68b40000)
        libopenjp2.dll => /c/Program Files/Tesseract-OCR/libopenjp2.dll (0x70b40000)
        libtiff-5.dll => /c/Program Files/Tesseract-OCR/libtiff-5.dll (0x68ec0000)
        zlib1.dll => /c/Program Files/Intel/WiFi/bin/zlib1.dll (0x73480000)
        libwebp-7.dll => /c/Program Files/Tesseract-OCR/libwebp-7.dll (0x61940000)
        VCRUNTIME140.dll => /c/Windows/SYSTEM32/VCRUNTIME140.dll (0x7ffe74b30000)
        libjbig-2.dll => /c/Program Files/Tesseract-OCR/libjbig-2.dll (0x64900000)
        liblzma-5.dll => /c/Program Files/Tesseract-OCR/liblzma-5.dll (0x63cc0000)

When I tried to change the settings to compile everything into a static binary, it errored out saying that I don't have Leptonica installed (even though I do). I'm admittedly not very experienced with this.

Anyway, the options seem to be: 1) Install Tesseract-OCR and simply drop the 32-bit jbig2.exe into its installation folder (it may take a UAC prompt to approve moving the file into a folder in Program Files). 2) Add the Tesseract-OCR installation directory to the PATH environment variable, so that the 64-bit jbig2.exe and the 64-bit libstdc++-6.dll can live elsewhere. 3) Distribute jbig2.exe with all the DLLs? 4) Static build?

MerlijnWajer commented 2 years ago

Suggesting to install Tesseract is a fine solution by me. I was planning to include that later in any case, when I get to the OCR part of all of this (which runs before the PDF step).

MerlijnWajer commented 2 years ago

Alternatively, I might try my hand at (4) at some point, but probably not in the next few weeks.

fusefib commented 2 years ago

An additional note: after running pip install archive-pdf-tools the resulting scripts lack extensions, so a command on Windows has to be something like: python C:\Python38\Scripts\recode_pdf. Adding a .py extension to the file, if everything else is properly set up, should make recode_pdf available as a command. It's probably a matter of Python packaging, and it's unimportant other than that it should be documented for new users.

E.g. tif -> pdf would currently go something like this:

for %f IN (*.tif) do ( tesseract -l deu "%f" - hocr > "_out_%~nf.hocr" && python C:\Python38\Scripts\recode_pdf --from-imagestack "%f" --hocr-file "_out_%~nf.hocr" --dpi 600 -m 2 --hq-pages 1 --mask-compression jbig2 --denoise-mask fast --bg-downsample 3 -v -o "__out_%~nf.pdf" )
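The same loop can be sketched in Python, which sidesteps the missing script extension by launching recode_pdf through the current interpreter. The function names `commands_for` and `convert_all` are hypothetical; the tesseract and recode_pdf flags are copied from the command above, and the path to the recode_pdf script is passed in by the caller rather than guessed.

```python
import subprocess
import sys
from pathlib import Path


def commands_for(tif, recode_pdf_script, lang="deu"):
    """Build the tesseract and recode_pdf command lines for one TIFF."""
    tif = Path(tif)
    hocr = tif.with_name(f"_out_{tif.stem}.hocr")
    pdf = tif.with_name(f"__out_{tif.stem}.pdf")
    # "-" makes tesseract write the hOCR to stdout, as in the batch loop.
    tesseract_cmd = ["tesseract", "-l", lang, str(tif), "-", "hocr"]
    recode_cmd = [
        sys.executable, str(recode_pdf_script),
        "--from-imagestack", str(tif),
        "--hocr-file", str(hocr),
        "--dpi", "600", "-m", "2", "--hq-pages", "1",
        "--mask-compression", "jbig2",
        "--denoise-mask", "fast",
        "--bg-downsample", "3",
        "-v", "-o", str(pdf),
    ]
    return hocr, tesseract_cmd, recode_cmd


def convert_all(folder, recode_pdf_script):
    """Run OCR and PDF generation for every *.tif in a folder."""
    for tif in sorted(Path(folder).glob("*.tif")):
        hocr, tesseract_cmd, recode_cmd = commands_for(tif, recode_pdf_script)
        with open(hocr, "wb") as out:
            subprocess.run(tesseract_cmd, stdout=out, check=True)
        subprocess.run(recode_cmd, check=True)
```

A call would look like `convert_all(".", r"C:\Python38\Scripts\recode_pdf")`, assuming tesseract and jbig2 are both in the PATH.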


No matter how it's tweaked, the results for scanned book pages with photo illustrations have not been great so far. MRC doesn't seem like an appropriate fit for that use case, though I recall (ABBYY-generated?) PDFs on archive.org typically looking a bit better.

MerlijnWajer commented 2 years ago

> An additional note: after running pip install archive-pdf-tools the resulting scripts lack extensions, so a command on Windows has to be something like: python C:\Python38\Scripts\recode_pdf. Adding a .py extension to the file, if everything else is properly set up, should make recode_pdf available as a command. It's probably a matter of Python packaging, and it's unimportant other than that it should be documented for new users.

Yeah, I also realised that happened, but I wasn't sure why it was like that. I guess Windows wants some extension and doesn't honour the shebang (doh). I suppose we could do that rename.

> E.g. tif -> pdf would currently go something like this:
>
> for %f IN (*.tif) do ( tesseract -l deu "%f" - hocr > "_out_%~nf.hocr" && python C:\Python38\Scripts\recode_pdf --from-imagestack "%f" --hocr-file "_out_%~nf.hocr" --dpi 600 -m 2 --hq-pages 1 --mask-compression jbig2 --denoise-mask fast --bg-downsample 3 -v -o "__out_%~nf.pdf" )

Just a side note, if you first do the tesseract calls, and then use hocr-combine-stream ( https://archive-hocr-tools.readthedocs.io/en/latest/#hocr-combine-stream ) you can pass a glob to --from-imagestack (and the combined hocr file).

> No matter how it's tweaked, the results for scanned book pages with photo illustrations have not been great so far. MRC doesn't seem like an appropriate fit for that use case, though I recall (ABBYY-generated?) PDFs on archive.org typically looking a bit better.

That could be, although I found it was usually quite similar to the ABBYY/LuraTech output. If you can share an example we can take a look. There are a few variables to consider:

You can probably get decent compression ratios (in your case probably more if your input images are not JPEG2000) that way. The typical compression ratio for archive.org items is ~7-8x and ~2-3x if the entire thing is in high-quality mode. But we start with JPEG2000 images as input, which typically compress better than most other image formats (I know this is somewhat outdated).
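To compare your own results against those figures, the ratio can be computed as total input image size divided by output PDF size. This is only a sketch with a made-up function name (`compression_ratio`); it assumes that is the definition meant above, which the thread does not spell out.

```python
import os


def compression_ratio(input_paths, output_pdf):
    """Return (sum of input file sizes) / (size of the output PDF)."""
    total_in = sum(os.path.getsize(p) for p in input_paths)
    return total_in / os.path.getsize(output_pdf)
```

For example, 700 MB of input images and a 100 MB PDF would give a ratio of 7.0.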

If you want to fiddle with the --bg-compression-flags and --fg-compression-flags, look at the arguments that opj_compress takes (or grk_compress, or just the Pillow args). In any case, for the issue of quality with images/photos I would recommend opening a separate issue. NB: A colleague of mine is working on adding "ocr_photo" elements to the hOCR output of Tesseract, which would potentially allow us to special-case parts of the image that are considered to be a photo.