internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Usefulness of MRC for decent quality compression of scanned book pages with illustrations #33

Open fusefib opened 2 years ago

fusefib commented 2 years ago

Opening a new issue as requested.

Here are some samples: https://mega.nz/folder/BRhChKob#xo-HHaJrD9VYN6YV3ur9WA

  - 128.tif & 188.tif - original cleaned-up 600dpi scans
  - -scantailor.tif - 600dpi mixed output with bitonal text and color photos, as autodetected
  - -scantailor-pdfbeads.pdf - the above .tif split into two layers, with the text layer JBIG2-encoded and the background layer JP2-encoded and downsampled to 150dpi, everything assembled into a PDF using pdfbeads
  - *.jp2 - some compressed versions of the original; I forgot the settings. Page 128 is almost half the size of the PDFs, so I assume the PDF sizes can be slightly improved.

The folders have some residual files. ScanTailor itself can now split tiffs, though I have no idea how to merge them as layers in a PDF. (That would be useful to learn.)

Can MRC output be made comparable to these PDFs at the same or a lower size? I'm also curious whether that can be achieved directly from the original cleaned-up scan, or whether the ScanTailor mixed-output step is still advised.

MerlijnWajer commented 2 years ago

I took a look and I have a few thoughts. The damage to the photos comes mostly from parts of the photo being marked as background and others as foreground. Ultimately, MRC is not ideal for photos, but I think we can come up with something that is quite workable if we can figure out what parts are just images.

  1. Having photo information in the hOCR file can help us identify photo regions (see https://github.com/internetarchive/archive-pdf-tools/issues/23). This needs to be added to Tesseract's hOCR renderer and a colleague of mine is working on it.
  2. Scantailor seems to generate useful output where it separates the background and the foreground (alternative to (1)). I need to think about how we can use that output exactly (maybe you could provide a mask where we are sure there is only background). One option would be to have a way to already provide the background and foreground images separately, but I'd have to rewrite parts of the code to make that work without too many hacks.
  3. The photos in your image have a lot of digitisation/camera noise, blurring some of those parts might also help with mask generation or compression. However, currently archive-pdf-tools will only blur an entire image/page if it deems it too noisy.

If we have a good idea of what is text and what is photo, we can attempt to use JPEG2000 Region of Interest encoding, and we will also have the mask exclude any/all parts of the photo. Then we can encode the photo as part of the background, and try to get higher quality at the regions where we think we have photos. openjpeg/grok has some form of ROI and kakadu also has -roi in combination with Rshift/Rweight/Rlevels. I haven't gotten this to work in the past, but maybe we ought to re-try when we have ocr_photo support in hOCR files.

So to summarise: I think we can make the software handle this better if it knows which regions are images. Ideally we get that from the hOCR file, but we can think of another way to provide that information through ScanTailor (or custom code).

BTW: you can already get better compression than pdfbeads simply by providing higher-quality --bg-compression-flags and --fg-compression-flags.
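
For example (untested values, purely illustrative; the right rates depend on the material and on the JPEG2000 encoder used):

recode_pdf -v --dpi 600 -J kakadu -I in.tif --hocr-file in.hocr -o out.pdf --bg-compression-flags '-rate 0.1' --fg-compression-flags '-rate 0.2'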

Useful links:

MerlijnWajer commented 2 years ago

This link: https://www.researchgate.net/publication/281283716_The_Significance_of_Image_Compression_in_Plant_Phenotyping_Applications

Suggestion to perform ROI like this:

Lossy (ROI): kdu_compress -no_weights Rshift=16 Rlevels=5 -roi roifile,0.5 -rate r

I could give that a try later this week.

rmast commented 2 years ago

The folders have some residual files. ScanTailor itself can now split tiffs, though I have no idea how to merge them as layers in a PDF. (That would be useful to learn.)

I've not seen that working either. @trufanov-nok has done some similar work on getting those split files into a .djvu, but I've not tried it yet.

MerlijnWajer commented 2 years ago

"Merging" them as layers is not possible in PDFs, but you can have images on top of each other with transparency. Or you can merge them before you insert them. But that wouldn't be necessary if we try use some of my above comments. I have never used scantailor but it looks cool, maybe we can support using scantailor to clean up documents some.

rmast commented 2 years ago

I saw that JPEG2000 also has a composite JPM format, meant for MRC. I don't know whether it offers more possibilities than PDF already has, but since JPEG2000 is part of PDF, one would expect those JPM possibilities to be usable in a PDF.

MerlijnWajer commented 2 years ago

I don't think it really matters: you'd still be encoding the JBIG2 and JPEG2000 images separately inside the JPM (which is what we do in the PDFs too, at little overhead), but JPM support is practically non-existent in tooling as far as I can tell, making it not a great thing to target.

fusefib commented 2 years ago

@rmast The option is available with Scantailor Advanced, which is still overall better than ScanTailor Universal. https://github.com/4lex4/scantailor-advanced/releases

I used Scantailor Advanced's Picture Shape -> Rectangular with Sensitivity (%): 100%, and checked Higher Search Sensitivity. I haven't looked at the code yet. It worked well, perhaps because the scan was already cleaned up. But a similar way to detect image regions and then lower the compression for those regions could be a safe, across-the-board solution for large-scale automated tasks like the ones @MerlijnWajer has in mind.

It also has a Splitting box when outputting in Mixed mode, and that's where you get the files.

PDFbeads performs the splitting separately, based on the ScanTailor mixed output (though IIRC it can also do something on its own with some options).

@MerlijnWajer Do you happen to be aware of any existing tooling that can do one of the things that pdfbeads does, namely make the JBIG2 a transparent layer in the PDF (I guess that's what happens, then the downscaled image gets underlaid), but from a specific JBIG2 tiff input?

MerlijnWajer commented 2 years ago

@MerlijnWajer Do you happen to be aware of any existing tooling that can do one of the things that pdfbeads does, namely make the JBIG2 a transparent layer in the PDF (I guess that's what happens, then the downscaled image gets underlaid), but from a specific JBIG2 tiff input?

Not sure if I follow. The way archive-pdf-tools works, overly simplified:

  1. Load image, create MRC components (mask, foreground, background)
  2. Compress foreground and background as JPEG2000 (after "optimising" them); compress the mask as JBIG2 or CCITT
  3. Paste background in the PDF page.
  4. Paste foreground in the PDF page, over the background image, with the mask as alpha (transparency) layer.

This is actually visible if your computer is sufficiently slow: first the background image will finish decompressing, at which point you will see it, and only later the "text" (foreground) layer appears.

So it sounds like it does what you're suggesting, right?
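
As a rough illustration of steps 3 and 4, here is a minimal sketch (not the actual archive-pdf-tools code) of how a background image plus a masked foreground end up stacked on a PDF page. It assumes a recent PyMuPDF where Page.insert_image accepts a mask= argument, and hypothetical pre-made bg.png, fg.png and mask.png files:

```python
# Minimal sketch of the layering described above; not archive-pdf-tools' code.
# bg.png: downsampled background, fg.png: foreground colours,
# mask.png: bitonal text mask used as a stencil for the foreground.
import fitz  # PyMuPDF

doc = fitz.open()                           # new, empty PDF
page = doc.new_page(width=595, height=842)  # roughly A4, in points
rect = page.rect

# 1) the background goes in first, covering the whole page
page.insert_image(rect, filename="bg.png")

# 2) the foreground goes on top, with the mask as its stencil/alpha,
#    so only the text pixels of fg.png are painted over the background
with open("mask.png", "rb") as f:
    mask_bytes = f.read()
page.insert_image(rect, filename="fg.png", mask=mask_bytes)

doc.save("mrc-sketch.pdf", deflate=True)
```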

MerlijnWajer commented 2 years ago

Or rather, adding an image with JBIG2 as transparency layer is what it does already -- so we have code that can do it, iiuc.

trufanov-nok commented 2 years ago

This discussion popped up in my notifications, and I'm not sure if this is relevant, but I would like to note that ScanTailor is a kind of semi-automatic text-to-image segmenter, and by default it outputs a single image. It seems that 12 years ago the author decided to reserve pure white (0x??FFFFFF) and pure black (0x??000000) for the text parts, and those colors never appear in the illustration parts of the result. In other words, if ScanTailor treats something as an illustration, the variability of its pixel colors is limited to all colors except those two: you cannot expect to find pure black or pure white pixels there.

The "export" functionality was introduced 7 years ago in ScanTailor Featured and was adopted as legacy by the currently active forks, Advanced and Universal. Basically it just reads the output image pixel by pixel and writes it to two different image files: one with the b/w pixels only, and one with everything except them. We call them "layers", but that's just a reference to the so-called "method of separate layers" - an approach for assembling DjVu documents that works around the fact that open-source DjVu encoders lack text-to-image segmenters entirely and the commercial one can make mistakes. One of the output files gets a ".sep" suffix, and such a pair of images is designed to be used with the "DjVu Imager" application. The idea is to encode (with a commercial or open-source encoder) the bundled b/w DjVu document and later automatically insert the illustrations into it with DjVu Imager, matching the ".sep" files to the corresponding pages by filename. So you don't need to rely on automatic segmentation at all, which gives you the best-looking illustrations.

The ScanTailor versions (I guess all of them) that reserve the b/w colors for the text part of the output can be identified by the presence of the reserveBlackAndWhite function in the source code. In other words, you don't need the export functionality if you can read the ScanTailor-processed image pixel by pixel; the "export" could be done on the fly.
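
To illustrate that last point, here is a rough sketch of such an on-the-fly split based on the reserveBlackAndWhite rule described above (hypothetical filenames; this is not ScanTailor's export code):

```python
# Split a ScanTailor mixed-mode output into a "text" layer (pure black/white
# pixels only) and an "illustration" layer (everything else), relying on the
# fact that illustration regions never contain pure black or pure white.
import numpy as np
from PIL import Image

img = np.array(Image.open("page-mixed.tif").convert("RGB"))
pure_bw = np.all(img == 0, axis=2) | np.all(img == 255, axis=2)

text_layer = np.where(pure_bw[..., None], img, 255).astype(np.uint8)
illustration_layer = np.where(pure_bw[..., None], 255, img).astype(np.uint8)

Image.fromarray(text_layer).save("page-text.png")
Image.fromarray(illustration_layer).save("page-illustration.png")
```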

rmast commented 2 years ago

So one of the issues - background pictures containing fuzz behind the foreground - isn't addressed by this reserveBlackAndWhite output. I don't think ScanTailor cleans up the pixels surrounding that reserved black and white the way it is documented for DjVu, where gradient vectors in the original picture are matched and only the background vector is extended.

Redsandro commented 2 years ago

If we have a good idea of what is text and what is photo, we can attempt to use JPEG2000 Region of Interest encoding, and we will also have the mask exclude any/all parts of the photo. Then we can encode the photo as part of the background, and try to get higher quality at the regions where we think we have photos. openjpeg/grok has some form of ROI and kakadu also has -roi in combination with Rshift/Rweight/Rlevels. I haven't gotten this to work in the past, but maybe we ought to re-try when we have ocr_photo support in hOCR files.

This is relevant to my interests. I'm also curious to see whether ROI-compressing a scan to JP2 using the mask recode_pdf generates yields a good image. Just using that single image in the PDF as a mixed foreground/background without a mask may be an interesting middle ground where we don't have to use a (relatively slow) JBIG2 mask, but still get a better compression ratio than classic JPEG-compressed PDFs.

If it helps, using a mask image with kakadu is discussed briefly in Advanced JPEG 2000 image processing techniques:

kdu_compress -i image.ppm -no_weights -o image.jp2 -precise -rate - \
  Cblk={32,32} Ckernels=W9X7 Clayers=12 Clevels=5 Creversible=no Cycc=yes \
  Rweight=16 Rlevels=5 -roi mask.pgm,0.5

This example compresses a color image losslessly using the ROI "Significance Weighting" method, using an image mask to specify the ROI. (...) [The] distortion cost function which drives the layer formation algorithm is modulated by region characteristics.

Reid, J. (2003). Advanced JPEG 2000 image processing techniques. Proceedings of SPIE, 5203(1), 223-224.

MerlijnWajer commented 2 years ago

@Redsandro - right, please feel free to try and toy around with kdu_compress ROI encoding. I have not added an option to dump all images as lossless (say png or tiff) before encoding as a debug feature, but I could add that if you plan to toy with it. I had very limited luck trying to use the kakadu ROI encoding, but I might have done it wrong.

rmast commented 2 years ago

I don't think a 1:1 mask generated from a binarized picture would reveal regions of interest; most of the page is just fuzzy black or fuzzy white. A page with a fuzzy signature could probably benefit from this ROI for the signature. The question is whether recognition of those ROI spots can be automated or requires manual work.

MerlijnWajer commented 2 years ago

If you can get ROI encoding working in Kakadu, I can add support for the hOCR ocr_photo element, which Tesseract can now emit: https://github.com/tesseract-ocr/tesseract/pull/3710 - that's probably a good start, although I suspect it won't help with comics in particular.
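
For reference, a hypothetical helper (not part of archive-pdf-tools) that turns ocr_photo bounding boxes from an hOCR file into a PGM mask usable with kdu_compress -roi could look roughly like this, assuming the elements carry the usual title="bbox x0 y0 x1 y1" property:

```python
# Build a PGM ROI mask from the ocr_photo elements of an hOCR file:
# photo regions become white (high priority), everything else black.
# Hypothetical helper, not part of archive-pdf-tools.
import re
from PIL import Image, ImageDraw

BBOX_RE = re.compile(r'bbox (\d+) (\d+) (\d+) (\d+)')

def ocr_photo_roi_mask(hocr_path, page_width, page_height, out_path="roi-mask.pgm"):
    mask = Image.new("L", (page_width, page_height), 0)  # black background
    draw = ImageDraw.Draw(mask)
    with open(hocr_path, encoding="utf-8") as f:
        for line in f:
            if "ocr_photo" not in line:
                continue
            m = BBOX_RE.search(line)
            if m:
                x0, y0, x1, y1 = map(int, m.groups())
                draw.rectangle([x0, y0, x1, y1], fill=255)  # white = ROI
    mask.save(out_path)
    return out_path
```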

Redsandro commented 2 years ago

@MerlijnWajer commented:

please feel free to try and toy around with kdu_compress ROI encoding.

After toying around, I observe that Rweight and Rlevels control how far the encoder deviates from the rate-based compression. The number after the mask filename specifies the baseline value in the mask, between 0 (black) and 1 (white).

kdu_compress -i in.tif -o out.jp2 -no_weights -rate 0.5 Creversible=no Rweight=16 Rlevels=2 -roi mask.pgm,0.5

[image: mask.pgm]

[image: in.tif]

By default the ROI mask consists of 128x128 pixel patches.

[image: out-default.jp2]

To make a more accurate mask, you need to set Cblk, e.g.: Cblk={16,16}. (You probably need to escape that with your shell, e.g. Cblk=\{16,16\})

[image: out-16.jp2]

The thing to keep in mind, though, is that setting a lower Cblk makes every code block smaller, which causes less efficient encoding and a bigger file - but not by a lot if you keep it sane and don't go below 8x8. The default was 64x64 in an older version and is now 128x128; perhaps, since nobody uses ROI anyway, it was worth saving those extra kBs. So mask accuracy comes at a cost. Maybe it's worth it, but that would require some experimentation on the relevant data. Perhaps, using this experiment as a starting point, you can get better results.

If you want to know more about flags, this is helpful, although some of the defaults are different on my build/machine.

MerlijnWajer commented 2 years ago

Hey, looks like you actually got it to work. That's great. I'll try to look at how we can use/integrate this to compress better (accuracy / size).

rmast commented 2 years ago

When I think of a way to get the PostNL bill compressed that I used before as a test subject, I could imagine using the high-density part of the square ocr_photo frame around the logo as ROI. The grey dithered drawings at the bottom are not detected as ocr_photo by Tesseract, so they would just end up greyish outside the ROI. That way, using the ROI in the background picture, I would expect the ROI to preserve some quality for the logo in the background. I'm curious whether this would give better quality/compression.

MerlijnWajer commented 2 years ago

@Redsandro - JFYI, I still plan to work on this; I just had a long work trip and am only now coming back to it, and these kinds of improvements are more or less spare-time projects. Maybe in a few days I can make a branch with this integrated.

One thing we'll need is some testing framework to do comparisons (PSNR, SSIM, etc). I think I made a start with that, so we could compare to see how well ROI helps with compression ratios and quality.
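
A minimal sketch of such a comparison, assuming scikit-image is available and that the recoded page has already been rendered back to an image of the same size (e.g. with pdftoppm or mutool):

```python
# Compare an original page image against the same page after recoding.
# Sketch only; both images must have identical dimensions.
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compare(original_path, recoded_path):
    a = np.array(Image.open(original_path).convert("L"))
    b = np.array(Image.open(recoded_path).convert("L"))
    psnr = peak_signal_noise_ratio(a, b, data_range=255)
    ssim = structural_similarity(a, b, data_range=255)
    return psnr, ssim

print(compare("page-original.png", "page-roi.png"))
```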

Redsandro commented 2 years ago

JFYI I still plan to work on this, I just had a long work trip and am only just coming back to this, and these kinds of improvements are more or less spare-time projects.

No problem, I understand. You may want to manually try what you had in mind initially, to see if it is roughly as useful as we hope. If the quality gain per compression ratio really isn't interesting once you keep the code-block size limitations in mind, it would be a waste to set up a lot of scaffolding.

rmast commented 2 years ago

The grey ABN AMRO text at the top of the ABN AMRO letter is recognized by Tesseract as size-75 text in a bounding box. The shield logo to its left appears to be recognized as an apostrophe in a bounding box. Would you use the ROI detail for compressing these bounding boxes in JPEG2000, or would you still use JBIG2 to make sharp mask boundaries over a rough color picture? Those letters are just grey on white; the shield logo is green/yellow on white.

MerlijnWajer commented 2 years ago

I figured we'd still use the JBIG2, and just get more quality for the parts that we care about. We could see how it works without the mask, but I'm a bit sceptical.

rmast commented 2 years ago

So all text will be masked by JBIG2 and colored by a low-res coloring picture, and photo elements will get ROI attention in the background picture. Does that usually mean text at 300 dpi and a background picture at 100 dpi?

MerlijnWajer commented 2 years ago

I have pushed some code here: https://github.com/internetarchive/archive-pdf-tools/tree/roi

The wheels will end up here: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2276084863

I can run it like so (for testing purposes):

recode_pdf -v --bg-compression-flags '-no_weights -rate 0.005 Cblk={16,16} Creversible=no Rweight=16 Rlevels=2' --fg-compression-flags '-no_weights -rate 0.075 Cblk={16,16} Creversible=no Rweight=16 Rlevels=2' --dpi 300 -J kakadu -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out-roi.pdf

ROI mode is currently enabled when "Creversible=no" is found in the flags (literally) - and that is a hack, I know.

The background seems to improve with the mask, the foreground not so much? (For the background, we use the inverted mask) - I hope I didn't swap the inverted-ness for background/foreground.

With the above parameters the size is about the same as without roi and default slope values.

Maybe give it a try?

MerlijnWajer commented 2 years ago

I have also pushed a commit where I swapped the inverted-ness, which will build separately as an action: https://github.com/internetarchive/archive-pdf-tools/actions/runs/2276139177

I also changed the rate from 0.005 to 0.01 for the bg compression flags locally, so you might want to update my command line for that; I think it is a fairer comparison. Here is my command, with downsampling added in:

recode_pdf -v --bg-compression-flags '-no_weights -rate 0.01 Cblk={8,8} Creversible=no Rweight=16 Rlevels=2' --fg-compression-flags '-no_weights -rate 0.075 Cblk={8,8} Creversible=no Rweight=16 Rlevels=2' --dpi 300 -J kakadu -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out-roi.pdf --bg-downsample 3

vs the 'normal':

recode_pdf --bg-downsample 3 -v --dpi 300 -J kakadu -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out.pdf

The background definitely looks less noisy.

MerlijnWajer commented 2 years ago

It might also make sense to use different Cblk values for the background and the foreground, I imagine. For the background we probably don't need the regions to be that small, but for the foreground we likely do want that.
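
For example, an untested variant of the earlier command with a coarser code block for the background and a finer one for the foreground:

recode_pdf -v --bg-compression-flags '-no_weights -rate 0.01 Cblk={32,32} Creversible=no Rweight=16 Rlevels=2' --fg-compression-flags '-no_weights -rate 0.075 Cblk={8,8} Creversible=no Rweight=16 Rlevels=2' --dpi 300 -J kakadu -I /tmp/in.png --hocr-file /tmp/in.hocr -o /tmp/out-roi.pdf --bg-downsample 3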

MerlijnWajer commented 2 years ago

[image: roi-diff]

Definitely seems to make a difference for background noise...

MerlijnWajer commented 2 years ago

I think in general this looks like it can offer an improvement, but I'll need to think about how it can be integrated properly. Maybe it's time to offer some encoding "profiles", so that people can pick one without having to fiddle with the exact OpenJPEG flags, kakadu rates, etc.

rmast commented 2 years ago

I experimented with didjvu via c44 a while ago: https://github.com/jwilk/didjvu/issues/19. With subsample ratios of 3 to 5 for the background picture, c44 was able to almost clear the background (I guess by using the patented vector estimation to filter out the surrounding fuzz that results from partial pixels with a color between the foreground and background). The patent has expired and c44 is open source, so there might be another option for clearing the background fuzz.

MerlijnWajer commented 2 years ago

@rmast - could you share some command lines to go from a tif/png/pgm/etc. background image (before I "optimise" them, but after "removing" the foreground) to a djvu component and then back to png? That would ease testing.
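
For example, a plain c44 round trip (assuming djvulibre's c44 and ddjvu; didjvu presumably does more on top of this) would be something like:

c44 bg.ppm bg.djvu
ddjvu -format=ppm bg.djvu bg-roundtrip.ppm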

MerlijnWajer commented 2 years ago

The ROI encoding I think ought to be useful in any case (at least in theory).

rmast commented 2 years ago

I looked into how didjvu calls c44 with masks. It appears to use mask.erode and mask.dilate before calling c44 for the foreground and background, to clear the foreground and background fuzz, so I was probably wrong in assuming c44 does the trick.
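
Roughly, the idea seems to be to grow the text mask a little and blank those pixels out of the background before encoding. A sketch of that idea only (not didjvu's actual code; filenames are hypothetical):

```python
# Dilate the (black-on-white) text mask slightly and wipe the covered pixels
# from the background image, so the encoder doesn't spend bits on the fuzzy
# halo around the text.
from PIL import Image, ImageFilter

page = Image.open("page.png").convert("RGB")
mask = Image.open("mask.png").convert("L")       # black text on white

# MinFilter grows the black (text) areas: a cheap dilation of the mask
dilated = mask.filter(ImageFilter.MinFilter(5))

# Paint white wherever the dilated mask is black (text plus its halo)
stencil = dilated.point(lambda v: 255 if v == 0 else 0)
white = Image.new("RGB", page.size, "white")
bg_clean = Image.composite(white, page, stencil)
bg_clean.save("bg-clean.png")
```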

MerlijnWajer commented 2 years ago

Could you provide some literal steps I could try to reproduce what the didjvu stuff does for the background generation?

MerlijnWajer commented 2 years ago

BTW: this branch contains a bunch of test images (and hocr files to go along with them) in case you wanted to try the djvu stuff on other examples: https://github.com/internetarchive/archive-pdf-tools/tree/tests/tests

rmast commented 2 years ago

It's as if you can read my mind about those test cases; I looked up your above example to replay some scenarios. This is the default result of DjVuSolo3.1/DjVuToy; unfortunately the text isn't correctly thresholded: 0 - sim_english-illustrated-magazine_1884-12_2_15_0000 - 2414 x 3560.pdf

This unfortunately has no command-line replay steps.

This is the result of didjvu followed by DjVuToy to make a djvu back into a PDF: english-magazine.pdf

The first step is done by ~/didjvu$ ./didjvu encode -o english-magazine.djvu ~/0\ -\ sim_english-illustrated-magazine_1884-12_2_15_0000\ -\ 2414\ x\ 3560.png --fg-slices 100

In the original, the text from the other side of the page shines through. The best thresholding algorithm, in my opinion, would just wipe the other side but leave the letters on top complete, without dents, and still not glue letters together - ending up with just a bitonal image instead of an MRC picture. Finding the best threshold usually seems to be a manual selection process, though I have seen some attempts that involve a GPU doing heavy AI on it, already using knowledge of the characters to do the thresholding.

I am now distracted by trying to get a better bitonal picture from this example. It's quite a difficult example for thresholding to a clean bitonal picture of the intended print on one side of the paper.

MerlijnWajer commented 2 years ago

Yeah, there are some ways to improve the sim_english scenario, but they in turn cause issues with other scenarios. The DjvuToy background looks good, but it does seem like it might mess up images a bit more.

rmast commented 2 years ago

For example this binarizer is tempting to try, despite the somewhat open characters in the example-result: https://github.com/NVlabs/ocropus3-ocrobin

MerlijnWajer commented 2 years ago

There is actually an ocropus4 in the works, which is likely going to be faster/better: https://github.com/ocropus/ocropus4 - I've talked about it with Tom in the past, but I haven't been able to dedicate much of my time to helping out.

At this point, should we create a separate issue for figuring out how a DjVu implementation could maybe help? Or maybe ScanTailor can? Maybe it makes sense to open an issue with an overview of the various other projects.

rmast commented 2 years ago

I guess we should look for some good examples that will really benefit from MRC. Your manually filled COVID health form is a good example. I also tend to take letters with a logo and a signature, but I should anonymize some first.

Black-and-white examples are an invitation for other approaches. I tried ScanTailor Universal and Advanced today on your English example, but wasn't satisfied with the binarization they offered. Even a GIMP binarizing/smoothing filter from the diybookscanner.org forum didn't satisfy me, although it was the first time it produced a result with default settings. I keep seeing thin lines in characters disappear and characters get dented when binarizing.

rmast commented 2 years ago

I now tried kraken -i ~/sim_english-illustrated-magazine_1884-12_2_15_0000.tif bw.png binarize in the hope of trying the nlbin algorithm, and wasn't convinced by the result at all: many dents in the characters where I can still see the strokes in the original picture. Binarizing an old black-and-white print is really a separate, and still not satisfactorily solved, challenge.

rmast commented 2 years ago

I now found a possible reason for all those dents in the results of binarization by neural networks. The main contest for binarization, DIBCO, uses ground truths that introduce such dents, because some cheap binarizer appears to have been used to create the ground-truth images in the first place. Picture 14 of the 2017 dataset [image] and its assumed ground truth [image]:

The left character in the real image doesn't contain a hole on the right side; the ground truth does.

The sign on top of the second character from the left is separated from the character much more clearly in the original image than in the ground truth. With such flawed datasets to begin with, I don't expect any quality improvement over the original binarizer that appears to have been used for the original ground truths.

rmast commented 2 years ago

If you look at the Google scan of the same book, Google alternates bitonal and greyscale pages, using greyscale where images are visible:

https://babel.hathitrust.org/cgi/pt?id=mdp.39015056059697&view=1up&seq=145&q1=gainsborough

