huridocs / uwazi

Uwazi is a web-based, open-source solution for building and sharing document collections
http://www.uwazi.io
MIT License

Research PDF optimization #3760

Open txau opened 3 years ago

txau commented 3 years ago

There are a couple of improvements that may have value in our PDF process pipeline:

txau commented 3 years ago

Related issues: https://github.com/huridocs/uwazi/issues/2257 https://github.com/huridocs/uwazi/issues/1645 https://github.com/huridocs/uwazi/issues/1691 https://github.com/huridocs/uwazi/issues/1939

txau commented 1 year ago

(from https://github.com/huridocs/uwazi/issues/2257)

I've been running a lot of tests with your file and while I can't provide a detailed explanation of the source of the problem, I have a solution.

The problem is related to how each PDF reader interprets fonts and spacing. Readers normally rely on internal heuristics to decide where words break; some let you tune the word break spacing and some do not. PDF.js does not, and neither does Ghostscript, so converting page 2 of your file to text with gs -sDEVICE=txtwrite -dBATCH -dNOPAUSE -sPageList=2 -sOutputFile=output.txt bla.pdf (GS 9.50) produces something like this:

INTRODUCTION Thefollowingobjectiveshavebeenadoptedindesigningtheseprotocols: TheprotocolsareconsistentwithexpressedKitchenuhmaykoosibInninuwug values, goals, and vision for the homeland. Theprocessforconductingmeetingsanddevelopingresponseswithin KitchenuhmaykoosibInninuwugshouldbesystematicandproduceconsistent responsesovertimeforsimilarapplications. Responsesshouldbebaseduponallrelevantandreliableinformationconcerning theapplication and its environment setting, including traditional knowledge and ecological knowledge. Theprocessshouldbeefficientandeasytoimplement. Responsesshouldbeproducedinatimelymanner. Positions taken in the responses should be transparent (i.e. supported by facts andanalysis).

On the other hand Poppler seems to handle this properly, so pdftohtml -f 2 -l 2 -xml -wbt 20 bla.pdf bla.xml outputs:

The protocols are consistent with expressed Kitchenuhmaykoosib Inninuwug values, goals, and vision for the homeland. The process for conducting meetings and developing responses within Kitchenuhmaykoosib Inninuwug should be systematic and produce consistent

This is the desired result. However, changing the -wbt parameter (word break threshold) can also yield wrong results, e.g. pdftohtml -f 2 -l 2 -xml -wbt 26 bla.pdf bla.xml:

TheprotocolsareconsistentwithexpressedKitchenuhmaykoosibInninuwug values,goals,andvisionforthehomeland. Theprocessforconductingmeetingsanddevelopingresponseswithin KitchenuhmaykoosibInninuwugshouldbesystematicandproduceconsistent responsesovertimeforsimilarapplications. Responsesshouldbebaseduponallrelevantandreliableinformationconcerning
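For the record, a rough way to spot this kind of glued output automatically is to look at the mean token length of the extracted text. This is only a sketch: avg_token_len and looks_glued are hypothetical helper names, and the default cutoff of 10 characters is a guess that would need tuning per language and collection.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Uwazi, Poppler or Ghostscript.

# Print the mean length of whitespace-separated tokens in a text file.
avg_token_len() {
  awk '{ for (i = 1; i <= NF; i++) { chars += length($i); n++ } }
       END { if (n > 0) printf "%.1f\n", chars / n; else print "0.0" }' "$1"
}

# Succeed (exit 0) when the mean token length exceeds a cutoff, which
# suggests that words were glued together. The default of 10 is an assumption.
looks_glued() {
  awk -v avg="$(avg_token_len "$1")" -v max="${2:-10}" \
      'BEGIN { exit !(avg > max) }'
}

# Example, using the txtwrite output from the gs command above:
#   looks_glued output.txt && echo "word spacing looks broken"
```

Running this on the text extracted before and after a fix gives a quick signal of whether the spacing improved, without reading the output by eye.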

After many rounds of library testing and parameter tweaking I found a formula that actually fixes the PDF so it works properly both in ghostscript and PDF.js.

First, sanitize and compress the file with mutool convert: mutool convert -O garbage -O sanitize -O compress -o bla_clean.pdf bla.pdf

This generates a file, bla_clean.pdf, that already works nicely with PDF.js.

As an additional step, you can also optimize the file for web, which not only reduces the size but also reorders the PDF contents so it will load faster; with GS this time:

gs -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -sColorConversionStrategy=UseDeviceIndependentColor -dEmbedAllFonts=true -dPDFA=2 -sProcessColorModel=DeviceRGB -dPDFACompatibilityPolicy=1 -dFastWebView=true -sOutputFile=output.pdf bla_clean.pdf

This reduced your file from 3.8MB to 264KB (output.pdf), and the result is linearized (optimized for web).
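To double-check that the output really is linearized, Poppler's pdfinfo reports it in its "Optimized" field. A minimal sketch, where is_linearized is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed when pdfinfo (Poppler) prints "Optimized: yes",
# which is how it labels linearized PDFs.
is_linearized() {
  pdfinfo "$1" 2>/dev/null |
    awk -F': *' '$1 == "Optimized" { found = ($2 ~ /^yes/) } END { exit !found }'
}

# Example:
#   is_linearized output.pdf && echo "linearized"
```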

Please note that if you want to preserve the original file intact (for archiving or chain of custody reasons), these steps create a new file with a completely new structure, so it is a good idea to keep a copy of the original. You can, for example, leave the original as a supporting file in Uwazi and use the new one as the main file for online consumption.
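Putting the two steps together, a wrapper along these lines would run them in order and never touch the original. This is a sketch, not an Uwazi feature: the function name optimize_pdf and its argument layout are hypothetical, while the mutool and gs flags are copied as-is from the commands above and have only been tried on this one file.

```shell
#!/bin/sh
# Hypothetical wrapper: optimize_pdf ORIGINAL.pdf OUTPUT.pdf
optimize_pdf() {
  opt_src=$1
  opt_dst=$2
  if [ "$opt_src" = "$opt_dst" ]; then
    echo "refusing to overwrite the original: $opt_src" >&2
    return 1
  fi
  # Use a temporary directory so the intermediate file keeps a .pdf
  # extension, which mutool uses to pick the output format.
  opt_tmp=$(mktemp -d) || return 1
  opt_clean=$opt_tmp/clean.pdf

  # Step 1: sanitize and compress with mutool.
  # Step 2: rewrite with Ghostscript, linearized for web viewing.
  mutool convert -O garbage -O sanitize -O compress -o "$opt_clean" "$opt_src" &&
  gs -sDEVICE=pdfwrite -dBATCH -dNOPAUSE \
     -sColorConversionStrategy=UseDeviceIndependentColor \
     -dEmbedAllFonts=true -dPDFA=2 -sProcessColorModel=DeviceRGB \
     -dPDFACompatibilityPolicy=1 -dFastWebView=true \
     -sOutputFile="$opt_dst" "$opt_clean"
  opt_status=$?

  rm -rf "$opt_tmp"
  return "$opt_status"
}

# Example:
#   optimize_pdf bla.pdf output.pdf
```

The intermediate file is discarded and the input is only read, so the original can still be attached as a supporting file afterwards.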

On a side note, I'm not very familiar with all the GS options (-dPDFA=2, -dPDFACompatibilityPolicy=1). I tried to make your file PDF/A compliant, without luck so far (all text gets removed), so this requires further research. For the record, I'm also looking into VeraPDF for accessibility and standards validation.

Sorry about the long answer. I'm using this thread as a log so that we can maybe implement this as an integrated Uwazi feature, so you and other users don't have to process the files manually. This of course requires more testing, since this solution may only work for your particular file.

I'm also trying to fix https://github.com/huridocs/uwazi/issues/1645 at the same time but it seems to be a different issue. We are also working on OCR integrated into Uwazi, which may help in some of these cases.

(I'll leave this issue open for team discussion).