inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

PDF Optimization #1706

Open kaplun opened 7 years ago

kaplun commented 7 years ago

Since the PDF will (likely) always be opened by default in the record editor, we need to add a step to the normal workflow that generates an optimized PDF.

An optimized PDF is one with downsampled images, possibly without the default fonts embedded and, most importantly, linearized (so that PDF.js can start rendering the first page while the rest of the PDF is still downloading).

To that end, Ghostscript is our Swiss Army knife:
http://stackoverflow.com/questions/35370477/how-to-linearize-pdf-with-ghostscript
http://stackoverflow.com/questions/10450120/optimize-pdf-files-with-ghostscript-or-other
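
A minimal sketch of what such a step could run, assuming a `gs` binary on the `PATH` that is recent enough to support `-dFastWebView` (the linearization switch). The function names `build_gs_command` and `optimize_pdf` and the `/ebook` preset are illustrative choices, not something already present in inspire-next:

```python
import subprocess


def build_gs_command(src, dst, preset="/ebook"):
    """Return the Ghostscript argument list for a smaller, linearized PDF."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=" + preset,  # preset that downsamples embedded images
        "-dFastWebView=true",       # linearize so the first page renders early
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        "-sOutputFile=" + dst,
        src,
    ]


def optimize_pdf(src, dst):
    """Run Ghostscript; raises CalledProcessError if it exits non-zero."""
    subprocess.check_call(build_gs_command(src, dst))
```

Keeping the command construction separate from the call makes it easy to tune the preset per workflow and to unit-test the arguments without Ghostscript installed.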

An important aspect here is that this PDF should be generated as part of the workflow, but it is not the PDF that should be attached to the record. It will live for the duration of the workflow and be discarded at the end.
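
One way to get that lifetime, sketched under the assumption that the workflow step can wrap its work in a context manager. `temporary_optimized_pdf` is a hypothetical helper, and `optimize` stands for any callable taking `(src, dst)`, for example a Ghostscript wrapper:

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def temporary_optimized_pdf(src, optimize):
    """Yield the path of an optimized copy of src that exists only inside the block."""
    workdir = tempfile.mkdtemp(prefix="optimized-pdf-")
    dst = os.path.join(workdir, "optimized.pdf")
    try:
        optimize(src, dst)  # hypothetical callable: writes the optimized file to dst
        yield dst
    finally:
        # Discard the optimized copy whether or not the workflow step succeeded.
        shutil.rmtree(workdir, ignore_errors=True)
```

The original PDF at `src` is never touched, so it remains the file attached to the record.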

michamos commented 7 years ago

So what if we want to edit the record later? Will it be slow?

If we go for Grobid, we could probably be more clever and generate PDFs with only the interesting pages according to Grobid (title page, references), same idea as what @fschwenn does now, with the full PDF as a backup in case Grobid gets it wrong.

kaplun commented 7 years ago

In theory we could always keep this low-resolution, linearized version hidden within the record, so that it can always be served to the cataloger.

Alternatively, we can have the record editor opened through a configurable workflow that prepares the PDF as you suggest each time, just before the editor loads.

StellaCh commented 6 years ago

This should be resolved in the record editor. If not, feel free to reopen it.

michamos commented 6 years ago

No, this is about generating optimized PDFs in the backend in order to reduce loading times for curators. So it is mainly a backend thing, and I am reopening it.