Optimizing PDFs for use with low powered tablets and PDF.js Viewer

rjcorwin commented 10 years ago

The initial advice I received can be found here. @yurydelendik's advice was as follows.

1) Avoid using high resolution images -- 150 dpi resolution for scanned images should be enough for screens and for low powered devices;
2) Try to use JPEG encoding for color images/photos in RGB colorspace when possible;
3) Avoid using expensive compositions/effects such as transitions/masking -- flatten transparency;
4) Avoid using PDF generators (or don't create content) that produce inefficient PDF output, e.g. LibreOffice creates lots of tiny images for vector elements/pictures it does not understand;
5) If there is a setting, use web-optimized PDF output / linearization;
6) Fix or don't produce corrupted PDFs that do not conform to the PDF 32000 specification.

To start, I think it's worth seeing how much of a performance gain we get from doing something simple like running Adobe Acrobat's Reduced Size PDF utility, as you can see being used here. From Adobe's own website, here is their info on optimizing. Those links are about optimizing for file size, though; for our purposes, if file size goes up and performance improves, that is an acceptable tradeoff.

Here's an example of a pdf that loads very slowly: Old French fairy tales

Here's an example of a pdf that loads quickly: Kevin's Birthday

Our last option is abandoning PDFs for another format like image collections. Some years ago, when I saw that some PDFs loaded slowly on tablets and that the UI wasn't that great for tablet viewing, I made an image book reader that you can find at https://github.com/open-learning-exchange/BeLL-Reader. PDFs are nicer for portability because they are one file as opposed to a folder full of images.

Yet, I have to think there must be some way to "flatten" a PDF's pages into images while still keeping them in the PDF format. I wonder if converting a PDF to a folder of images using something like PDFBox's PDFToImage tool and then creating a PDF from those images using Adobe Acrobat would do just that.

On creating a PDF from a collection of images:

"If you have the full version of Acrobat, it'll do it. I just select all the images, then drag them to the Acrobat icon in the dock. It'll create one multi-page PDF (though each page is the size of the image, it doesn't position them on a blank 8.5x11 page)."

bcipolli commented 10 years ago

Interesting discussion! I'm sure the folks over at the FLE will be interested to chime in at some point @aronasorman @jamalex @rtibbles @dylanjbarth.

I'll think about this as well! Thanks for sharing here!

rjcorwin commented 10 years ago

Hi @bcipolli @aronasorman @jamalex @rtibbles @dylanjbarth - Are you guys playing with PDF.js as well?

@jkhokhar I wonder what kind of performance gains we might see if we used PDFBox's PDFToImage converter and then used Adobe Acrobat's Combine Files into PDF task to convert those images into a PDF. I think we can follow some of @yurydelendik's advice by specifying some parameters when using PDFToImage and then making sure the resulting PDF we create doesn't try to do anything fancy.

rjcorwin commented 10 years ago

On a side note, I think HTML may be the holy grail of formats for Open Educational Resources but an HTML resource is often a folder of files as opposed to a file which makes the portability part harder. However, you can include images/styling/content all in one HTML file, it's just not done very often. I wonder if there is a good editor out there tailored towards making single file HTML resources.

bcipolli commented 10 years ago

@rjsteinert Not yet, but PDF is an important OER use-case as well, so always on the radar!

Even if there's not an editor that allows images/styling/content in a single HTML file, perhaps there's a tool that will do it for ya. Kind of like tools that minimize JS files, make sprites, or other tricks for the web.

P.S. How do you embed an image into a HTML file? Base64-encode as a string?

rjcorwin commented 10 years ago

Kind of like tools that minimize JS files, make sprites, or other tricks for the web.

That would be great. It would not be hard to create a script in node/python/php to do that; lowering the barrier to using it would be the trick.

P.S. How do you embed an image into a HTML file? Base64-encode as a string?

Base64 (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Base64_encoding_and_decoding), exactly. I've experimented with making self-editable HTML documents by including Aloha editor with some extra JS to control saving and file name. It works on CouchDB (a RESTful file system). Unfortunately base64 encoding does increase the file size quite a bit (roughly a third). Whether or not that increase in weight is worth it, I'm not sure. Yet, aren't most files actually folders with lots of files inside of them? Maybe what we really need is a spec for a new file format: a Microsoft Word-like format (XML), with HTML specs inside.
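For illustration, here is a minimal Python sketch of the kind of script discussed above: it rewrites local `<img src="...">` references in an HTML string into base64 `data:` URIs so the page becomes a single file. The function names (`to_data_uri`, `inline_images`) and the demo file `pic.png` are hypothetical, and the regular expression only handles double-quoted `src` attributes, so treat it as a starting point rather than a complete tool.

```python
import base64
import mimetypes
import re
import tempfile
from pathlib import Path

# Matches <img ... src="..."> with a double-quoted src (a simplifying assumption).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a data: URI usable in an src attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html_text: str, base_dir: Path) -> str:
    """Replace local image references with embedded base64 data URIs."""

    def replace(match: re.Match) -> str:
        src = match.group(2)
        # Leave remote and already-embedded images untouched.
        if src.startswith(("data:", "http:", "https:", "//")):
            return match.group(0)
        path = base_dir / src
        mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        return match.group(1) + to_data_uri(path.read_bytes(), mime_type) + match.group(3)

    return IMG_SRC.sub(replace, html_text)


# Demo with stand-in bytes (not a real image), written to a temporary folder.
payload = bytes(range(256)) * 12  # 3072 bytes
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "pic.png").write_bytes(payload)
    page = '<p>Hello</p><img alt="demo" src="pic.png">'
    single_file = inline_images(page, folder)

# Base64 turns every 3 bytes into 4 characters, hence the size increase.
print(len(payload), len(base64.b64encode(payload)))  # prints "3072 4096"
```

The demo output shows where the extra weight comes from: 3072 bytes of image data become 4096 characters once encoded.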

rtibbles commented 10 years ago

OERPub have been building on top of the Aloha editor as well. https://github.com/oerpub is probably worth checking out to avoid reinventing the wheel, and to allow the importing of quality OER content easily. Siyavula (a South African based company seeded by the Shuttleworth Foundation) is using this format to produce OER textbooks in Maths and Science.

Richard

rjcorwin commented 10 years ago

@rtibbles Very cool. It looks like we're on the same page as OERPub. I would be interested in connecting with them to see a demo of what they're working on. Has FLE connected with OERPub much?

rjcorwin commented 10 years ago

In other news, the Opera team posted on the Mozilla blog today about PDF.js benchmarking. The conclusion seems to be that PDF.js isn't much slower than native rendering.

rtibbles commented 10 years ago

Apart from running into them briefly at the OER Conference last year, I've not had much contact. I just have it as one of those things to keep on my radar as we push forward with expanding content types and creation.

rjcorwin commented 10 years ago

For potential use later, in the comments of that Opera post on Hacker News, PDFKit was recommended for scripting the creation of PDFs.

rjcorwin commented 10 years ago

Good news on the efforts to optimize PDFs for low powered tablets. My experiments with converting PDFs to images and then back into PDFs have significantly improved the render time for PDFs on the tablets we have at the OLE office. For text based PDFs the downsides include lower readability and larger file sizes (about doubled), but for a PDF like the Old French fairy tales mentioned above, the file size stays the same and it looks just as good. The program I used to accomplish these two tasks of rendering as images and then converting back to PDF was the most recent release of Adobe Acrobat. That program is a bit pricey, so we'll look into free tools with an eye for automating the process.

rtibbles commented 10 years ago

Does that not have the issue of turning the text content into images only? My concern would be that this would reduce the ease of extracting strings from this kind of content to allow for easy crowdsourced translation.

Richard

rjcorwin commented 10 years ago

@rtibbles Good point. It may be the case that for every resource we have a "source file" and then additional files optimized for particular devices and particular languages. Bonus points for being able to build those additional files on demand from the parent ground server.

rjcorwin commented 10 years ago

@aronasorman @jamalex @rtibbles @dylanjbarth @bcipolli As you guys may know, I'm working mainly on Farm Hack now, but I'm still involved with OLE and especially with Ground Computing for Farms (building a temperature alarm system on a Raspberry Pi). I brain dump from time to time in the Ground Computing Google Group; it would be cool to have you guys there as well to share stories around Ground Computing.

yurydelendik commented 10 years ago

For text based PDFs the downsides include lower readability and larger file sizes

If a PDF has text and embedded fonts, I would not recommend converting it to images. If the PDFs are scanned (color) images, the best choice is to convert them to JPEG at a low DPI. I did not test this on low end devices, but try low-DPI JBIG2 for black and white scanned images -- some JBIG2 generators try to recognize repeating blocks/patterns (such as letters), minimizing the size, speeding up the display, and adding an OCR'ed text layer for search.

rjcorwin commented 10 years ago

Thanks for the suggestion @yurydelendik. I'll continue to play with optimization strategies. The big bummer for text based documents is that even the demo file for PDF.js, which seems like a very simple text file, crashes the tablets we use. If you are interested in seeing what the experience is like, some of these cheap tablets have started to make it to market here in the US.

yurydelendik commented 10 years ago

You can analyze the structures of "fast" and "slow" PDFs using our http://brendandahl.github.io/pdf.js.utils/browser/ tool, e.g.:

oldfrenchfairyta00sgrich.pdf is encoded using JPX (JPEG2000) encoding at 300 DPI, which requires lots of processing power and memory from any reader to decode;
kevin's-birthday.pdf is encoded using DCT (JPEG) encoding at 90 DPI (might be too low, but okay for this PDF).

(See Root->Pages->Kids...->Resources->XObject to find the embedded images)
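As a rough complement to the browser tool, the same "which image encoding does this PDF use?" question can be approximated with a byte-level scan. The sketch below is an assumption-heavy heuristic, not a PDF parser: it simply counts image filter names in the raw file, so it will miss dictionaries stored inside compressed object streams. The function name `count_image_filters` and the sample bytes are hypothetical.

```python
import re
from collections import Counter

# Filter names defined by the PDF specification for image-oriented encodings.
IMAGE_FILTERS = {
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",
    b"JBIG2Decode": "JBIG2",
    b"CCITTFaxDecode": "CCITT fax",
}
FILTER_NAME = re.compile(rb"/(DCTDecode|JPXDecode|JBIG2Decode|CCITTFaxDecode)\b")


def count_image_filters(pdf_bytes: bytes) -> Counter:
    """Count occurrences of image filter names in the raw bytes of a PDF.

    Heuristic only: dictionaries hidden inside compressed object streams
    are not visible to this scan.
    """
    return Counter(IMAGE_FILTERS[m.group(1)] for m in FILTER_NAME.finditer(pdf_bytes))


# Hand-written fragment that imitates image dictionaries (not a real PDF).
sample = (
    b"<< /Subtype /Image /Width 984 /Height 1406 /Filter /JPXDecode >>\n"
    b"<< /Subtype /Image /Filter/DCTDecode >>\n"
    b"<< /Subtype /Image /Filter [/DCTDecode] >>\n"
)
counts = count_image_filters(sample)
print(counts["JPEG"], counts["JPEG 2000"])  # prints "2 1"
```

For a real file, something like `count_image_filters(open("some.pdf", "rb").read())` would give a quick hint about whether a document leans on JPEG 2000, which is the expensive case described above.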

yurydelendik commented 10 years ago

@rjsteinert keep in mind, for low end devices such as Firefox OS phones (you may try the simulator) we created a different viewer (the pdf.js engine is the same) to be less memory hungry; see http://107.21.233.14:8877/6f01ace1d4111ff/extensions/b2g/content/web/viewer.html as an example.

rjcorwin commented 10 years ago

@yurydelendik Oooo, that PDF browser will come in handy. Question: How would I calculate DPI on the example below?

img4 (stream) [id: 18, gen: 0]

    Subtype = /Image
    Width = 984
    Height = 1406
    ColorSpace = /DeviceRGB
    BitsPerComponent = 8
    Interpolate = true
    Length = 5137
    Filter = /JPXDecode

On the Firefox OS point, I was wondering how PDF viewing was going to work because I read that the Firefox OS project was focusing on low end devices first. While that goal may be more difficult, it makes me very hopeful that we might be able to see Firefox OS devices in Open Learning Exchange deployments in the near future without having to wait for today's high end technology to drop to affordable price points.

Thanks again!!

yurydelendik commented 10 years ago

How would I calculate DPI on the example below?

It depends on how big the printed image is (at 100%). Let's say your image takes up a full 8.5"x11" page. DPI is dots per inch. For the width, DPI is 984 / 8.5 ≈ 116; for the height, 1406 / 11 ≈ 128.
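The arithmetic above can be written as a small Python helper. This is only a sketch of the calculation described in the comment; the function names `effective_dpi` and `pixels_for_dpi` are made up for illustration, and the full-page 8.5"x11" size is the same assumption made above.

```python
def effective_dpi(pixel_width, pixel_height, printed_width_in, printed_height_in):
    """Return (horizontal, vertical) dots per inch for an image at its printed size."""
    return pixel_width / printed_width_in, pixel_height / printed_height_in


def pixels_for_dpi(target_dpi, printed_width_in, printed_height_in):
    """Return the pixel dimensions needed to hit a target DPI at a printed size."""
    return round(target_dpi * printed_width_in), round(target_dpi * printed_height_in)


# The image from the example, assumed to fill a full 8.5" x 11" page.
horizontal, vertical = effective_dpi(984, 1406, 8.5, 11)
print(f"{horizontal:.1f} x {vertical:.1f} dpi")  # prints "115.8 x 127.8 dpi"

# Pixel size a full-page scan would need at the suggested 150 dpi.
print(pixels_for_dpi(150, 8.5, 11))  # prints "(1275, 1650)"
```

The second helper runs the calculation in reverse, which is handy when choosing an export size for the 150 dpi target suggested earlier in the thread.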

On the Firefox OS point, I was wondering how PDF viewing was going to work because I read that the Firefox OS project was focusing on low end devices first. While that goal may be more difficult, it makes me very hopeful that we might be able to see Firefox OS devices in Open Learning Exchange deployments in the near future without having to wait for today's high end technology to drop to affordable price points.

Did you apply for https://hacks.mozilla.org/2014/02/open-applications-tcp/ ? Also cc'ing @wfwalker

rjcorwin commented 10 years ago

@yurydelendik @wfwalker Unfortunately the Tablet Contribution Program slipped by our radar. Those tablets are quite a bit beefier than what we are using. We currently have a Cortex A8 1.0 GHz processor and 512 MB of RAM. Do you guys know what the price points of those tablets might be when they become available on the market?

Now for an update on our PDF strategy. We gathered a collection of 8 PDFs that we think represents the spectrum of different kinds of PDFs out there and ran three "optimization" methods on them:

1) An automated process we developed
2) A manual process using Adobe Acrobat
3) Converting them to images and using the BeLL Reader app to view them

Our QA team evaluated the results from each method and determined that while methods #1 and #2 now prevent crashes, the user experience is not at the high level we are trying to hold ourselves to. Method #3, on the other hand, is quite satisfactory. Given that this situation may evolve over the coming months, whether through faster affordable hardware or optimized PDF.js code, we're going to maintain rendered image versions of each PDF while keeping the original PDF version for when either of those two shoes drops.

Thank you everyone for your help and we'll keep you updated as the situation evolves.

bcipolli commented 10 years ago

@rjsteinert Any observations on file size difference between the four PDF methods (original and 3 mod'd)?

open-learning-exchange / BeLL-Apps

Optimizing PDFs for use with low powered tablets and PDF.js Viewer #33