sumatrapdfreader / sumatrapdf

SumatraPDF reader
http://www.sumatrapdfreader.org
GNU General Public License v3.0

Extremely slow opening of CBZ with JP2 images #1922

Open vrubleg opened 3 years ago

vrubleg commented 3 years ago

archive.org hosts many scanned magazines as zip archives of JP2 (JPEG 2000) images. SumatraPDF can open them as CBZ files, but opening a file takes a couple of minutes, and switching between pages is also very slow. It seems to decode every image in the archive on open, and the JP2 decoder is extremely slow.

How to reproduce:

  1. Download this archive: https://archive.org/download/russian-Katalog_Lego-1997/russian-Katalog_Lego-1997_jp2.zip
  2. Change file extension to CBZ.
  3. Try to open it using SumatraPDF.

vrubleg commented 3 years ago

Another issue, probably related: archive.org also provides _text.pdf files for magazines, and they render very slowly too. Example: https://archive.org/download/russian-Katalog_Lego-1997/russian-Katalog_Lego-1997_text.pdf

The same magazine as a regular PDF or DjVu renders quickly. Examples: https://archive.org/download/russian-Katalog_Lego-1997/russian-Katalog_Lego-1997.djvu https://archive.org/download/russian-Katalog_Lego-1997/russian-Katalog_Lego-1997.pdf

It seems the _text.pdf version uses JPEG 2000 as its image codec, which is why it is so slow. All of these file types are standard on archive.org, and hundreds of scanned magazines use the JPEG 2000 format. It is worth considering a faster decoder.

GitHubRulesOK commented 3 years ago

I avoid overly compressed files when possible. They were necessary 40 years ago, when 9600 baud modems made minimising transmission time essential, but they are literally a waste of time in this day and age. I understand that Internet Archive would tend to use near-lossless storage, but 1 hour of compressing followed by millions of user-hours of decompressing makes no sense at all.

From the Wikimill article on JP2: "Image compression is a type of data compression applied to digital images, to reduce their cost for storage or transmission." (Thus no consideration of end users' needs; after all, they have paid for the privilege of bogging themselves down.)

GitHubRulesOK commented 3 years ago

It would be interesting to know where those docs came from (clearly an amateur, as the 56-page spread was hastily scanned as 58 images). In the CBZ, the smallest page, Page.4 (5), is 5000 x 4600 pixels, which translates into a poster-size page of 1322.9 mm x 1217.1 mm. My dining table is roughly that size, so you would need a 16K monitor for it to be of any value.

In _text.pdf, cover page 1 and Page.4 report 423.3 mm x 389.5 mm, so more like my coffee table. If saved as lossless PNG (even with all the JPEG garbage) at that giant size, as a 190 MB CBZ, they display instantly, the same as the crappy JPEG version.

JPEG is best for photos, NOT document pages. PNG is best for most colour documents, especially printed ones. LuraDoc recoded them into a higher-density compression that was primarily designed to handle tiled mapping / aerial photos that are radio-downlinked and viewed a few at a time, so my view is that it should not be used except for single satellite images where you need to pick out the buildings in detail and can wait a while. I guess the Inter Galactic Archive will need to keep them in that format for interplanetary reading.
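The two sets of millimetre figures quoted for the same 5000 x 4600 pixel page are consistent with a plain pixels-to-millimetres conversion at two different resolutions. The 96 and 300 DPI values below are inferred from the numbers and are an assumption, not something stated in the thread:

```python
MM_PER_INCH = 25.4


def pixels_to_mm(pixels: int, dpi: float) -> float:
    """Convert a pixel count to millimetres at the given resolution."""
    return pixels / dpi * MM_PER_INCH


# Assumed resolutions: 96 DPI (a common default when an image carries no
# resolution info) and 300 DPI (a common scan resolution).
for dpi in (96, 300):
    width_mm = round(pixels_to_mm(5000, dpi), 1)
    height_mm = round(pixels_to_mm(4600, dpi), 1)
    print(f"{dpi} DPI: {width_mm} mm x {height_mm} mm")
# 96 DPI: 1322.9 mm x 1217.1 mm
# 300 DPI: 423.3 mm x 389.5 mm
```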

GitHubRulesOK commented 3 years ago

I ran the 56 pages through IrfanView to convert those last-century J2K wavelets into modern WebP, so this CBZ will work in SumatraPDF but not in older CB readers.

SumatraPDF-56xWebp.zip

There are also much better compressors now, so try this format: it is much smaller but not extreme, which avoids the decompression-chamber delay. 56images_compressed.pdf

vrubleg commented 3 years ago

I don't create these files, so I can't choose another image format. I just need to view these magazines from archive.org as is. In most cases, archive.org stores the original magazine scans as JP2 files, and it would be really nice if SumatraPDF didn't slow down that much in this case.

GitHubRulesOK commented 3 years ago

That LuraDoc format is proprietary, so if badly applied it leaves many non-commercial libraries having to find workarounds. MuPDF has withdrawn its notice of Luratech code inclusion, so I suspect this may be a "won't fix", as the code is unlikely to become FOSS. I noted that Internet Archive throttles its downloads, so they are slow even on a high-speed downlink. The quickest method I found was to send the URL to a cloud de/compressor, wait a few minutes while it suffers the throttling, and then quickly download its bigger, faster-to-open decompressed file at high speed.

kjk commented 3 years ago

It's true that the way we open archives with images is sub-optimal.

To do the layout, we need to know the size of all pages. Currently we need to decompress the full JP2 image to get its size. It's probably possible to get it by reading only the image header.
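As a rough illustration of the header-only idea (a sketch, not SumatraPDF's code): in the JP2 container, the image size is stored in the `ihdr` box inside the `jp2h` superbox, so it can be read by walking box headers without decoding any pixels. The function names and the `probe_bytes` limit are hypothetical, and the sketch assumes the `ihdr` box sits within the first few kilobytes of each file; raw `.j2k` codestreams are not handled.

```python
import io
import struct
import zipfile
from typing import BinaryIO, Dict, Optional, Tuple

# 12-byte JP2 signature box: length 12, type 'jP  ', payload 0x0D0A870A.
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"


def _find_ihdr(stream: BinaryIO, end: Optional[int]) -> Optional[Tuple[int, int]]:
    """Walk boxes until 'ihdr' is found; descend into the 'jp2h' superbox."""
    while end is None or stream.tell() < end:
        start = stream.tell()
        header = stream.read(8)
        if len(header) < 8:
            return None
        length, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if length == 1:  # a 64-bit extended length follows the box type
            ext = stream.read(8)
            if len(ext) < 8:
                return None
            (length,) = struct.unpack(">Q", ext)
            header_len = 16
        if box_type == b"jp2h":
            box_end = None if length == 0 else start + length
            return _find_ihdr(stream, box_end)
        if box_type == b"ihdr":
            payload = stream.read(8)
            if len(payload) < 8:
                return None
            height, width = struct.unpack(">II", payload)  # height comes first
            return width, height
        if length == 0 or length < header_len:
            return None  # box runs to end of file, or is malformed
        stream.seek(start + length)  # skip this box's payload
    return None


def read_jp2_size(stream: BinaryIO) -> Optional[Tuple[int, int]]:
    """Return (width, height) of a JP2 file, or None if it cannot be found."""
    if stream.read(12) != JP2_SIGNATURE:
        return None
    return _find_ihdr(stream, end=None)


def jp2_sizes_in_zip(path, probe_bytes: int = 4096) -> Dict[str, Optional[Tuple[int, int]]]:
    """Read only the first probe_bytes of each .jp2 member to get its size."""
    sizes = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".jp2"):
                with archive.open(name) as member:
                    head = io.BytesIO(member.read(probe_bytes))
                sizes[name] = read_jp2_size(head)
    return sizes
```

Because only the first few kilobytes of each member are inflated, the cost per page no longer depends on the wavelet decoder at all.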

We should also remember mediaboxes for all pages in settings so that the second open doesn't even need to decompress.
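One way this could look, sketched under assumptions: page sizes are stored in a small JSON file keyed by the document's path, size, and modification time, so a changed file invalidates its entry. `PageSizeCache`, `cache_key`, and the JSON store are hypothetical names for illustration and do not reflect SumatraPDF's actual settings format.

```python
import json
import os
from pathlib import Path
from typing import List, Optional, Tuple

Size = Tuple[int, int]


def cache_key(doc_path: str) -> str:
    """Key that changes whenever the file is replaced or modified."""
    st = os.stat(doc_path)
    return f"{os.path.abspath(doc_path)}|{st.st_size}|{st.st_mtime_ns}"


class PageSizeCache:
    """Hypothetical persistent cache of per-page sizes (illustration only)."""

    def __init__(self, store_path: str) -> None:
        self.store_path = Path(store_path)
        try:
            self._entries = json.loads(self.store_path.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            self._entries = {}  # no usable store yet: start empty

    def get(self, doc_path: str) -> Optional[List[Size]]:
        """Return cached page sizes, or None if the document is unknown."""
        sizes = self._entries.get(cache_key(doc_path))
        if sizes is None:
            return None
        return [(w, h) for w, h in sizes]

    def put(self, doc_path: str, sizes: List[Size]) -> None:
        """Remember page sizes and write the store back to disk."""
        self._entries[cache_key(doc_path)] = [list(s) for s in sizes]
        self.store_path.write_text(json.dumps(self._entries), encoding="utf-8")
```

On a second open, a hit from `get` would let the layout run before any image is touched; a miss falls back to measuring the pages and calling `put`.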

Also, we should load / decompress on a background thread and decompress images to memory so that we don't need to keep the archive open / available anymore.
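A minimal sketch of that idea, under one possible reading of it: the archive members are read into memory up front so the file can be closed, and decoding is handed to a worker pool on request. `BackgroundPageLoader` is a hypothetical name, and `decode` is a placeholder callable standing in for a real JP2 decoder, which this sketch does not provide.

```python
import threading
import zipfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class BackgroundPageLoader:
    """Hypothetical loader: holds page bytes in memory, decodes on workers."""

    def __init__(self, archive, decode: Callable[[bytes], object], workers: int = 2) -> None:
        # Read every member once; the archive is closed when the block exits.
        with zipfile.ZipFile(archive) as zf:
            self.names: List[str] = sorted(
                n for n in zf.namelist() if not n.endswith("/")
            )
            self._raw: Dict[str, bytes] = {n: zf.read(n) for n in self.names}
        self._decode = decode
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures: Dict[str, Future] = {}
        self._lock = threading.Lock()

    def request(self, index: int) -> Future:
        """Start (or reuse) a background decode of the page at index."""
        name = self.names[index]
        with self._lock:
            future = self._futures.get(name)
            if future is None:
                future = self._pool.submit(self._decode, self._raw[name])
                self._futures[name] = future
        return future

    def close(self) -> None:
        """Wait for outstanding decodes and stop the workers."""
        self._pool.shutdown(wait=True)
```

The UI thread would call `request(i)` for the visible page and its neighbours, then pick up each result when its future completes, instead of blocking on the decoder. Holding every member in memory trades RAM for not needing the archive again; a real implementation might bound that.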

Not trivial.