Can we extract attachments from a PDF?

tigran123 commented 12 years ago

Does anyone know if mupdf can extract attachments from a PDF file?

tigran123 commented 12 years ago

Ok, pdftk can do it, but one still has to build it for Kindle. That is a very nice cross-compile project for me.... (or maybe I'll look at its source and just extract the attachment extraction code into a separate utility, or even try to build it into KPV itself :)

chrox commented 12 years ago

@tigran123 I disagree that converting DjVu to PDF grows the file size by 200-500%. I know a tiny program called DjVuToy which can do a nearly lossless conversion from DjVu to PDF and keep nearly the same file size for most DjVu files. Believe me, I have tested it dozens of times and converted all my DjVu files to PDF. I say "most DjVu files" only because that is how the author of the program puts it; in practice it has never failed in my conversions.

You can download it for free from http://www.comicer.com/stronghorse/software/index.htm#DjVuToy . The pity is that it is closed-source and runs only on Windows or under Wine.

The author of DjVuToy wrote a wonderful article about converting DjVu documents to PDF at http://www.comicer.com/stronghorse/water/software/djvu2pdf.htm . It explains how to convert DjVu to PDF nearly losslessly without the file size growing. The article is written in Chinese. I don't know if you read Chinese, but the main idea is as follows.

Conventionally, a conversion from DjVu to PDF using djvulibre first renders the DjVu pages to TIFF images and then converts the TIFFs to PDF pages. DjVu uses several advanced compression techniques: the MRC (Mixed Raster Content) document model, JB2 compression for characters and IW44 compression for images. The conventional method ignores the MRC model of the original DjVu file and uses standard JPEG 2000 to compress the TIFF images rendered from the DjVu document. That is why the converted PDF files are usually several times larger than the original DjVu files.

PDF, however, does support a document model similar to MRC, called the Transparent Imaging Model. PDF 1.4 supports JBIG2 compression, which is the counterpart of DjVu's JB2, and PDF 1.5 supports JPEG 2000 compression, which is even more advanced than the IW44 used by DjVu. So if the layers of a DjVu page are converted to the corresponding PDF layers, the process can be nearly lossless, and the converted file size should stay almost the same.

DjVuToy is an implementation of the conversion method described above, and I highly recommend that you try it.

tigran123 commented 12 years ago

Yes, I heard about this program but I did not believe what I heard because it seemed impossible even in theory. But I would be MOST HAPPY to be proven wrong. The reason I think it is impossible is that the JBIG2 format used by DjVu is not really supported by any PDF readers, so PDF files created using JBIG2 won't be readable anywhere. However, this was the situation 5-10 years ago. Maybe things have changed a bit in 10 years... who knows... (I am very conservative and think that since the 1970s, i.e. after Unix appeared, nothing really changed in computing except the creation of Linux and TeX, the two main giants of computer technology)

Anyway, I will definitely try this IMMEDIATELY. And you have no idea how grateful I will be to you if you turn out to be right. It will be an almost "life-changing" event. But I have more than 1TB of very rare and important books in DjVu format (I did data-archiving work for the second largest library in the world and kept a "local copy" of all the books I scanned for them :) so it will take a long time to convert them all to PDF... We'll see.

Btw, on the mupdf front --- I figured out how to do this. In fact the example utility mupdfshow is capable of extracting it very simply like this:

$ mupdfshow -b file.pdf obj_num > attachment.djvu

The only issue is to figure out the correct obj_num for the object in question, but that is fairly easy too. Looking at the file in vi makes it obvious; doing it in a program is slightly more difficult, but still doable, of course.
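As a rough illustration of finding obj_num programmatically, here is a minimal Python sketch. It is a heuristic scan over the raw bytes, not a PDF parser, and it rests on an assumption the thread does not confirm: that the attachment's stream dictionary carries the optional /Type /EmbeddedFile entry. The function name and the sample bytes are made up for the example.

```python
import re

# Matches "N G obj" followed by the object's dictionary, up to the start of
# its stream data (or the end of the object if it has no stream).
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\b(?:stream|endobj)\b", re.DOTALL)


def embedded_file_objects(pdf_bytes):
    """Return the object numbers whose dictionary mentions /EmbeddedFile.

    Assumption: the producer wrote /Type /EmbeddedFile into the stream
    dictionary. The entry is optional, so some files will not be found.
    """
    return [
        int(m.group(1))
        for m in OBJ_RE.finditer(pdf_bytes)
        if b"/EmbeddedFile" in m.group(3)
    ]


# Made-up fragment of a PDF body, just enough to exercise the scan.
SAMPLE = (
    b"5 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"7 0 obj\n<< /Type /EmbeddedFile /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
)

print(embedded_file_objects(SAMPLE))  # [7]
```

Each number it returns could then be tried as obj_num in the mupdfshow command above.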

Anyway, I will now try DjVuToy and if it really lives up to your description then it will be... well, a miracle! You will have changed my life, as I said! :)

chrox commented 12 years ago

@tigran123 Yes. The PDF files converted from DjVu by DjVuToy are readable in Adobe Reader X on Windows, in Okular on Ubuntu and of course in KPV PDFReader and PDFReflow on Kindle3.

tigran123 commented 12 years ago

Ok, I converted a 70M DjVu file and the PDF file is 88M in size, i.e. the DjVu is about 21% smaller than the PDF, which means the PDF has grown by about 26%. But still this is much better than the original PDF file which was 103M in size (I still kept it). And the good thing about DjVuToy is that it preserved the embedded outline and converted it to PDF bookmarks (TOC) correctly.

Thank you very much, I will experiment some more and if, indeed, the average overhead is within 20-25% then it is acceptable, i.e. many of my DjVu files can be migrated to PDF and stored in Amazon storage....

But growth of about a quarter cannot be called "staying almost the same".

chrox commented 12 years ago

@tigran123 That would be the worst case I have heard of. Just now I tested a DjVu file of 6.5 MB, and the converted PDF file is 5.7 MB with no visible degradation of visual quality. Most of the time the converted PDF grows by less than 10%. You should try more conversions.

tigran123 commented 12 years ago

Also, I noticed that it doesn't work on read-only mounted filesystems because it tries to create the output in the same directory as the source file. The target directory setting is overwritten when the source is specified.

tigran123 commented 12 years ago

Maybe you are using some options, like reducing the DPI to a fixed value?

chrox commented 12 years ago

@tigran123 Or you could move the DjVu file to a tmp directory and convert it there.

chrox commented 12 years ago

@tigran123 It also seems to lack batch processing and a command-line interface.

tigran123 commented 12 years ago

Well, it seems to support multiple file conversions. I haven't tried it yet. I'll do it now.

chrox commented 12 years ago

@tigran123 Oh yes. I set the fixed DPI to 96.

tigran123 commented 12 years ago

Ah, and I leave it as "document setting" and most of my DjVu files are either 600dpi or 1600dpi.

Btw, 96dpi files would be unreadable on Kindle. You should set it to at least 300dpi. (I know that Kindle's screen is 170dpi, but one normally reads documents by zooming in using Shift-X/Shift-S, so the effective expected resolution is about 300dpi)

A few days ago we had a denial of service attack on github.com. Now I need to download windjview from sourceforge and it seems they are under a DoS attack as well...

chrox commented 12 years ago

@tigran123 But the DPI setting doesn't affect the converted file size much. After I set the DPI to 300 the generated PDF file is still 5.7MB, exactly the same as when converting at 96 DPI.

chrox commented 12 years ago

@tigran123 According to the author's article, the file size doesn't change much after conversion if the original DjVu pages have 1 layer or 3 layers. But if a page has 2 layers (with color text), the converted PDF file will grow a lot: there is no 2-layer model in the PDF document structure, so the page has to be converted to 3 layers. Maybe that's the reason for the size growth. The author also said that the JB2 to JBIG2 conversion is completely lossless while the IW44 to JPEG 2000 conversion is not. The program generates a JPEG 2000 stream of roughly the same length as the original IW44 stream, which you should consider before converting your whole DjVu archive.

tigran123 commented 12 years ago

My DjVu files are always BITONAL with no colour (or any other) text layer. I don't believe in adding an OCRed "hidden text" for search purposes because it is always of terrible quality, so I like pure single layer images, nothing else. So, no, that is not the reason for the growth in my case, as you can see if you download the djvu file I pointed to.

tigran123 commented 12 years ago

Also, I disabled the option "convert hidden text" anyway, this is for converting DjVu files created by other people (there are plenty such in my archive --- not all the good books in the world are created by myself, of course :)

tigran123 commented 12 years ago

Ok, I converted another randomly chosen file and the size ratio is 47867896/64474285 = .74243391764639189096

So the DjVu is about 26% smaller than the PDF, i.e. the PDF is about 35% larger...

I'll try another totally different (and created by someone else) file and see if this 20-25% increase is the same for all files.

tigran123 commented 12 years ago

Actually, for other files it is even worse, the PDF comes out about 39% larger (the DjVu is about 28% smaller than the PDF), look:

$ l Greek-Analytical-Lexicon.*
-rw-rw-r-- 1 tigran tigran 16754562 Oct 23 15:55 Greek-Analytical-Lexicon.djvu
-rw-rw-r-- 1 tigran tigran 23349864 Oct 23 16:32 Greek-Analytical-Lexicon.pdf
$ bc -ql
16754562/23349864
.71754430775271324920

I'll try a few more, but my current conclusion is that the original DjVu file is about 25-30% smaller than the PDF produced by DjVuToy, i.e. the PDF is a third or more bigger than the original. So, it still makes sense to implement an attachment extraction facility and embed DjVu directly in PDF as attachment (this has 0.003% size increase).

And here is another, of smaller size:

$ l Souter-Pocket-Lexicon.*
-rw-rw-r-- 1 tigran tigran 3637460 Oct 23 15:55 Souter-Pocket-Lexicon.djvu
-rw-rw-r-- 1 tigran tigran 4826042 Oct 23 16:58 Souter-Pocket-Lexicon.pdf
$ bc -ql
3637460/4826042
.75371494902033591916

So, yes, a DjVu/PDF ratio of about 0.75 (the PDF about a third larger) seems to be normal for good quality, properly made DjVu files. Of course I don't disbelieve what you are saying. I simply think that all the DjVu files you tried were probably very poorly made and so DjVuToy managed to produce PDF files of the same or even smaller size.
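The ratios printed by bc above are DjVu size divided by PDF size, so the same conversion can be quoted in two ways: a ratio of about 0.75 means the DjVu is about 25% smaller than the PDF, and equally that the PDF is about 33% larger than the DjVu. A small Python sketch of both figures, using the byte counts from the two listings above (the function name is only for illustration):

```python
def size_change(djvu_bytes, pdf_bytes):
    """Return (DjVu/PDF ratio, PDF growth relative to the DjVu size)."""
    ratio = djvu_bytes / pdf_bytes        # the number bc printed above
    growth = pdf_bytes / djvu_bytes - 1   # how much larger the PDF is
    return ratio, growth


# Byte counts copied from the two directory listings above.
FILES = [
    ("Greek-Analytical-Lexicon", 16754562, 23349864),
    ("Souter-Pocket-Lexicon", 3637460, 4826042),
]

for name, djvu, pdf in FILES:
    ratio, growth = size_change(djvu, pdf)
    print(f"{name}: DjVu/PDF = {ratio:.3f}, PDF is {growth:.1%} larger")
```

For these two files the growth relative to the DjVu works out to roughly 39% and 33%.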

chrox commented 12 years ago

@tigran123 I have deleted most of my DjVu files after converting them to PDF, but 5 DjVu files remain for testing.

DJVU size (bytes)	PDF size (bytes)	Size change relative to PDF size, (PDF - DJVU)/PDF
14972100	13691001	-0.094
11152495	10269955	-0.086
9245902	10048226	0.080
6473893	5739800	-0.128
7978140	11718491	0.320

Update: Yes. File size growth is probably larger for high-resolution DjVu documents.

tigran123 commented 12 years ago

Yes, looks like it. (I updated previous message just before seeing this latest message of yours :)

tigran123 commented 12 years ago

Ok, I'll now dig into mupdfshow.c source code and try to extract the bits we need to save the attachment.

tigran123 commented 12 years ago

@chrox I did some minor editing to our conversation for obvious reasons. I hope you understand and don't mind :) (if not, email me at tigran@bibles.org.uk and I'll clarify)

tigran123 commented 12 years ago

To avoid bloating KPV (and to avoid the risk of instabilities such as interfering with the mupdf context we maintain in pdf.c) I'll implement it as a separate utility which we can simply os.execute() to get the job done. This would be much safer than actually embedding it in KPV, because otherwise we would need to slightly change the way we compile pdf.c and I don't want to risk this.

eLiNK2gl commented 12 years ago

Speaking of high quality DjVu (>=600 dpi): the preferred approach would be to convert from DjVu to PDF and apply Acrobat's ClearScan. The result (the hidden layer produced) is not perfect, but you get nicer-looking text and a PDF file smaller than the DjVu file.

tigran123 commented 12 years ago

Yes, I agree that for those books which deserve such a level of attention and care, passing them through some sort of "specks filter" would be appropriate. But the same could be done while staying with the DjVu format (see the cjb2 -clean option); there is no need to convert to PDF for this (unless Acrobat ClearScan can do a better job than the cjb2 command on Linux).

In practice, those few books which would deserve such careful treatment are also the ones where we cannot afford to risk any corruption (e.g. the master 1600dpi scans of the 1955 edition of Fifth Epochal Revelation which I maintain; it is not just a "book", it is a "legal document" in a certain, as yet unrevealed sense), so I fear to pass it through any filters for this reason...

eLiNK2gl commented 12 years ago

The essence of ClearScan is that it creates a custom font specific to your book, hence the decrease in size.

tigran123 commented 12 years ago

Ah, I see. I thought it was just clearing the specks of dust that got in the way during the scanning process, like cjb2. What you are describing is FAR more dangerous, i.e. it implies OCR and I don't think we have any reliable OCR technology on this planet yet. I would never trust a text that has been produced by an OCR process.

eLiNK2gl commented 12 years ago

I admitted up front the process is not free of errors. Still, the end result is not bad.

tigran123 commented 12 years ago

Ok, thank you. ClearScan... I must remember to try it sometime :)

tigran123 commented 12 years ago

My little extr tool is ready. It extracts all attachments and saves them under the correct filenames, which are also embedded in the PDF file. It only processes attachments on the first page of the document, so that shall be our convention. Of course it could walk through all the pages, but that would be extremely slow, as it needs to load each page and examine its attachments (if any). Actually, I can add an optional argument to specify which page to examine, so that one could use it to extract from "foreign" PDF files with attachments, i.e. those produced by others.

I can also include in the KPV source, if you wish, the simple TeX driver or even a shell script that invokes TeX to produce a PDF file with a given file embedded as an attachment.

And I propose to bind it to the key Alt-S in PDFReader: the user opens the PDF file on an appropriate page (one that has attachments) and presses Alt-S, and KPV executes extr /fullpath/file.pdf pageno, which creates /fullpath/file.djvu (or more than one file in that same directory) extracted from page pageno.

tigran123 commented 12 years ago

Closing the issue, as the PR with this feature is already in master.

tigran123 commented 12 years ago

I would like to make a few further comments in the hope that these facts will be of use to someone else as well:

The latest version of MuPDF renders scanned PDF files considerably better than DjVu. This forced me to reconsider the use of DjVuToy application.
I found a rather large class of DjVu files for which DjVuToy not only does not increase but actually decreases the size by about 10-15% (e.g. 10 volumes of Landau/Lifshits Kurs, 3 volumes of Fixtengolts and many others).
When the resulting PDF files are embedded in another "container PDF" file using the pdfattach utility (part of KPV), the size decreases by another 5-6%. Embedded files are compressed, and the extr utility decompresses them on the fly when extracting on the Kindle device. So it makes sense to embed not only djvu, mp3, etc. files but even PDF files, gaining three advantages: a) smaller size, b) the ability to group PDF files in collections (Amazon's Personal Documents storage is just a primitive linear list of files; no collections or directories are supported) and c) no need to reboot the Kindle after receiving the file via WiFi (and moving it to the proper directory), because the extracted files are not linked to entries in the "Archived Documents" list, so you are free to move them to the correct directory without rebooting the Kindle (the old Alt-Z key to refresh the file list in the framework doesn't work anymore).
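To make the size point concrete, here is a minimal Python sketch of compressing a payload on embedding and restoring it on extraction. It assumes Flate (zlib) compression, which is the usual filter for PDF streams; the thread does not say which filter pdfattach actually uses. The names embed, extract and saving are hypothetical and are not the tools' real interfaces.

```python
import zlib


def embed(data: bytes) -> bytes:
    """Compress a payload as it might be stored in a container file."""
    return zlib.compress(data, 9)


def extract(blob: bytes) -> bytes:
    """Recover the original payload from the stored bytes."""
    return zlib.decompress(blob)


def saving(data: bytes) -> float:
    """Fraction of the original size saved by compression (0.05 means 5%)."""
    return 1 - len(embed(data)) / len(data)


# Made-up, highly repetitive payload; a real scanned PDF would save far less.
payload = b"%PDF-1.4 sample " * 1000
assert extract(embed(payload)) == payload  # the round trip is lossless
print(f"saved {saving(payload):.1%}")
```

How much is saved depends entirely on how compressible the payload still is. A file made mostly of already compressed image streams leaves little to gain, which fits the modest 5-6% reported above.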

So, I would like to thank @chrox again for insisting that I look closer at DjVuToy --- as a result, a part (a relatively small, but still quite significant) of my collection of books is now being converted to PDF and I am very happy with the result --- they actually look sharper on Kindle in PDF format than in DjVu. But I must repeat: this is NOT a GENERIC statement, this is just a "there exists" (not "for every") type of statement. For most of the DjVu files which I created the result of converting via DjVuToy to PDF is about 20-25% larger.

tigran123 commented 12 years ago

I forgot to show the screenshots of djvu vs pdfold vs pdfnew, here they are:

http://www.klib.8tar.com/screenshots/foreword-djvu.bmp http://www.klib.8tar.com/screenshots/foreword-pdfnew.bmp http://www.klib.8tar.com/screenshots/foreword-pdfold.bmp

http://www.klib.8tar.com/screenshots/page183-djvu.bmp http://www.klib.8tar.com/screenshots/page183-pdfnew.bmp http://www.klib.8tar.com/screenshots/page183-pdfold.bmp

koreader / kindlepdfviewer

Can we extract attachments from a PDF? #487