hefronmedia / pdfsizeopt

Automatically exported from code.google.com/p/pdfsizeopt

Remove the info fields #77

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I am trying to use pdfsizeopt to find a "normalized" or "canonical"
representation of PDF files, for potential deduplication during backups (or
even for the sake of privacy). It would be nice if pdfsizeopt also allowed one
to remove metadata, such as the user-customizable fields that appear when
pdfinfo is invoked on a PDF file:

Title:          The bytefield package
Subject:        Protocol diagrams for LaTeX
Keywords:       bits, bytes, bit fields, communication, network protocol diagrams, LaTeX2e, memory maps
Author:         Scott Pakin <scott+bf@pakin.org>
Creator:        LaTeX with hyperref package
Producer:       pdfTeX-1.40.10
CreationDate:   Sun Sep  2 13:50:50 2012
ModDate:        Sun Sep  2 13:50:50 2012
Tagged:         no
Pages:          48
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      724524 bytes
Optimized:      yes
PDF version:    1.4

Especially the dates. When would this "normalization" be desirable?

For instance, I frequently find PS files and download them (perhaps on
multiple computers, when I have to stop what I am reading and use another
computer). I then convert them to PDF, since not all environments that I use
have an adequate PS reader.

When I want to back up things, it would be nice to be able to run a program
like hardlink, fdupes, rdfind, duff, etc. to choose which copies I keep and
which copies I don't.

It would also make it easier for deduplicating backup tools (like obnam or bup) 
to save space in such circumstances.

Regards,

Rogério Brito.

Original issue reported on code.google.com by rbr...@gmail.com on 26 Feb 2013 at 11:13

GoogleCodeExporter commented 9 years ago
Yes, it would be a nice and simple new pdfsizeopt feature to remove the info 
fields Title, Subject, Keywords, Author, Creator, Producer, CreationDate and 
ModDate.
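
As a rough illustration of the request (not pdfsizeopt's actual code), the removal can be sketched on a toy model in which the info dictionary is a plain Python dict and the rest of the file is an opaque byte string. The names `strip_info_fields` and `fingerprint`, and the sample values, are hypothetical and exist only for this sketch; no real PDF parsing is done here.

```python
import hashlib

# The eight info fields named in this issue.
INFO_FIELDS_TO_REMOVE = frozenset([
    "Title", "Subject", "Keywords", "Author",
    "Creator", "Producer", "CreationDate", "ModDate",
])


def strip_info_fields(info):
    """Return a copy of an info dictionary without the fields listed above."""
    return {k: v for k, v in info.items() if k not in INFO_FIELDS_TO_REMOVE}


def fingerprint(info, body):
    """Hash the info entries (sorted by key) together with the body bytes."""
    h = hashlib.sha256()
    for key in sorted(info):
        h.update(("%s=%s\n" % (key, info[key])).encode("utf-8"))
    h.update(body)
    return h.hexdigest()


# Two hypothetical conversions of the same PS file, made at different times.
body = b"...identical page content..."
copy_a = {"Producer": "pdfTeX-1.40.10",
          "CreationDate": "Sun Sep  2 13:50:50 2012"}
copy_b = {"Producer": "pdfTeX-1.40.10",
          "CreationDate": "Mon Sep  3 09:12:01 2012"}

# With the dates present the fingerprints differ; without them they match.
differs_before = fingerprint(copy_a, body) != fingerprint(copy_b, body)
same_after = (fingerprint(strip_info_fields(copy_a), body)
              == fingerprint(strip_info_fields(copy_b), body))
print(differs_before, same_after)
```

In this toy model the two copies only become identical because everything else is already equal. Two real PDF files may still differ in other ways after the info fields are gone, so this shows the effect of field removal alone, not a canonical form.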

pdfsizeopt makes no attempt to generate a normalized or canonical output
representation. In my opinion, this feature would be very complicated to
implement, and maybe close to impossible to implement in a usable way, so I
have no plans for it. From now on, this issue will track only the removal of
the info fields.

Original comment by pts...@gmail.com on 27 Feb 2013 at 9:53

GoogleCodeExporter commented 9 years ago
That's OK with me.

If you want to split this issue in two for tracking purposes (remove the info 
fields and, perhaps, in the future, try to make something canonical), feel 
free. Or if you want, I can do that (but I lack the privileges, I think).

Original comment by rbr...@gmail.com on 27 Feb 2013 at 10:05