Open rcallahan opened 11 years ago
JSTOR has been working since 0.0.10, can you show me a sample that it fails on?
http://diyhpl.us/~bryan/papers2/paperbot/The%20New%20England%20Origins%20of%20Mormonism.pdf
On Thu, Mar 28, 2013 at 9:45 PM, Bryan Bishop notifications@github.com wrote:
> JSTOR has been working since 0.0.10, can you show me a sample that it fails on?
I am still experiencing this issue. I have tested several JSTOR PDFs and cannot scrub the watermark from them with pdfparanoia.
The existing JSTOR scrubber stopped working because JSTOR is now adding watermarks with a different program, one that includes more information and embeds it in a way that is harder to expunge.
The above patches remove watermark strings as before, but in the process we're corrupting the file. mupdf reports:
error: cannot recognize xref format
error: cannot read xref (ofs=2290213)
error: cannot read xref at offset 2290213
Here's what I think's happening:
A PDF file can be thought of as a hierarchy of objects; the most important of these is the Root entry, which "contains references to other objects defining the document’s contents, outline, article threads, named destinations, and other attributes". With the old-style generator, the index of the Root entry was found by consulting the file trailer, which was guaranteed to be at a particular position near the end of the file. With the new generator, this index is instead stored in the dictionary of a cross-reference stream, whose position is recorded as a byte offset at the end of the file.
When we remove watermarks, we're changing the length of objects within the file, breaking that reference; the offset is no longer accurate. This stops the root value from being retrieved, KABLAM!
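To make the failure concrete, here is a minimal sketch of checking whether the trailing offset still points at an xref section. The function names `read_startxref` and `xref_offset_is_valid` are hypothetical, not part of pdfparanoia, and the check assumes a single xref section with no incremental updates.

```python
import re


def read_startxref(pdf_bytes):
    """Return the byte offset recorded after the last 'startxref' keyword."""
    pos = pdf_bytes.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", pdf_bytes[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


def xref_offset_is_valid(pdf_bytes):
    """Check that the recorded offset lands on the start of an xref section.

    A classic table begins with the keyword 'xref'; a cross-reference
    stream begins with an object header such as '2198 0 obj'.
    """
    offset = read_startxref(pdf_bytes)
    target = pdf_bytes[offset:offset + 32]
    if target.startswith(b"xref"):
        return True
    return re.match(rb"\d+\s+\d+\s+obj", target) is not None
```

Running `xref_offset_is_valid` on a file before and after scrubbing should show whether the scrubber shifted the xref section away from the recorded offset.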
We could solve this by, after manipulating objects within pdfparanoia.eraser, determining the new location of the xref section and updating the offset description accordingly. I'll probably get around to this tomorrow.
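A rough sketch of that fix, for discussion only: locate the xref section in the scrubbed bytes and rewrite the number after `startxref`. The helper names `find_xref_offset` and `fix_startxref` are hypothetical and not existing pdfparanoia functions. The sketch assumes a single xref section, either a cross-reference stream marked `/Type /XRef` or a classic `xref` table. It only repairs the trailing pointer; any per-object byte offsets stored inside the xref section itself are left as they were.

```python
import re

OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
XREF_STREAM = re.compile(rb"/Type\s*/XRef\b")
STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")


def find_xref_offset(pdf_bytes):
    """Return the byte offset at which the last xref section begins."""
    hits = list(XREF_STREAM.finditer(pdf_bytes))
    if hits:
        # Walk back to the 'N G obj' header that opens the stream object.
        headers = list(OBJ_HEADER.finditer(pdf_bytes, 0, hits[-1].start()))
        if not headers:
            raise ValueError("xref stream has no object header")
        return headers[-1].start()
    # Fall back to a classic table: the keyword 'xref' on its own line.
    pos = pdf_bytes.rfind(b"\nxref")
    if pos == -1:
        raise ValueError("no xref section found")
    return pos + 1


def fix_startxref(pdf_bytes):
    """Rewrite the trailing startxref value to the current xref offset."""
    match = STARTXREF.search(pdf_bytes)
    if match is None:
        raise ValueError("file does not end with startxref ... %%EOF")
    new_offset = str(find_xref_offset(pdf_bytes)).encode("ascii")
    return pdf_bytes[:match.start(1)] + new_offset + pdf_bytes[match.end(1):]
```

Because the `startxref` line comes after the xref section, changing the number of digits in the offset does not move the section it points to.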
Further errors, now. A sample:
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: cannot find page -1 in page tree
"This content downloaded from X at T" appears at the bottom of every page.