kanzure / pdfparanoia

pdf watermark removal library for academic papers
https://pypi.python.org/pypi/pdfparanoia

JSTOR watermark #24

Open rcallahan opened 11 years ago

rcallahan commented 11 years ago

The watermark "This content downloaded from X at T" appears at the bottom of every page.

kanzure commented 11 years ago

JSTOR has been working since 0.0.10, can you show me a sample that it fails on?

rcallahan commented 11 years ago

http://diyhpl.us/~bryan/papers2/paperbot/The%20New%20England%20Origins%20of%20Mormonism.pdf


gffa commented 11 years ago

I still see the same issue as of today. I have tested several JSTOR PDFs and cannot scrub the watermark from them with pdfparanoia.

fmap commented 10 years ago

The existing JSTOR scrubber stopped working because JSTOR now adds its watermarks with a different program, which includes more information and embeds it in a way that is harder to expunge.

The above patches remove the watermark strings as before, but they corrupt the file in the process. mupdf reports:

error: cannot recognize xref format
error: cannot read xref (ofs=2290213)
error: cannot read xref at offset 2290213

Here's what I think's happening:

A PDF file can be thought of as a hierarchy of objects; the most important of these is the Root entry, which "contains references to other objects defining the document’s contents, outline, article threads, named destinations, and other attributes". With the old-style generator, the index of the Root entry was found by consulting the file trailer, which was guaranteed to be at a particular position near the end of the file. With the new generator, this index is instead contained in the dictionary of a cross-reference stream, whose position is given as a byte offset at the end of the file.
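To make the pointer concrete, here is a minimal sketch of reading the byte offset stored after the final `startxref` keyword and checking whether it still lands on something that looks like a cross-reference section. The function names `read_startxref` and `points_at_xref` are hypothetical, not part of pdfparanoia, and the sketch assumes the whole file fits in memory and ends with a single `startxref ... %%EOF` marker.

```python
import re

# Matches the trailing "startxref <offset> %%EOF" marker of a PDF file.
STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def read_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    hits = list(STARTXREF.finditer(pdf_bytes))
    if not hits:
        raise ValueError("no startxref marker found")
    return int(hits[-1].group(1))


def points_at_xref(pdf_bytes: bytes) -> bool:
    """Check that the recorded offset lands on a classic 'xref' table
    or on an object header ('N G obj') such as a cross-reference stream."""
    offset = read_startxref(pdf_bytes)
    window = pdf_bytes[offset:offset + 32]
    return re.match(rb"xref\b|\d+\s+\d+\s+obj\b", window) is not None
```

Running `points_at_xref` on a file before and after scrubbing would be one way to confirm that the pointer is what went stale.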

When we remove watermarks, we change the length of objects within the file, which breaks that reference: the offset is no longer accurate. This stops the Root value from being retrieved. KABLAM!

We could solve this by determining the new location of the xref section after manipulating objects within pdfparanoia.eraser, and updating the recorded offset accordingly. I'll probably get around to this tomorrow.
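A rough sketch of that idea, working on raw bytes rather than through pdfparanoia's own code: after the edit, locate the object that declares `/Type /XRef`, then rewrite the number after the last `startxref`. The helper names (`find_xref_stream_offset`, `fix_startxref`) are hypothetical, and the sketch assumes a single cross-reference stream, no incremental updates, and that the regexes do not hit look-alike bytes inside binary stream data.

```python
import re

OBJ_HEADER = re.compile(rb"\d+\s+\d+\s+obj\b")          # e.g. "141 0 obj"
XREF_TYPE = re.compile(rb"/Type\s*/XRef\b")             # xref stream dictionary
STARTXREF = re.compile(rb"(startxref\s+)\d+(\s+%%EOF)")  # trailing pointer


def find_xref_stream_offset(pdf_bytes: bytes) -> int:
    """Byte offset of the object header that owns the last /Type /XRef."""
    type_hits = list(XREF_TYPE.finditer(pdf_bytes))
    if not type_hits:
        raise ValueError("no cross-reference stream found")
    marker = type_hits[-1].start()
    # The owning object is the last "N G obj" header before the marker.
    headers = [m.start() for m in OBJ_HEADER.finditer(pdf_bytes, 0, marker)]
    if not headers:
        raise ValueError("no object header precedes the /XRef dictionary")
    return headers[-1]


def fix_startxref(pdf_bytes: bytes) -> bytes:
    """Rewrite the final startxref value to the xref stream's current offset."""
    new_offset = find_xref_stream_offset(pdf_bytes)
    hits = list(STARTXREF.finditer(pdf_bytes))
    if not hits:
        raise ValueError("no startxref marker found")
    last = hits[-1]
    patched = last.group(1) + str(new_offset).encode("ascii") + last.group(2)
    return pdf_bytes[:last.start()] + patched + pdf_bytes[last.end():]
```

This only repairs the final pointer. The cross-reference section itself also stores byte offsets for individual objects, so those would need recomputing as well for any object that moved.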

fmap commented 10 years ago

Further errors, now. A sample:

error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: cannot find page -1 in page tree