kanzure / pdfparanoia

pdf watermark removal library for academic papers
https://pypi.python.org/pypi/pdfparanoia
Watermark detection (but not removal) #29

Open kanzure opened 11 years ago

kanzure commented 11 years ago

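Make a way to detect whether or not a document is likely to have a watermark. There are a few different ways of detection that I can imagine:

  • analyzing a pdf for text that looks like a watermark
  • render pdf to png then analyze the margins for blocks that probably have ip addresses, especially if these blocks are repeated on each page
  • when given a pdf and its source url, have a pre-seeded table of information about whether or not that specific publisher tends to add watermarks
  • given a pdf with no url, have some routines for detecting whether or not the paper was published by Elsevier, Springer, IEEE, or whoever, and then find that publisher in a lookup table to determine whether or not the pdf probably has a watermark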

Knowing that there is a watermark present is really helpful, because it means that you can track what percentage of your collection is watermarked. Other tools can make informed decisions about what to do with a paper if there is a known watermark.

Unknown watermarks are the worst, but there's no way to detect an unknown unknown.
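
For what it's worth, the simplest check (text that looks like a watermark and repeats across pages) could be sketched roughly as below. This is only an illustration, not pdfparanoia code: the function names, the regex patterns, and the `min_fraction` cutoff are all made up for the example, and it assumes the per-page text has already been extracted by some other tool.

```python
import re
from collections import Counter

# Hypothetical patterns: real publisher stamps vary, these are only guesses.
# Note the IP pattern is loose and will also match things like "1.2.3.4".
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PHRASE_RE = re.compile(r"downloaded (?:from|by)|authorized licensed use", re.IGNORECASE)


def watermark_like_lines(page_texts, min_fraction=0.8):
    """Return lines that look like watermarks and repeat on most pages.

    page_texts: list of strings, one per page (extraction happens elsewhere).
    min_fraction: share of pages a line must appear on to be reported.
    """
    if not page_texts:
        return []
    counts = Counter()
    for text in page_texts:
        # A set, so each candidate line is counted at most once per page.
        candidates = {
            line.strip()
            for line in text.splitlines()
            if IP_RE.search(line) or PHRASE_RE.search(line)
        }
        counts.update(candidates)
    needed = min_fraction * len(page_texts)
    return sorted(line for line, n in counts.items() if n >= needed)


def probably_watermarked(page_texts, min_fraction=0.8):
    """True if at least one watermark-like line repeats across pages."""
    return bool(watermark_like_lines(page_texts, min_fraction))
```

Running `probably_watermarked` over every document in a collection and averaging the results would give the "what fraction is watermarked" number, with the caveat that it only catches stamps the patterns happen to match.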

ghost commented 11 years ago

Another, half-baked thought: image-ify all pages, discard whitespace, and XOR each page against the front page or another reference standard to identify pixels that do not vary across pages. Whatever stays constant is either margin decoration or a common watermark.
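
A rough sketch of that XOR idea, assuming the pages have already been rendered to same-sized grayscale bitmaps (plain nested lists of 0-255 ints here, to keep it dependency-free). The function names and the `white` cutoff are hypothetical, and rendering the PDF is left to whatever tool you prefer.

```python
WHITE = 255  # assumed background value in the rendered grayscale pages


def invariant_mask(pages, white=WHITE):
    """Mark pixels that are non-white and identical on every page.

    pages: list of equally sized 2D lists of grayscale ints (0-255).
    The first page is used as the reference standard.
    Returns a 2D list of bools with the same shape as one page.
    """
    if not pages:
        return []
    reference = pages[0]
    # Start from the non-white pixels of the reference (discard whitespace).
    mask = [[pixel != white for pixel in row] for row in reference]
    for page in pages[1:]:
        for y, row in enumerate(page):
            for x, pixel in enumerate(row):
                # XOR is zero only when the two values are bit-identical.
                if pixel ^ reference[y][x]:
                    mask[y][x] = False
    return mask


def invariant_fraction(mask):
    """Fraction of the page area covered by invariant, non-white pixels."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0
```

A non-trivial `invariant_fraction` only says that something repeats on every page; it cannot tell a watermark from a running header on its own.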

Bryan Bishop notifications@github.com wrote:

Make a way to detect whether or not a document is likely to have a watermark. There are a few different ways of detection that I can imagine:

  • analyzing a pdf for text that looks like a watermark
  • render pdf to png then analyze the margins for blocks that probably have ip addresses, especially if these blocks are repeated on each page
  • when given a pdf and its source url, have a pre-seeded table of information about whether or not that specific publisher tends to add watermarks
  • given a pdf with no url, have some routines for detecting whether or not the paper was published by Elsevier, Springer, IEEE, or whoever, and then find that publisher in a lookup table to determine whether or not the pdf probably has a watermark


kanzure commented 11 years ago

Cool, but how do you get rid of those elements? You would have to randomly delete pdf elements until the resulting pngs didn't have those images. Might work. Also, this technique would accidentally remove journal titles in margins, which is bad, but okay if there is JSON metadata that is attached to the pdf somehow.

ghost commented 11 years ago

Unless you brute-force attempted to delete each individual element, I figure it's just a rapid filter to help detect watermarks. Of course, brute-force deletion might assist in creating a pdfparanoia profile for a new publisher, so perhaps the once-off inefficiency would prove worthwhile.

A straight XOR would only work if each watermark instance was binary-identical to the next. With any image compression this would likely fail, so perhaps a less stringent comparison would do: seek bytes/pixels that vary by less than a certain threshold, discarding X outliers based on page count?
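
Something like the following might do for the looser comparison. It is a sketch under assumptions: pages are same-sized grayscale bitmaps as nested lists of 0-255 ints, and `tolerance`, `max_outliers`, and `white` are hypothetical knobs with arbitrary defaults. Comparing against the per-pixel median rather than the front page is my own substitution, so that one odd page (a cover sheet, say) does not throw off the reference.

```python
from statistics import median


def stable_mask(pages, tolerance=8, max_outliers=1, white=255):
    """Mark pixels that stay nearly constant across pages.

    A pixel counts as stable when its value is within `tolerance` of the
    per-pixel median on all but at most `max_outliers` pages. Pixels whose
    median is close to white are ignored as background.
    """
    if not pages:
        return []
    height, width = len(pages[0]), len(pages[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [page[y][x] for page in pages]
            mid = median(values)
            # Pages where this pixel strays too far from the typical value.
            outliers = sum(1 for v in values if abs(v - mid) > tolerance)
            is_background = mid >= white - tolerance
            row.append(not is_background and outliers <= max_outliers)
        mask.append(row)
    return mask
```

To scale the outlier allowance with page count, `max_outliers` could be passed as something like `len(pages) // 10`.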
