warren-bank / remove-common-pdf-watermarks

Remove common PDF watermarks (with perl)
GNU General Public License v2.0

PDF-XChange Editor watermarks leave non-visible URL links in top left/right corners #4

Open warren-bank opened 6 years ago

warren-bank commented 6 years ago

notes:

workaround 1:

workaround 2:

mathikas commented 2 weeks ago

PDF-XChange has a free built-in feature that removes ALL links, without purchasing the Plus version. You can find it in the Home tab -> Links -> Remove all web-links.

warren-bank commented 2 weeks ago

@mathikas

mathikas commented 2 weeks ago

Yes, I use this script to remove the watermarks added by the free version of PDF-XChange Editor after using its premium features. The script works pretty well, but it leaves behind invisible links to the PDF-XChange website in the top left and right corners.

I saw your solutions above, but unfortunately, none of them worked for me. However, I discovered that after running the script to remove the visible watermark, I can use the built-in "Remove all web-links" feature (which doesn't require plus/premium, so it won't add another watermark) to remove all the remaining invisible links. This results in a very clean PDF, tested on the newest version.

The only downside is that it also removes any other links that may be present in the document. To address this, I manually remove the PDF-XChange website link using the "add/edit links" feature (also free), which can be time-consuming.

warren-bank commented 2 weeks ago

ohhh, ok.. now I understand your original comment. I wasn't sure how familiar you were with this repo.. what the script does.. etc. You're right then.. for a pdf that doesn't have any other web links, this feature would be a quick and easy solution. Thanks for sharing.

I'm amazed that this script still works on the watermarks added by the current release of the editor. I haven't updated in years, and would've expected that the watermarks had changed in some significant way after all that time that would prevent the script from being able to detect and remove them.

warren-bank commented 2 weeks ago

admittedly, using a 2nd tool to clean up after the 1st tool is janky.. but I have to say that PyMuPDF is powerful!

personally, I'm not a Python guy.. I have an old version handy, but it's too old to use to test PyMuPDF.. and I don't feel like updating.

that said, PyMuPDF has an online web console that works great.. and can be used without the need to install anything.

here is a script that I wrote:

"""
https://github.com/pymupdf/PyMuPDF
https://pymupdf.io/

https://pymupdf.readthedocs.io/en/latest/document.html
https://pymupdf.readthedocs.io/en/latest/document.html#Document.scrub
https://pymupdf.readthedocs.io/en/latest/document.html#Document.pages
https://pymupdf.readthedocs.io/en/latest/document.html#Document.save
https://pymupdf.readthedocs.io/en/latest/document.html#Document.tobytes
https://pymupdf.readthedocs.io/en/latest/page.html
https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_links
https://pymupdf.readthedocs.io/en/latest/page.html#Page.delete_link
https://pymupdf.readthedocs.io/en/latest/link.html
https://pymupdf.readthedocs.io/en/latest/link.html#Link.uri

https://pyodide.org/en/stable/usage/quickstart.html#accessing-javascript-scope-from-python
"""

watermark = "https://www.tracker-software.com/product/pdf-xchange-editor"
is_online_web_console = True
debug_log = True
do_scrub = False

if debug_log:
    print('All links before removal:')
    for page in doc.pages():
        print(f'Page: {page.number}')
        for link in page.get_links():
            if 'uri' in link:
                print(f'Link: {link.get("uri")}')

for page in doc.pages():
    for link in page.get_links():
        if 'uri' in link and link.get('uri') == watermark:
            page.delete_link(link)

if debug_log:
    print('All links after removal:')
    for page in doc.pages():
        print(f'Page: {page.number}')
        for link in page.get_links():
            if 'uri' in link:
                print(f'Link: {link.get("uri")}')

if do_scrub:
    doc.scrub(attached_files=False, clean_pages=False, embedded_files=False, hidden_text=True, javascript=True, metadata=True, redactions=False, redact_images=0, remove_links=False, reset_fields=False, reset_responses=False, thumbnails=True, xml_metadata=True)

if not is_online_web_console:
    doc.save('out.pdf', garbage=3, deflate=True)
else:
    import base64
    import js
    # named pdf_bytes to avoid shadowing the built-in `bytes` type
    pdf_bytes = doc.tobytes(garbage=3, deflate=True)
    data_uri = 'data:application/octet-stream;base64,' + base64.b64encode(pdf_bytes).decode('ascii')
    div = js.document.createElement('div')
    div.innerHTML = '<a href="' + data_uri + '" download="out.pdf">Download modified PDF file</a>'
    js.document.body.prepend(div)

to use it:

  1. open the online web console
  2. click the File Uploader button: Open
    • select the input PDF file path
  3. enter the content of the above python script into the PyMuPDF Terminal
  4. scroll to the top of the page
  5. click the newly prepended link: Download modified PDF file
    • select the output PDF file path
    • this file will no longer contain the invisible watermark links

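for anyone who does have a recent Python installed locally, the same link-removal logic could be sketched as a small standalone script. This is only a sketch and hasn't been run against a real PDF-XChange file: the names `remove_links_by_uri` and `main`, and the file names `in.pdf` / `out.pdf`, are placeholders for illustration, and it assumes a PyMuPDF release recent enough to be imported as `pymupdf` (older releases are imported as `fitz`).

```python
WATERMARK_URI = "https://www.tracker-software.com/product/pdf-xchange-editor"


def remove_links_by_uri(doc, uri=WATERMARK_URI):
    """Delete every link whose target URI equals `uri`; return how many were removed."""
    removed = 0
    for page in doc.pages():
        # same pattern as the console script: loop over the link dicts of each page
        for link in page.get_links():
            if link.get("uri") == uri:
                page.delete_link(link)
                removed += 1
    return removed


def main(path_in="in.pdf", path_out="out.pdf"):
    # imported inside main() so the helper above has no hard dependency on PyMuPDF
    import pymupdf

    doc = pymupdf.open(path_in)
    count = remove_links_by_uri(doc)
    doc.save(path_out, garbage=3, deflate=True)
    print(f"removed {count} watermark links")
```

calling `main()` would read `in.pdf` from the current directory and write `out.pdf`; links to any other URL are left untouched, which avoids the downside of the editor's "Remove all web-links" feature.
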
warren-bank commented 2 weeks ago

I went down a bit of a rabbit hole.. I was curious if PyMuPDF could also be used to remove the watermark images.

The short answer is: no.

The longer answer is...

for page in doc.pages():
    print(f'Page: {page.number}')
    for drawing in page.get_drawings():
        print(f'Drawing: {drawing}')

"""
Observations:
  * 8 drawings per page
    - 2 watermarks per page
    - 4 drawings per watermark
  * every page has the same 8 drawings,
    and all share the same sets of Rect(x, y, x, y) coordinates
  * each of the 8 Rect() coordinates seems to always be paired to the same unique sequence number
"""

watermark_coords = {
  "0": [-0.0010000000474974513, -0.0009918212890625, 73.87300109863281, 73.87300872802734],
  "1": [6.480000019073486, 6.480010986328125, 67.39199829101562, 67.39200592041016],
  "2": [13.607999801635742, 13.608009338378906, 60.263999938964844, 60.264007568359375],
  "3": [17.82699966430664, 17.827011108398438, 56.04499816894531, 56.04500961303711],
  "6": [521.402587890625, -0.0009918212890625, 595.2765502929688, 73.87300872802734],
  "7": [527.883544921875, 6.480010986328125, 588.7955322265625, 67.39200592041016],
  "8": [535.0115356445312, 13.608009338378906, 581.6675415039062, 60.264007568359375],
  "9": [539.2305908203125, 17.827011108398438, 577.4485473632812, 56.04500961303711]
}

found = 0

for page in doc.pages():
    print(f'Page: {page.number}')
    for drawing in page.get_drawings():
        if 'seqno' in drawing and 'rect' in drawing and str(drawing.get('seqno')) in watermark_coords:
            rect = drawing.get('rect')
            mark = watermark_coords.get(str(drawing.get('seqno')))
            if ([rect.x0, rect.y0, rect.x1, rect.y1] == mark):
                print(f'found watermark drawing: {drawing}')
                found += 1

print('')
print(f'found {found} watermark drawings')

"""
https://github.com/pymupdf/PyMuPDF/issues/847
  issue: there is no way to remove the found watermark drawings
"""
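
One caveat about the matching code above: it compares floating-point coordinates with `==`, which only succeeds when the values are exactly identical. A more forgiving variant could compare with a tolerance instead. This is just a sketch: the helper name `rect_matches` and the tolerance of `1e-3` are assumptions for illustration, not anything provided by PyMuPDF, and the rect is passed as a plain list so the snippet runs without the library.

```python
import math


def rect_matches(rect, mark, tol=1e-3):
    """True if both are 4-element sequences and each coordinate is within `tol`."""
    return len(rect) == len(mark) == 4 and all(
        math.isclose(a, b, abs_tol=tol, rel_tol=0.0) for a, b in zip(rect, mark)
    )


# expected coordinates for seqno "1", copied from watermark_coords above
mark = [6.480000019073486, 6.480010986328125, 67.39199829101562, 67.39200592041016]

print(rect_matches([6.48, 6.48001, 67.392, 67.392], mark))  # True
print(rect_matches([6.48, 6.48001, 67.392, 70.0], mark))    # False
```

In the loop above, the exact comparison would become `rect_matches([rect.x0, rect.y0, rect.x1, rect.y1], mark)`.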