PDF-XChange Editor watermarks leave non-visible URL links in top left/right corners

warren-bank commented 6 years ago

notes:

this occurs when removing the watermark from a document with compatibility level >= PDF 1.5
- below this level:
  - links are in clear text and easily filtered
- at or above this level:
  - links are zlib compressed within FlateDecode objects
  - the search pattern removes the visible watermark, but leaves behind non-visible URL links
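
To illustrate why a plain-text search misses the compressed links, here is a minimal Python sketch. The annotation bytes are a hypothetical stand-in, not copied from a real PDF, and the "files" are only fragments built for the demonstration.

```python
import zlib

# Hypothetical stand-in for the two corner link annotations.
url = b"https://www.tracker-software.com/product/pdf-xchange-editor"
annotation = b"<< /Type /Annot /Subtype /Link /A << /S /URI /URI (" + url + b") >> >>\n"
links = annotation * 2  # one link per top corner

# Below PDF 1.5: the annotations sit in the file as clear text.
plain_file = b"%PDF-1.4\n1 0 obj\n" + links + b"endobj\n"

# PDF 1.5 and up: the same bytes can live inside a FlateDecode stream.
compressed = zlib.compress(links)
packed_file = (
    b"%PDF-1.5\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)

found_in_plain = url in plain_file                   # byte search succeeds
found_in_packed = url in packed_file                 # byte search fails
found_after_inflate = url in zlib.decompress(compressed)

print(found_in_plain, found_in_packed, found_after_inflate)
```

The URL only becomes searchable again once the stream is inflated, which is what the --unzip option in workaround 1 is for.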

workaround 1:

try using the command-line option: --unzip
- bash:
  ./filter.sh --unzip
- Windows:
  call filter.bat --unzip
open out.pdf in a PDF Viewer
- confirm that the non-visible URL links are gone
- confirm that everything else is intact
  - filtered objects (zlib compressed or otherwise) can contain more document elements than you may wish to remove
- resave document to clean up removed/missing objects
  - File > Save Copy As..
  - File > Close

workaround 2:

open the pdf document in a text editor
change the first line to: %PDF-1.4
save
open the pdf document in PDF-XChange Editor
- File > Save as Optimized:
  - Make Compatible With: Version 1.4
  - No Downsampling
  - Compression: Retain existing
- File > Close All
rename pdf document to: in.pdf
run: filter.pl
open out.pdf in a PDF Viewer
- confirm that the non-visible URL links are gone
- confirm that everything else is intact
  - forcefully lowering the compatibility level can cause some features to break
- resave document to clean up removed/missing objects
  - File > Save Copy As..
  - File > Close
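
The header edit at the start of workaround 2 can also be sketched in Python, which avoids a text editor re-encoding the binary parts of the file. This is only a sketch: the function name downgrade_header and the sample bytes are made up for illustration, and it covers only the "change the first line" step, not the resave in PDF-XChange Editor.

```python
import re


def downgrade_header(pdf_bytes: bytes, version: bytes = b"1.4") -> bytes:
    """Return a copy of pdf_bytes whose first line declares the given version."""
    match = re.match(rb"%PDF-\d\.\d", pdf_bytes)
    if match is None:
        raise ValueError("missing %PDF-x.y header")
    # The old and new version strings are both three bytes long, so the
    # file length (and every byte offset after the header) is unchanged.
    return b"%PDF-" + version + pdf_bytes[match.end():]


# Tiny made-up fragment standing in for the start of a real document.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< >>\nendobj\n"
patched = downgrade_header(sample)
print(patched[:8])
```

For a real file, the bytes would be read and written in binary mode ('rb' / 'wb') so that nothing besides the header changes.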

mathikas commented 2 weeks ago

PDF-XChange has a free built-in feature that removes ALL links, with no need to purchase the Plus version. You can find it in the tab Home -> Links -> Remove all web-links.

warren-bank commented 2 weeks ago

@mathikas

off-hand, I'm not sure whether you're referring to PDF-XChange Viewer or PDF-XChange Editor
this perl script is intended for use with the latter.. the editor
- I'm personally still using version 7.0.324.. simply because I haven't bothered to update
- this freeware editor allows full use of all premium features, but when the final document is saved.. it adds a watermark to all 4 corners of every page
- this perl script is used as a filter to remove these watermarks
- editor saves file1.pdf with watermarks
- perl script reads file1.pdf (renamed to in.pdf) and outputs out.pdf without watermarks
- typically, I would then open out.pdf in the (aforementioned) viewer program, and resave it as file2.pdf
  - this extra step just produces a cleaner file without any format warnings or errors
- at this point, file2.pdf is the final output.. and all other files (file1.pdf, in.pdf, out.pdf) can be deleted
I wrote it ages ago.. but I still use it whenever I have a pdf that needs editing

mathikas commented 2 weeks ago

Yes, I use this script to remove the watermarks added by the free version of PDF-XChange Editor after using its premium features. The script works pretty well, but it leaves behind invisible links to the PDF-XChange website in the top left and right corners.

I saw your solutions above, but unfortunately, none of them worked for me. However, I discovered that after running the script to remove the visible watermark, I can use the built-in "Remove all web-links" feature (which doesn't require plus/premium, so it won't add another watermark) to remove all the remaining invisible links. This results in a very clean PDF, tested on the newest version.

The only downside is that it also removes any other links that may be present in the document. To address this, I manually remove the PDF-XChange website link using "add/edit links" feature (also free), which can be time-consuming.

warren-bank commented 2 weeks ago

ohhh, ok.. now I understand your original comment. I wasn't sure how familiar you were with this repo.. what the script does.. etc. You're right then.. for a pdf that doesn't have any other web links, this feature would be a quick and easy solution. Thanks for sharing.

I'm amazed that this script still works on the watermarks added by the current release of the editor. I haven't updated in years, and would've expected the watermarks to have changed enough in all that time to prevent the script from detecting and removing them.

warren-bank commented 2 weeks ago

admittedly, using a 2nd tool to clean up after the 1st tool is janky.. but PyMuPDF is powerful!

personally, I'm not a Python guy.. I have an old version handy, but it's too old to use to test PyMuPDF.. and I don't feel like updating.

that said, PyMuPDF has an online web console that works great.. and can be used without the need to install anything.

here is a script that I wrote:

"""
https://github.com/pymupdf/PyMuPDF
https://pymupdf.io/

https://pymupdf.readthedocs.io/en/latest/document.html
https://pymupdf.readthedocs.io/en/latest/document.html#Document.scrub
https://pymupdf.readthedocs.io/en/latest/document.html#Document.pages
https://pymupdf.readthedocs.io/en/latest/document.html#Document.save
https://pymupdf.readthedocs.io/en/latest/document.html#Document.tobytes
https://pymupdf.readthedocs.io/en/latest/page.html
https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_links
https://pymupdf.readthedocs.io/en/latest/page.html#Page.delete_link
https://pymupdf.readthedocs.io/en/latest/link.html
https://pymupdf.readthedocs.io/en/latest/link.html#Link.uri

https://pyodide.org/en/stable/usage/quickstart.html#accessing-javascript-scope-from-python
"""

watermark = "https://www.tracker-software.com/product/pdf-xchange-editor"
is_online_web_console = True
debug_log = True
do_scrub = False

if debug_log:
    print('All links before removal:')
    for page in doc.pages():
        print(f'Page: {page.number}')
        for link in page.get_links():
            if 'uri' in link:
                print(f'Link: {link.get("uri")}')

for page in doc.pages():
    for link in page.get_links():
        if 'uri' in link and link.get('uri') == watermark:
            page.delete_link(link)

if debug_log:
    print('All links after removal:')
    for page in doc.pages():
        print(f'Page: {page.number}')
        for link in page.get_links():
            if 'uri' in link:
                print(f'Link: {link.get("uri")}')

if do_scrub:
    doc.scrub(
        attached_files=False,
        clean_pages=False,
        embedded_files=False,
        hidden_text=True,
        javascript=True,
        metadata=True,
        redactions=False,
        redact_images=0,
        remove_links=False,
        reset_fields=False,
        reset_responses=False,
        thumbnails=True,
        xml_metadata=True,
    )

if not is_online_web_console:
    doc.save('out.pdf', garbage=3, deflate=True)
else:
    import base64
    import js
    # named pdf_bytes to avoid shadowing the built-in bytes type
    pdf_bytes = doc.tobytes(garbage=3, deflate=True)
    data_uri = 'data:application/octet-stream;base64,' + base64.b64encode(pdf_bytes).decode('ascii')
    div = js.document.createElement('div')
    div.innerHTML = '<a href="' + data_uri + '" download="out.pdf">Download modified PDF file</a>'
    js.document.body.prepend(div)

open the online web console
click the File Uploader button: Open
- select the input PDF file path
enter into the PyMuPDF Terminal the content of the above python script
scroll to the top of the page
click the newly prepended link: Download modified PDF file
- select the output PDF file path
- this file will no longer contain the invisible watermark links
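
The script above matches one exact URL. If a different editor release links to another path on the same site, a host-based check might be more forgiving. The sketch below is an assumption, not something tested against other releases: the helper name is_watermark_link is made up, and it works on plain dicts shaped like the entries that page.get_links() returns.

```python
from urllib.parse import urlparse

# Assumption: every watermark link points at this host.
WATERMARK_HOST = "www.tracker-software.com"


def is_watermark_link(link: dict, host: str = WATERMARK_HOST) -> bool:
    """True when the link dict carries a URI whose host equals `host`."""
    uri = link.get("uri")
    if not uri:
        return False  # internal jumps and other non-URI links are kept
    # urlparse(...).hostname is lowercased, so the comparison ignores case.
    return urlparse(uri).hostname == host


sample_links = [
    {"uri": "https://www.tracker-software.com/product/pdf-xchange-editor"},
    {"uri": "https://example.com/manual"},
    {"kind": 1, "page": 3},
]
to_delete = [link for link in sample_links if is_watermark_link(link)]
print(len(to_delete))
```

In the removal loop, the condition would become `if is_watermark_link(link):`. The trade-off is that a genuine document link to the same host would be removed too.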

warren-bank commented 2 weeks ago

I went down a bit of a rabbit hole.. I was curious if PyMuPDF could also be used to remove the watermark images.

The short answer is: no.

The longer answer is...

for page in doc.pages():
    print(f'Page: {page.number}')
    for drawing in page.get_drawings():
        print(f'Drawing: {drawing}')

"""
Observations:
  * 8 drawings per page
    - 2 watermarks per page
    - 4 drawings per watermark
  * every page has the same 8 drawings,
    and all share the same sets of Rect(x, y, x, y) coordinates
  * each of the 8 Rect() coordinates seems to always be paired to the same unique sequence number
"""

watermark_coords = {
  "0": [-0.0010000000474974513, -0.0009918212890625, 73.87300109863281, 73.87300872802734],
  "1": [6.480000019073486, 6.480010986328125, 67.39199829101562, 67.39200592041016],
  "2": [13.607999801635742, 13.608009338378906, 60.263999938964844, 60.264007568359375],
  "3": [17.82699966430664, 17.827011108398438, 56.04499816894531, 56.04500961303711],
  "6": [521.402587890625, -0.0009918212890625, 595.2765502929688, 73.87300872802734],
  "7": [527.883544921875, 6.480010986328125, 588.7955322265625, 67.39200592041016],
  "8": [535.0115356445312, 13.608009338378906, 581.6675415039062, 60.264007568359375],
  "9": [539.2305908203125, 17.827011108398438, 577.4485473632812, 56.04500961303711]
}

found = 0

for page in doc.pages():
    print(f'Page: {page.number}')
    for drawing in page.get_drawings():
        if 'seqno' in drawing and 'rect' in drawing and str(drawing.get('seqno')) in watermark_coords:
            rect = drawing.get('rect')
            mark = watermark_coords.get(str(drawing.get('seqno')))
            if ([rect.x0, rect.y0, rect.x1, rect.y1] == mark):
                print(f'found watermark drawing: {drawing}')
                found += 1

print('')
print(f'found {found} watermark drawings')

"""
https://github.com/pymupdf/PyMuPDF/issues/847
  issue: there is no way to remove the found watermark drawings
"""
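
One caveat on the detection loop above: it compares floats with ==, which works when the coordinates come straight from the same kind of file, but could miss a match if a resave shifts them slightly. A tolerance-based comparison is sketched below. The helper name rect_matches and the 0.001 point tolerance are arbitrary choices for illustration.

```python
import math


def rect_matches(coords, expected, tol=1e-3):
    """True when every coordinate is within tol points of the expected value."""
    return len(coords) == len(expected) and all(
        math.isclose(a, b, abs_tol=tol) for a, b in zip(coords, expected)
    )


# Entry "1" from watermark_coords above.
expected = [6.480000019073486, 6.480010986328125, 67.39199829101562, 67.39200592041016]

close_enough = rect_matches([6.48, 6.48, 67.392, 67.392], expected)
different = rect_matches([6.48, 6.48, 60.264, 60.264], expected)
print(close_enough, different)
```

In the loop, `[rect.x0, rect.y0, rect.x1, rect.y1] == mark` would then become `rect_matches([rect.x0, rect.y0, rect.x1, rect.y1], mark)`.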

warren-bank / remove-common-pdf-watermarks
