raviqqe / muffet

Fast website link checker in Go
MIT License

Ignore data:image URLs instead of printing them in Muffet's output #345

Closed brandon-rhodes closed 8 months ago

brandon-rhodes commented 8 months ago

Some of my blog posts and talk slides are generated from Jupyter notebooks, so they embed images inside of URLs instead of generating external files. A URL in the source of one such page looks like:

<img src="
AAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl01PW9//HXd+b7nWzsEsANBEtAK1ZEQQUuyCLgyiKb
QCb3/kDg2rp10etVqq3Vqpdz3aqiXmUCERAFwQXqijsqKgoWtCJiUQsBoZCFfNffH2OhtBJgGPJN
Zp6Pc3pOmEwyr9N3SV798JnPxwiCIBAAAACQ4SJhBwAAAADqAsUXAAAAWYHiCwAAgKxA8QUAAEBW
...
ywAAAABJRU5ErkJggg==
"
>

Each of these results in Muffet output that looks like:

        parse "...Jggg==": net/url: invalid control character in URL ...
...
li4uLi5Z4Iqli4uLSxa4Yuni4uKSBa5Yuri4uGSBK5YuLi4uWeCKpYuLi0sW/B9G21WYO+IUAAAA
AUlEQVQVReWZAwAAAABJRU5ErkJggg==

—thus including the entire base64 representation of the image twice, once without newlines and then once with newlines.

I tried adding the rule --exclude='^data:image' but it does not seem to have had any effect on Muffet's behavior.

Could Muffet be taught to ignore such URLs completely, preferably without even needing a new command-line option, since data: URLs never lead anywhere? Alternatively, it would be sufficient for the --exclude option to also skip the URL parsing step. Or Muffet could learn to accept URLs with newlines in them, which would also make the error message go away.
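To make the first option concrete, here is a minimal Python sketch of filtering out data: URLs before any parsing step runs. It is not Muffet's code (Muffet is written in Go), and the names is_data_url and urls_to_check, as well as the sample URLs, are made up for illustration.

```python
def is_data_url(url: str) -> bool:
    """Return True for data: URLs, which embed their content and link nowhere."""
    # URL schemes are case-insensitive, and an HTML attribute value may
    # carry leading whitespace, so normalize both before comparing.
    return url.lstrip().lower().startswith("data:")


def urls_to_check(urls):
    """Drop data: URLs so they never reach a parsing or fetching step."""
    return [url for url in urls if not is_data_url(url)]


# Hypothetical links as they might be collected from a page.
found = [
    "https://example.com/",
    "data:image/png;base64,AAAA\nBBBB",  # made-up payload with an embedded newline
    "DATA:text/plain,hello",
]
print(urls_to_check(found))
```

Because the check is a plain string-prefix test, it works even when the rest of the URL contains characters that a strict URL parser would reject.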

Thanks for Muffet! Apart from this, I have gotten great output from the tool, and it has been very useful in maintaining sites full of links of various ages and vintages.

brandon-rhodes commented 8 months ago

For anyone else with this problem, here is a quick Python 3 script that removes the data: URL error messages and content:

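#!/usr/bin/env python3
# Filter Muffet output, dropping the multi-line error reports for data: URLs.

import fileinput
import sys

lines = iter(fileinput.input())
for line in lines:
    # Pass through every line that does not start a data: URL parse error.
    if not line.startswith('    parse "data:'):
        sys.stdout.write(line)
        continue
    line = line.strip('\n')
    # The quoted URL shows each newline as a literal backslash-n.
    url = line.split('"')[1]
    url_lines = iter(url.split('\\n'))
    url_first_line = next(url_lines)
    assert line.endswith(url_first_line)
    # Consume the following lines, which repeat the URL with real newlines.
    for url_next_line in url_lines:
        line = next(lines).strip('\n')
        assert line == url_next_line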

It uses assert to make sure the lines it's ignoring from the Muffet output are precisely the ones it expects from the content of the data: URL.
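For anyone who would rather call the filter from other code, the same idea can be sketched as a generator. This variant makes the same assumptions about the output format as the script (a four-space indent, a quoted URL with newlines shown as backslash-n, and the URL's remaining lines following as raw text), but it counts the escaped newlines rather than asserting on each skipped line. The function name drop_data_url_errors and the sample lines are invented for illustration and are not real Muffet output.

```python
def drop_data_url_errors(lines):
    """Yield only the lines that are not part of a data: URL parse error."""
    lines = iter(lines)
    for line in lines:
        if not line.startswith('    parse "data:'):
            yield line
            continue
        quoted_url = line.rstrip('\n').split('"')[1]
        # Each escaped newline in the quoted URL corresponds to one
        # continuation line of raw URL text that follows the error line.
        for _ in range(quoted_url.count('\\n')):
            next(lines, None)


# Made-up sample in the shape the script above expects.
sample = [
    'https://example.com/page\n',
    '    parse "data:image/png;base64,AAAA\\nBBBB\\nCCCC": net/url: '
    'invalid control character in URL data:image/png;base64,AAAA\n',
    'BBBB\n',
    'CCCC\n',
    '    404 https://example.com/missing\n',
]
kept = list(drop_data_url_errors(sample))
print(''.join(kept), end='')
```

Dropping the assertions makes the filter more forgiving, at the cost of silently skipping the wrong lines if the output format ever differs from what is assumed here.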

raviqqe commented 8 months ago

The following HTML file works with python3 -m http.server.

<body>
  <a href="https://raviqqe.com" />
  <a href="" />
</body>

The error message says net/url: invalid control character in URL. Are there any unexpected characters in the URLs?

Also, actually in the example page, I don't see any data URLs. Can you provide one with them?

brandon-rhodes commented 8 months ago

Thanks for the quick response!

> The error message says net/url: invalid control character in URL. Are there any unexpected characters in the URLs?

My guess is that the newline character is the one that the error message is complaining about — despite its being a valid and acceptable character in base64 blocks of data.
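As a small illustration of why newlines are harmless inside base64 data itself, the following sketch uses Python's standard base64 module. It only demonstrates the base64 side of the argument; it says nothing about how Go's net/url treats the same characters. The payload is arbitrary sample bytes rather than a real image.

```python
import base64

payload = bytes(range(100))  # arbitrary sample bytes, not a real image

# encodebytes() wraps its output, inserting a newline after every 76 characters.
wrapped = base64.encodebytes(payload)
unwrapped = wrapped.replace(b"\n", b"")

# With the default validate=False, b64decode() discards characters outside
# the base64 alphabet, newlines included, so both forms decode identically.
assert base64.b64decode(wrapped) == payload
assert base64.b64decode(unwrapped) == payload
print(wrapped.count(b"\n"))
```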

> Also, actually in the example page, I don't see any data URLs. Can you provide one with them?

Drat, I must have confused which URL was in my clipboard. Apologies! I have updated the comment to link to this page of conference slides:

https://rhodesmill.org/brandon/slides/2018-08-pybay/slides.html

raviqqe commented 8 months ago

@brandon-rhodes Can you test v2.9.3?

brandon-rhodes commented 8 months ago

@raviqqe — Amazing!

$ ls -l OUT*
-rw-r--r-- 1 root    root    1448966 Oct 29 14:40 OUT-before
-rw-r--r-- 1 root    root      19680 Nov 25 15:40 OUT-after

The output of a run against my website is now only 1.4% of the size of the previous run, and every single line is real, informative data.
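The percentage can be checked from the two file sizes in the listing above; a quick sketch:

```python
before = 1448966  # size in bytes of OUT-before, from the ls listing
after = 19680     # size in bytes of OUT-after, from the ls listing

ratio = after / before
print(f"{ratio:.1%}")          # new output as a share of the old
print(round(before / after))   # how many times smaller the new output is
```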

Thanks very much for tackling my edge case, and for offering muffet in the first place!