Closed: brandon-rhodes closed this issue 8 months ago
For anyone else with this problem, here is a quick Python 3 script that removes the `data:` URL error messages and content:
```python
#!/usr/bin/env python3
# Filter Muffet output, dropping the error lines produced by data: URLs.
import fileinput
import sys

lines = iter(fileinput.input())
for line in lines:
    # Pass every line through untouched except the data: URL parse errors.
    if not line.startswith(' parse "data:'):
        sys.stdout.write(line)
        continue
    line = line.strip('\n')
    url = line.split('"')[1]
    # The quoted URL contains literal "\n" escapes; each one means the
    # URL continues on the next line of Muffet's output.
    url_lines = iter(url.split('\\n'))
    url_first_line = next(url_lines)
    assert line.endswith(url_first_line)
    for url_next_line in url_lines:
        # Consume and verify each continuation line so it is skipped too.
        line = next(lines).strip('\n')
        assert line == url_next_line
```
It uses `assert` to make sure the lines it's ignoring from the Muffet output are precisely the ones it expects from the content of the `data:` URL.
The following HTML file works with `python3 -m http.server`:
```html
<body>
  <a href="https://raviqqe.com" />
  <a href="data:image/png;base64,deadbeef" />
</body>
```
The error message says `net/url: invalid control character in URL`. Are there any unexpected characters in the URLs?
Also, actually in the example page, I don't see any data URLs. Can you provide one with them?
Thanks for the quick response!
> The error message says `net/url: invalid control character in URL`. Are there any unexpected characters in the URLs?
My guess is that the newline character is the one the error message is complaining about, despite its being a valid and acceptable character in base64 blocks of data.
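As a quick illustration of that guess (my own sketch, not something from the thread): Python's standard `base64` decoder simply ignores embedded newlines, which is why wrapped base64 payloads are legitimate data even when a strict URL parser rejects the raw control character.

```python
import base64

# Base64 payloads are often wrapped at 76 columns (as in MIME), so an
# embedded newline is perfectly normal inside a base64 block; the
# decoder ignores it and recovers the original bytes.
wrapped = "aGVs\nbG8="  # "hello", split across two lines
assert base64.b64decode(wrapped) == b"hello"
```

A URL parser, by contrast, may treat that same `\n` as an invalid control character, which matches the `net/url` error Muffet reports.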
> Also, actually in the example page, I don't see any data URLs. Can you provide one with them?
Drat, I must have confused which URL was in my clipboard. Apologies! I have updated the comment to link to this page of conference slides:
https://rhodesmill.org/brandon/slides/2018-08-pybay/slides.html
@brandon-rhodes Can you test v2.9.3?
@raviqqe — Amazing!
```
$ ls -l OUT*
-rw-r--r-- 1 root root 1448966 Oct 29 14:40 OUT-before
-rw-r--r-- 1 root root   19680 Nov 25 15:40 OUT-after
```
The output of a run against my website is now only 1.4% the size of the previous run, and every single line is real, informative data.
Thanks very much for tackling my edge case, and for offering muffet in the first place!
Some of my blog posts and talk slides are generated from Jupyter notebooks, so they embed images inside of URLs instead of generating external files. A URL in the source of one such page looks like:
Each of these results in Muffet output that looks like:
Thus the entire base64 representation of the image is included twice: once without newlines, and then once with newlines.
I tried adding the rule `--exclude='^data:image'` but it does not seem to have had any effect on Muffet's behavior.

Could Muffet be taught to ignore such URLs completely, preferably without even needing a new command-line option, since `data:` URLs never lead anywhere? Alternately, it would be sufficient for the `--exclude` option to also skip the URL parsing step. Alternatively, Muffet could learn to accept URLs with newlines in them, which would also make the error message go away.

Thanks for muffet! I have otherwise gotten great output from the tool, and it has been very useful in maintaining sites full of links of various ages and vintages.