Closed jpluimers closed 5 years ago
Hm, interesting idea. Unfortunately aha does not look more than one byte ahead. However, I would need to find a whole "http://" or "https://" and add the "<a href=" right before it. So it is not possible for now.
But you could add a sed command between your output and aha, or between aha and your piping to an HTML file, which formats URLs like this:
./testssl.sh | sed 's,\(https\?://[^ ]*\),<a href="\1">,g' | aha > testssl.sh.html
sed 's,\(https\?://[^ ]*\),<a href="\1">,gI'
is slightly better, as it also understands uppercase URLs like "HTTP://WWW.XKCD.COM/208".
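To see what this does, here is a quick check of the case-insensitive substitution on a single made-up sample line (note that the I flag is a GNU sed extension):

```shell
# Made-up sample line; the I flag (GNU sed) makes the match case-insensitive.
echo 'See HTTP://WWW.XKCD.COM/208 for details' \
  | sed 's,\(https\?://[^ ]*\),<a href="\1">,gI'
# -> See <a href="HTTP://WWW.XKCD.COM/208"> for details
```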
I'm a regex noob, so bear with me. The backslash escapes are to ensure the final regex executed is this, right?
(https?://[^ ]*),<a href="\1">,gI
Guessing the \1 is a parameter expansion, shouldn't the regex become more like
(https?://[^ ]*),<a href="\1">\1</a>,gI
Well, the ( has to be escaped with \ so that sed doesn't just think we want to search for (https://)www.google.de. Same for ?, which determines that the "s" is optional. The \1 means that the matched string should be pasted here, in that case the URL. [^ ] is every character except space. So all in all:
sed 's, → start replacing
\(…\) → make this matched string usable as a variable (later used as \1)
https\?://[^ ]* → match http or https followed by :// and an arbitrary number of characters which are not space ([^ ])
,<a href="\1"> → replace the found string with <a href=" followed by the found string (\1) followed by ">
,gI' → do this for every match and not only for the first one found (g), and do it case insensitive (I)
Ah, NOW I got your question. I of course forgot to add a </a> and the URL name... So the whole correct regex would be:
sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
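Run on a made-up sample line, the corrected substitution now produces a complete anchor:

```shell
# Made-up sample line; the URL is now wrapped in a full <a>…</a> element.
echo 'testssl.sh 2.7dev from https://testssl.sh/dev/' \
  | sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
# -> testssl.sh 2.7dev from <a href="https://testssl.sh/dev/">https://testssl.sh/dev/</a>
```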
@theZiz your solution put me on the right track, but it didn't fully solve the problem for a few reasons:
- aha would replace the generated < and > characters in the anchor element with &lt; and &gt;, so the regular expression would not work; hence aha has to run in front of sed
- I found out that on Mac OS X the I option is not supported: you will get bad flag in substitute command: 'I' when executing sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
- switching to perl, I found out it replaced too much (as it now operated on aha-generated HTML), which made even perl -C -Mutf8 -pe 's,([^"])((https?|s?ftp|ftps?|file)://[^\s]*),$1<a href="$2">$2</a>,gi' fail
To cut a long story short, here is a bash function that works and which you can pipe ANSI output through:
aha-with-expanded-http-https-urls()
{
  aha | perl -C -Mutf8 -pe 's,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),$1<a href="$2">$2</a>$4,gi'
}
It doesn't take into account RFC URI checking by regex as that's way too convoluted. If anyone wants that, adapt it according to the answers at http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url
The biggest problem was to ensure it would skip the &quot; terminating a URI at the end of the line. This can be in the testssl.sh output upon a 302 redirect. So the solution is somewhat tailored to testssl.sh output piped through aha.
A lot of digging finally resulted in this expression at https://regex101.com/r/zF3zQ2/2. Note that site forgets about the , search separators, but that's OK: you can use the drop-down to choose another one, or paste this full expression and it will happily use the , separator:
s,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),$1<a href="$2">$2</a>$4,gi
Getting there, one of the things I tried was a negative lookahead, but that failed. I tried following the example at for instance http://stackoverflow.com/questions/11028336/regex-to-match-a-pattern-and-exclude-list-of-string
So in the above solution, I went for a non-greedy .*? expression followed by matching either whitespace or the &quot; followed by whitespace.
These are the separator, search and modifier parts of the above expression:
,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),gi
Note the 2nd capturing group cannot do without the 3rd in order to match multiple protocols.
This is how it's assembled:
- 1st capturing group ([^"])
  - [^"] matches a single character not present in the list below
    - " a single character in the list " literally (case insensitive)
- 2nd capturing group ((https?|s?ftp|ftps?|file)://.*?)
  - 3rd capturing group (https?|s?ftp|ftps?|file)
    - 1st alternative: https?
      - http matches the characters http literally (case insensitive)
      - s? matches the character s literally (case insensitive); ? between zero and one time, as many times as possible, giving back as needed [greedy]
    - 2nd alternative: s?ftp
      - s? matches the character s literally (case insensitive); ? between zero and one time, as many times as possible, giving back as needed [greedy]
      - ftp matches the characters ftp literally (case insensitive)
    - 3rd alternative: ftps?
      - ftp matches the characters ftp literally (case insensitive)
      - s? matches the character s literally (case insensitive); ? between zero and one time, as many times as possible, giving back as needed [greedy]
    - 4th alternative: file
      - file matches the characters file literally (case insensitive)
  - :// matches the characters :// literally
  - .*? matches any character (except newline); *? between zero and unlimited times, as few times as possible, expanding as needed [lazy]
- 4th capturing group ([\s]|\&quot;\s)
  - 1st alternative: [\s]
    - \s matches any white space character [\r\n\t\f ]
  - 2nd alternative: \&quot;\s
    - \& matches the character & literally
    - quot; matches the characters quot; literally (case insensitive)
    - \s matches any white space character [\r\n\t\f ]
- g modifier: global. All matches (don't return on first match)
- i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
For the replacement it's important to ensure all unique capturing groups end up in the output, which means you can skip $3 (as it's part of $2) but have to include the others. Which gets me to the replacement part of the expression:
$1<a href="$2">$2</a>$4
Test input:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- This file was created with the aha Ansi HTML Adapter. http://ziz.delphigl.com/tool_aha.php -->
<html xmlns="http://www.w3.org/1999/xhtml">
testssl.sh 2.7dev from https://testssl.sh/dev/
<span style="font-weight:bold;"> OCSP URI </span>http://clients1.google.com/ocsp
<span style="font-weight:bold;"> HTTP Status Code </span> 302 Found, redirecting to &quot;https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg&quot;
Test output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- This file was created with the aha Ansi HTML Adapter. <a href="http://ziz.delphigl.com/tool_aha.php">http://ziz.delphigl.com/tool_aha.php</a> -->
<html xmlns="http://www.w3.org/1999/xhtml">
testssl.sh 2.7dev from <a href="https://testssl.sh/dev/">https://testssl.sh/dev/</a>
<span style="font-weight:bold;"> OCSP URI </span><a href="http://clients1.google.com/ocsp">http://clients1.google.com/ocsp</a>
<span style="font-weight:bold;"> HTTP Status Code </span> 302 Found, redirecting to &quot;<a href="https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg">https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg</a>&quot;
Test matches:
MATCH 1
1. [168-169] ` `
2. [169-205] `http://ziz.delphigl.com/tool_aha.php`
3. [169-173] `http`
4. [205-206] ` `
MATCH 2
1. [286-287] ` `
2. [287-310] `https://testssl.sh/dev/`
3. [287-292] `https`
4. [310-311] `
`
MATCH 3
1. [379-380] `>`
2. [380-411] `http://clients1.google.com/ocsp`
3. [380-384] `http`
4. [411-412] `
`
MATCH 4
1. [512-513] `;`
2. [513-575] `https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg`
3. [513-518] `https`
4. [575-582] `&quot;
`
Boy, that was a long comment (:
--jeroen
Boy, that was a long comment :)
Thanks for the detailed clarification, and I am very glad that in the end it works so well for you. :D
All the more reason to re-open this issue as a low-prio one, as the &quot; handling is really something very awful: it assumes how aha works, and I'm pretty sure there are cases where this handling won't work.
I will have a look at this
I didn't come up with a solution in the last 3 years, so I guess this can finally be closed. But thanks again for your nice regex work, jpluimers.
See the dump at https://gist.github.com/6d536c6ed8af20bcacb0d89077101f41
When you look at the rendered HTML at https://rawgit.com/jpluimers/6d536c6ed8af20bcacb0d89077101f41/raw/19bbaad542764ce5f99fce66ed313dc9caf2e834/testssl.sh.html, you see these URLs are not wrapped in a href tags. It would be cool if they were (but I understand this would make parsing more than a tad more complex).