It would be nice if `aha` could render URLs as `a href`

jpluimers commented 8 years ago

See the dump at https://gist.github.com/6d536c6ed8af20bcacb0d89077101f41

When you look at the rendered html https://rawgit.com/jpluimers/6d536c6ed8af20bcacb0d89077101f41/raw/19bbaad542764ce5f99fce66ed313dc9caf2e834/testssl.sh.html, you see these URLs are not a href tags:

https://testssl.sh/dev/
https://testssl.sh/bugs/
Certificate Revocation List http://crl.startssl.com/sca-server1.crl
OCSP URI http://ocsp.startssl.com
https://censys.io/ipv4?q=A82B73476774DDD484D14E98466C5E36C804BFD8F3C24D7BD60FD75AF3ECF707

It would be cool if they were (though I understand this would make parsing more than a tad more complex).

theZiz commented 8 years ago

Hm, interesting idea. Unfortunately aha does not look more than one byte ahead, whereas I would need to find a whole "http://" or "https://" and add the "<a href=" right before it. So it is not possible for now.

But you could add a sed command between your output and aha, or between aha and your redirect to an HTML file, which formats URLs like this: ./testssl.sh | sed 's,\(https\?://[^ ]*\),<a href="\1">,g' | aha > testssl.sh.html

theZiz commented 8 years ago

sed 's,\(https\?://[^ ]*\),<a href="\1">,gI' is slightly better as it also understands uppercase URLs like "HTTP://WWW.XKCD.COM/208"

jpluimers commented 8 years ago

I'm a regex noob, so bear with me.

The backslash escapes are to ensure the final regex executed is this, right?

(https?://[^ ]*),<a href="1">,gI

Guessing the 1 is a parameter expansion, shouldn't the regex become more like

(https?://[^ ]*),<a href="1">1</a>,gI

theZiz commented 8 years ago

Well, the ( has to be escaped with \ so that sed doesn't think we want to search for a literal (https://)www.google.de. The same goes for ?, which makes the "s" optional. The \1 means that the matched string should be pasted here, in this case the URL. [^ ] is any character except a space. So all in all:

sed 's, → start replacing
\(…\) → Make this matched string usable as a variable (later used as \1)
https\?://[^ ]* → Match http or https followed by :// and an arbitrary number of characters that are not a space ([^ ])
,<a href="\1"> → Replace found string with <a href=" followed by the found string (\1) followed by ">
,gI' → Do this for every match and not only for the first one found (g) and do it case insensitive (I)

theZiz commented 8 years ago

Ah, NOW I got your question. I of course forgot to add a </a> and the URL as link text... So the whole correct regex would be: sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
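
A minimal sketch of that corrected command on two sample lines, one taken from the report above and one built around the uppercase URL mentioned earlier. It assumes GNU sed, since both `\?` and the `I` flag are GNU extensions; the function name `linkify` is made up for this example.

```shell
# linkify: hypothetical helper wrapping the corrected sed command.
# Assumes GNU sed (\? and the I flag are GNU extensions).
linkify() {
  sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
}

# \(...\) captures the URL; \1 pastes it back twice:
# once as the href target and once as the link text.
printf '%s\n' \
  'OCSP URI http://ocsp.startssl.com' \
  'see HTTP://WWW.XKCD.COM/208 for details' | linkify
```

Each URL should come back wrapped in an anchor, with the surrounding text untouched.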

jpluimers commented 8 years ago

@theZiz your solution put me on the right track, but it didn't fully solve the problem for a few reasons:

- aha would replace the generated < and > characters in the anchor element with &lt; and &gt;, so the regular expression approach would not work with sed in front of aha
- after moving aha in front of sed I found out that on Mac OS X the I option is not supported: you get "bad flag in substitute command: 'I'" when executing sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
- after an initial port of the regular expression replacement to perl I found out it replaced too much (as it now operated on aha generated html), which made even perl -C -Mutf8 -pe 's,([^"])((https?|s?ftp|ftps?|file)://[^\s]*),$1<a href="$2">$2</a>,gi' fail

To cut a long story short, here is a bash function that works and you can pipe Ansi output through:

aha-with-expanded-http-https-urls()
{
  aha | perl -C -Mutf8 -pe 's,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),$1<a href="$2">$2</a>$4,gi'
}

It doesn't take into account RFC URI checking by regex as that's way too convoluted. If anyone wants that, adapt it according to the answers at http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url

The biggest problem was to ensure it would skip the &quot; terminating a URI at the end of the line. This can occur in the testssl.sh output upon a 302-redirect. So the solution is somewhat tailored to testssl.sh output piped through aha.

A lot of digging finally resulted in the expression at https://regex101.com/r/zF3zQ2/2 (shown below). Note that the site forgets about the , as search separators, but that's OK: you can use the drop-down to choose another one or paste this full expression and it will happily use the , separator:

s,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),$1<a href="$2">$2</a>$4,gi

Getting there, one of the things I tried was negative lookahead but that failed. I tried following the example at for instance http://stackoverflow.com/questions/11028336/regex-to-match-a-pattern-and-exclude-list-of-string

So in the above solution, I went for a non-greedy .*? expression followed by matching either whitespace or &quot; followed by whitespace.

These are the separator, search and modifier part of the above expression:

,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),gi

Note the 2nd capturing group cannot do without the 3rd in order to match multiple protocols.

This is how it's assembled:

1st Capturing group ([^"])
- [^"] match a single character not present in the list below
  - " a single character in the list " literally (case insensitive)
2nd Capturing group ((https?|s?ftp|ftps?|file)://.*?)
- 3rd Capturing group (https?|s?ftp|ftps?|file)
  - 1st Alternative: https?
    - http matches the characters http literally (case insensitive)
    - s? matches the character s literally (case insensitive)
      - Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
  - 2nd Alternative: s?ftp
    - s? matches the character s literally (case insensitive)
      - Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
    - ftp matches the characters ftp literally (case insensitive)
  - 3rd Alternative: ftps?
    - ftp matches the characters ftp literally (case insensitive)
    - s? matches the character s literally (case insensitive)
      - Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
  - 4th Alternative: file
    - file matches the characters file literally (case insensitive)
- :// matches the characters :// literally
- .*? matches any character (except newline)
  - Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
4th Capturing group ([\s]|\&quot;\s)
- 1st Alternative: [\s]
  - [\s] match a single character present in the list below
    - \s match any white space character [\r\n\t\f ]
- 2nd Alternative: \&quot;\s
  - \& matches the character & literally
  - quot; matches the characters quot; literally (case insensitive)
  - \s match any white space character [\r\n\t\f ]
g modifier: global. All matches (don't return on first match)
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])

For replacement it's important to ensure all unique capturing groups end up in the output. Which means you can skip $3 (as it's part of $2) but have to include the others.

Which gets me to the replacement part of the expression:

$1<a href="$2">$2</a>$4

Test input:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- This file was created with the aha Ansi HTML Adapter. http://ziz.delphigl.com/tool_aha.php -->
<html xmlns="http://www.w3.org/1999/xhtml">
    testssl.sh       2.7dev from https://testssl.sh/dev/
<span style="font-weight:bold;"> OCSP URI                     </span>http://clients1.google.com/ocsp
<span style="font-weight:bold;"> HTTP Status Code           </span>  302 Found, redirecting to &quot;https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg&quot;

Test output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- This file was created with the aha Ansi HTML Adapter. <a href="http://ziz.delphigl.com/tool_aha.php">http://ziz.delphigl.com/tool_aha.php</a> -->
<html xmlns="http://www.w3.org/1999/xhtml">
    testssl.sh       2.7dev from <a href="https://testssl.sh/dev/">https://testssl.sh/dev/</a>
<span style="font-weight:bold;"> OCSP URI                     </span><a href="http://clients1.google.com/ocsp">http://clients1.google.com/ocsp</a>
<span style="font-weight:bold;"> HTTP Status Code           </span>  302 Found, redirecting to &quot;<a href="https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg">https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg</a>&quot;

Test matches:

MATCH 1
1.  [168-169]   ` `
2.  [169-205]   `http://ziz.delphigl.com/tool_aha.php`
3.  [169-173]   `http`
4.  [205-206]   ` `
MATCH 2
1.  [286-287]   ` `
2.  [287-310]   `https://testssl.sh/dev/`
3.  [287-292]   `https`
4.  [310-311]   `
`
MATCH 3
1.  [379-380]   `>`
2.  [380-411]   `http://clients1.google.com/ocsp`
3.  [380-384]   `http`
4.  [411-412]   `
`
MATCH 4
1.  [512-513]   `;`
2.  [513-575]   `https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg`
3.  [513-518]   `https`
4.  [575-582]   `&quot;
`

Boy, that was a long comment (:

--jeroen

theZiz commented 8 years ago

Boy, that was a long comment :)

Thanks for the detailed clarification and I am very glad, that it works in the end so well for you. :D

jpluimers commented 8 years ago

All the more reason to re-open this issue as a low-prio one, as the &quot; handling is really something very awful: it assumes how AHA works and I'm pretty sure there are cases where this handling won't work.

theZiz commented 8 years ago

I will have a look at this

theZiz commented 5 years ago

I didn't come up with a solution in the last 3 years, so I guess this can finally be closed. But thanks again for your nice regex work, jpluimers.
