theZiz / aha

Ansi HTML Adapter

It would be nice if `aha` could render URLs as `a href` #20

Closed jpluimers closed 5 years ago

jpluimers commented 8 years ago

See the dump at https://gist.github.com/6d536c6ed8af20bcacb0d89077101f41

When you look at the rendered HTML at https://rawgit.com/jpluimers/6d536c6ed8af20bcacb0d89077101f41/raw/19bbaad542764ce5f99fce66ed313dc9caf2e834/testssl.sh.html, you see that the URLs in it are plain text rather than a href links.

It would be cool if they were links (but I understand this would make parsing more than a tad more complex).

theZiz commented 8 years ago

Hm, interesting idea. Unfortunately aha never looks more than one byte ahead. To create a link I would need to detect a whole "http://" or "https://" and emit the "<a href=" right before it. So it is not possible for now.

But you could add a sed command between your output and aha, or between aha and the redirect to an HTML file, which formats URLs like this:

./testssl.sh | sed 's,\(https\?://[^ ]*\),<a href="\1">,g' | aha > testssl.sh.html

theZiz commented 8 years ago

sed 's,\(https\?://[^ ]*\),<a href="\1">,gI' is slightly better as it also understands uppercase URLs like "HTTP://WWW.XKCD.COM/208"

jpluimers commented 8 years ago

I'm a regex noob, so bear with me.

The backslash escapes are to ensure the final regex executed is this, right?

(https?://[^ ]*),<a href="1">,gI

Guessing the 1 is a parameter expansion, shouldn't the regex become more like

(https?://[^ ]*),<a href="1">1</a>,gI

theZiz commented 8 years ago

Well, the ( has to be escaped with \ so that sed doesn't think we want to search for a literal "(https://)www.google.de". The same goes for ?, which makes the "s" optional. The \1 means that the matched string, in this case the URL, is pasted here. [^ ] is any character except a space. So all in all:

theZiz commented 8 years ago

Ah, NOW I got your question. Of course I forgot to add the </a> and the link text... So the complete, correct command would be: sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
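
As a quick sanity check, the corrected substitution can be tried on a single line. This is only a sketch: it assumes GNU sed (the \? operator and the I flag are GNU extensions), and the function name linkify_gnu_sed is made up for illustration.

```shell
# Hypothetical helper: wraps the substitution from the comment above.
# Assumes GNU sed; \? and the I flag are GNU extensions.
linkify_gnu_sed() {
  sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
}

# Try it on one sample line containing an uppercase URL.
printf 'see HTTP://WWW.XKCD.COM/208 now\n' | linkify_gnu_sed
# prints: see <a href="HTTP://WWW.XKCD.COM/208">HTTP://WWW.XKCD.COM/208</a> now
```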

jpluimers commented 8 years ago

@theZiz your solution put me on the right track, but it didn't fully solve the problem for a few reasons:

  1. aha would replace the generated < and > characters in the anchor element with &lt; and &gt;, so running sed before aha does not work
  2. after moving aha in front of sed, I found out that on Mac OS X the I option is not supported: you get the error bad flag in substitute command: 'I' when executing sed 's,\(https\?://[^ ]*\),<a href="\1">\1</a>,gI'
  3. after an initial port of the regular expression replacement to Perl, I found that it replaced too much (as it now operated on the HTML generated by aha), so even perl -C -Mutf8 -pe 's,([^"])((https?|s?ftp|ftps?|file)://[^\s]*),$1<a href="$2">$2</a>,gi' failed
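
For the second point, one way to stay within sed on both GNU and BSD/macOS might be to spell out the case-insensitivity with bracket expressions instead of the I flag. A minimal sketch, assuming a sed that accepts -E (extended regular expressions); the function name is hypothetical and it only covers http and https:

```shell
# Hypothetical helper: case-insensitive http/https matching without the
# GNU-only I flag. Assumes sed supports -E (extended regular expressions).
linkify_portable_sed() {
  sed -E 's,([Hh][Tt][Tt][Pp][Ss]?://[^ ]*),<a href="\1">\1</a>,g'
}

printf 'see HTTP://WWW.XKCD.COM/208 now\n' | linkify_portable_sed
# prints: see <a href="HTTP://WWW.XKCD.COM/208">HTTP://WWW.XKCD.COM/208</a> now
```

This only addresses the missing I flag; it still has to run after aha and shares the over-matching problem from the third point.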

To cut a long story short, here is a bash function that works and that you can pipe ANSI output through:

aha-with-expanded-http-https-urls()
{
  aha | perl -C -Mutf8 -pe 's,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),$1<a href="$2">$2</a>$4,gi'
}

It doesn't attempt RFC-compliant URI validation by regex, as that's way too convoluted. If anyone wants that, adapt it according to the answers at http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url

The biggest problem was to ensure it would skip the &quot; terminating a URI at the end of the line. This occurs in the testssl.sh output on a 302 redirect. So the solution is somewhat tailored to testssl.sh output piped through aha.

A lot of digging finally resulted in this expression at https://regex101.com/r/zF3zQ2/2. Note that the site forgets about the , as search separator, but that's OK: you can use the drop-down to choose another one, or paste this full expression and it will happily use the , separator:

s,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),$1<a href="$2">$2</a>$4,gi

On the way there, one of the things I tried was negative lookahead, but that failed. I tried following examples such as the one at http://stackoverflow.com/questions/11028336/regex-to-match-a-pattern-and-exclude-list-of-string

So in the above solution, I went for a non-greedy .*? expression followed by matching either whitespace or the &quot; followed by whitespace.

These are the separator, search and modifier part of the above expression:

,([^"])((https?|s?ftp|ftps?|file)://.*?)([\s]|\&quot;\s),gi

Note the 2nd capturing group cannot do without the 3rd in order to match multiple protocols.

This is how it's assembled:

For the replacement it's important to ensure all unique capturing groups end up in the output, which means you can skip $3 (as it's part of $2) but have to include the others.

Which gets me to the replacement part of the expression:

$1<a href="$2">$2</a>$4

Test input:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- This file was created with the aha Ansi HTML Adapter. http://ziz.delphigl.com/tool_aha.php -->
<html xmlns="http://www.w3.org/1999/xhtml">
    testssl.sh       2.7dev from https://testssl.sh/dev/
<span style="font-weight:bold;"> OCSP URI                     </span>http://clients1.google.com/ocsp
<span style="font-weight:bold;"> HTTP Status Code           </span>  302 Found, redirecting to &quot;https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg&quot;

Test output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- This file was created with the aha Ansi HTML Adapter. <a href="http://ziz.delphigl.com/tool_aha.php">http://ziz.delphigl.com/tool_aha.php</a> -->
<html xmlns="http://www.w3.org/1999/xhtml">
    testssl.sh       2.7dev from <a href="https://testssl.sh/dev/">https://testssl.sh/dev/</a>
<span style="font-weight:bold;"> OCSP URI                     </span><a href="http://clients1.google.com/ocsp">http://clients1.google.com/ocsp</a>
<span style="font-weight:bold;"> HTTP Status Code           </span>  302 Found, redirecting to &quot;<a href="https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg">https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg</a>&quot;

Test matches:

MATCH 1
1.  [168-169]   ` `
2.  [169-205]   `http://ziz.delphigl.com/tool_aha.php`
3.  [169-173]   `http`
4.  [205-206]   ` `
MATCH 2
1.  [286-287]   ` `
2.  [287-310]   `https://testssl.sh/dev/`
3.  [287-292]   `https`
4.  [310-311]   `
`
MATCH 3
1.  [379-380]   `>`
2.  [380-411]   `http://clients1.google.com/ocsp`
3.  [380-384]   `http`
4.  [411-412]   `
`
MATCH 4
1.  [512-513]   `;`
2.  [513-575]   `https://www.google.nl/?gfe_rd=cr&amp;ei=ZWjmV86hE5LH8AeFmaP4Bg`
3.  [513-518]   `https`
4.  [575-582]   `&quot;
`

Boy, that was a long comment (:

--jeroen

theZiz commented 8 years ago

Boy, that was a long comment :)

Thanks for the detailed clarification. I am very glad that it works so well for you in the end. :D

jpluimers commented 8 years ago

All the more reason to re-open this issue as a low-priority one: the &quot; handling is really awful. It relies on assumptions about how aha works, and I'm pretty sure there are cases where it won't work.

theZiz commented 8 years ago

I will have a look at this

theZiz commented 5 years ago

I didn't come up with a solution in the last 3 years, so I guess this can finally be closed. But thanks again for your nice regex work, jpluimers.