htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

tidy-html5 generates invalid HTML5 when href attributes contain illegal characters #352

Closed ncouture closed 7 years ago

ncouture commented 8 years ago

When tidy-html5 is asked to fix URIs using the fix-uri option (which is enabled by default), invalid characters in href attributes remain in the output of tidy-html5.

case in point (fix-uri documentation):

fix-uri

Type: Boolean
Default: yes
Example: y/n, yes/no, t/f, true/false, 1/0

This option specifies if Tidy should check attribute values that carry URIs for illegal characters and if such are found, escape them as HTML4 recommends.

example and reproduction steps

 $ echo '<meta charset="utf-8"><title>xyz<a href=":|">invalid HTML5</a>' | tidy --doct --tidy-mark no --fix-uri yes - 2> /dev/null 

output

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>xyz</title>
</head>
<body>
<a href=":|">invalid HTML5</a>
</body>
</html>
$ echo $(!!) | html5check -h

output:

Error: Bad value “:|” for attribute “href” on element “a”: Illegal character in path segment: not a URL code point.
From line 1, column 88; to line 1, column 100
There were errors. (Tried in the text/html mode.)

html5check: https://about.validator.nu/html5check.py

geoffmcl commented 8 years ago

@ncouture thanks for reporting...

Only had time for a brief test tonight, but can confirm that current tidy seems to ignore those bad chars... and indeed so does tidy circa 2009, even back to tidy 2000...

Are we missing something in configuration here? Do we need to enable something else, aside from fix-uri which is on by default?

Tidy has a CheckUrl (CH_URI) service, but it does not seem to be called on your sample; even if it were, I cannot see where it would flag those characters as bad.

I am tentatively marking it as a bug, and will explore more...

Maybe others have comments, ideas, patches, PRs... and maybe some W3C references on valid URI characters... thanks...

ncouture commented 8 years ago

To answer your question I have two comments and one question:

Why are characters like "pipe" classified as invalid for use in href attributes?

geoffmcl commented 8 years ago

@ncouture thanks for the further comment...

First, I have no idea why the W3C validator classifies the |, "pipe", character as invalid in the href attribute, nor why it then points to a unicode reference, nfc-faq?

And as indicated, even if the existing internal tidy service CheckUrl were being called, it would only escape characters matching if ((c > 0x7e) || (c <= 0x20) || (strchr("<>", c))), so it too would not escape the pipe character. I note the validator also gripes about < and > in hrefs...
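
For illustration, a minimal sketch of that test (the helper name is hypothetical; this is not the actual tidy source) shows why the pipe slips through:

    #include <string.h>

    /* Sketch of the escape test quoted above: only non-ASCII bytes,
       controls and space, and the two angle brackets are escaped, so
       '|' (0x7C) passes through untouched. */
    static int needs_escape(unsigned int c)
    {
        return (c > 0x7e) || (c <= 0x20) || strchr("<>", (int)c) != NULL;
    }
    /* needs_escape('|') == 0, needs_escape('<') == 1 */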

Accordingly, for now I am removing the bug classification. Not that it is not a 'bug', but since it seems tidy has never done this escaping of a pipe char, it is a feature request. And further, it seems clear we should clean up the documentation on exactly what the fix-uri option does do, so adding docs.

While, as expressed many times, Tidy is not a validator per se, it should in most cases be able to report the same warnings/errors as the validator, and where possible Tidy should try to help the user fix the document. So, in general, a tidied document should pass W3C validation.

I did read that an href can contain leading and/or trailing spaces that need not be escaped. See #345 for some discussion of this.

I will keep searching and reading, but I seek other W3C references to clear up what the valid URL characters are. Comments, patches, and PRs are very welcome to solve this, so adding Help Needed.

zcorpan commented 8 years ago

I would recommend implementing, or using an implementation of, https://url.spec.whatwg.org/ to fix this properly, since characters need different treatment in different parts of a URL. And some strings don't resolve at all, e.g. http://[, where it's unclear what Tidy should do (maybe just leave it alone and log an error).
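
To see why treatment is position-dependent, consider this toy C snippet (an illustration only, not a real URL parser): the same ? byte is a delimiter at the end of the path but plain data inside the query.

    #include <stdio.h>
    #include <string.h>

    /* Toy illustration: the first '?' ends the path and starts the query,
       but a second '?' inside the query is ordinary data
       (RFC 3986: query = *( pchar / "/" / "?" )). */
    int main(void)
    {
        const char *url = "http://example.com/a?b?c#d";
        const char *q = strchr(url, '?');  /* delimiter: start of query */
        const char *f = strchr(url, '#');  /* delimiter: start of fragment */
        printf("before query: %.*s\n", (int)(q - url), url);
        printf("query       : %.*s\n", (int)(f - q - 1), q + 1);  /* prints "b?c" */
        printf("fragment    : %s\n", f + 1);
        return 0;
    }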

geoffmcl commented 8 years ago

Note, per #378, this URL escaping includes accented characters...

geoffmcl commented 8 years ago

@ncouture, @zcorpan thanks for the comments, but since no one has stepped forward to do this, and I am still unclear on what Tidy should do, I am moving the milestone to 5.3.

We are considering a release 5.2 shortly, and it seems unlikely that we will have an agreed, implemented, and tested solution before then...

geoffmcl commented 7 years ago

Since we are about to release 5.4, and nothing more has been done on this to date, moving it to next 5.5...

As mentioned, I am still unclear on what is needed here... what is correct... so I would appreciate comments, patches or PRs to take this forward... thanks...

zmwangx commented 7 years ago

Chiming in late: | (pipe) isn't a valid URI character per RFC 3986. Just read the ABNF (I quote the productions related to the path segment):

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters

   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   query         = *( pchar / "/" / "?" )

   fragment      = *( pchar / "/" / "?" )

   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

RFC 3987 extends unreserved to iunreserved:


   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

but | is still invalid.
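
The quoted productions boil down to a small predicate; a minimal C sketch (helper names are illustrative, and pct-encoded sequences would be handled separately by a caller) makes it easy to check that | matches nothing:

    #include <ctype.h>
    #include <string.h>

    /* pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
       (pct-encoded is a three-character sequence, left to the caller).
       is_pchar('|') == 0: the pipe appears in no production. */
    static int is_unreserved(int c)
    {
        return isalnum((unsigned char)c) || (c != '\0' && strchr("-._~", c) != NULL);
    }

    static int is_sub_delim(int c)
    {
        return c != '\0' && strchr("!$&'()*+,;=", c) != NULL;
    }

    static int is_pchar(int c)
    {
        return is_unreserved(c) || is_sub_delim(c) || c == ':' || c == '@';
    }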


I also looked into the source code of the W3C validator: https://github.com/validator/validator. The URI validation is done by https://github.com/validator/galimatias. In particular, the validity of code points is checked by isURLCodePoint, which I quote in full below:

    public static boolean isURLCodePoint(final int c) {
        return
                isASCIIAlphanumeric(c) ||
                        c == '!' ||
                        c == '$' ||
                        c == '&' ||
                        c == '\'' ||
                        c == '(' ||
                        c == ')' ||
                        c == '*' ||
                        c == '+' ||
                        c == ',' ||
                        c == '-' ||
                        c == '.' ||
                        c == '/' ||
                        c == ':' ||
                        c == ';' ||
                        c == '=' ||
                        c == '?' ||
                        c == '@' ||
                        c == '_' ||
                        c == '~' ||
                        (c >= 0x00A0 && c <= 0xD7FF) ||
                        (c >= 0xE000 && c <= 0xFDCF) ||
                        (c >= 0xFDF0 && c <= 0xFFEF) ||
                        (c >= 0x10000 && c <= 0x1FFFD) ||
                        (c >= 0x20000 && c <= 0x2FFFD) ||
                        (c >= 0x30000 && c <= 0x3FFFD) ||
                        (c >= 0x40000 && c <= 0x4FFFD) ||
                        (c >= 0x50000 && c <= 0x5FFFD) ||
                        (c >= 0x60000 && c <= 0x6FFFD) ||
                        (c >= 0x70000 && c <= 0x7FFFD) ||
                        (c >= 0x80000 && c <= 0x8FFFD) ||
                        (c >= 0x90000 && c <= 0x9FFFD) ||
                        (c >= 0xA0000 && c <= 0xAFFFD) ||
                        (c >= 0xB0000 && c <= 0xBFFFD) ||
                        (c >= 0xC0000 && c <= 0xCFFFD) ||
                        (c >= 0xD0000 && c <= 0xDFFFD) ||
                        (c >= 0xE0000 && c <= 0xEFFFD) ||
                        (c >= 0xF0000 && c <= 0xFFFFD) ||
                        (c >= 0x100000 && c <= 0x10FFFD);
    }
zcorpan commented 7 years ago

That code you quoted is an implementation of https://url.spec.whatwg.org/#url-code-points

My recommendation above still stands. :-)

balthisar commented 7 years ago

I'm finally looking into this, but...

Tidy has a CheckUrl (CH_URI) service, but it does not seem to be called on your sample; even if it were, I cannot see where it would flag those characters as bad.

@geoffmcl, have you done any work in here recently? When testing this, I'm getting sent into CheckUrl ultimately via AttributeChecks.

Just trying to make sure I'm not misinterpreting too much of the above before trying to implement the desired checking.

balthisar commented 7 years ago

I've added an issue_352 branch that does fairly well in adding a warning when illegal characters are detected. Some testing would be nice, as I foresee some possibility of issues with complicated URLs. Have a look at the fd773121752 commit message for possible additional help needed.

balthisar commented 7 years ago

Oops, meant to ping @zcorpan, @zmwangx, @ncouture: testing this branch with complicated URLs would be appreciated, folks.

balthisar commented 7 years ago

I should also add, after significant testing, that I might consider making this an error (instead of a warning), or adding a new option to choose mute/warning/error, because in principle Tidy shouldn't generate invalid code.

geoffmcl commented 7 years ago

@balthisar the first problem seems to be getting a valid sample to test...

Maybe it is due to the cut-and-paste, or a difference in code page, or something, but what should the value be in the first <a href="€">, line 8? And this symbol is repeated twice more in the document.

You have added a <meta charset="utf-8">, but when I cut that sample into a file, it is not UTF-8! The byte value is 0x80. In the last URI you do use the correct U+20AC code point, <a href="%E2%82%AC">, line 20.

So of course Tidy, any Tidy, will complain: line 8 column 11 - Warning: replacing invalid UTF-8 bytes (char. code U+0080)

Then Tidy will just generically warn - line 12 column 3 - Warning: <a> illegal characters found in URI - I think it would be good if tidy could report at least the first code point it found invalid. In this case I think it is the [ char, but it would be helpful to know... if possible... maybe too difficult...
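
A hypothetical sketch of that idea, ASCII-only for brevity (tidy itself would decode multi-byte sequences with TY_(GetUTF8) and route the result through its message system; the helper names here are invented):

    #include <stdio.h>
    #include <string.h>

    /* Simplified ASCII subset of the URL-code-point test. */
    static int is_url_code_point(unsigned int c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
               (c >= '0' && c <= '9') ||
               strchr("!$&'()*+,-./:;=?@_~", (int)c) != NULL;
    }

    /* Scan a URI and name the first offending code point in the warning. */
    static void report_first_bad(const char *uri)
    {
        const char *p;
        for (p = uri; *p; ++p)
        {
            unsigned int c = (unsigned char)*p;
            if (!is_url_code_point(c))
            {
                printf("illegal character U+%04X in URI \"%s\"\n", c, uri);
                return;
            }
        }
    }

    int main(void)
    {
        report_first_bad(":|");       /* reports U+007C, the pipe */
        report_first_bad("http://["); /* reports U+005B, the bracket */
        return 0;
    }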

Now, I expected a warning - line 11 column ? - Warning: <a> illegal characters found in URI - for one of the samples discussed, namely <a href=":|">, but did not get one; maybe I did something wrong here with the lines, and need to check this more exactly in the debugger...

But is there a way to add a UTF-8 Euro-sign sample... maybe by drag and drop... otherwise I would appreciate it if you could, say, zip it and email it directly...

Certainly agree that Tidy should not generate invalid HTML... ever...

Would really like to help in closing this... and will test, report, comment, etc... as reported, it has existed back to 2000! Thanks...

geoffmcl commented 7 years ago

@balthisar just got into running a debug session, and hit my first little problem...

But first I did find the warning I expected... I had miscounted the lines... I expected and got line 12 column 3... so that is fine...

In Windows, the service isalnum(c) will trigger an assert if passed 0xFFFD - i.e. the internal value after getting 0x80. From the MSDN site: "The behavior of isalnum and _isalnum_l is undefined if c is not EOF or in the range 0 through 0xFF, inclusive." Ugh!

But it was quite simple, maybe ugly, to write a work-around, which would work for Unix/Mac too, if we want to get rid of the #ifdef _WIN32 switch...

diff --git a/src/attrs.c b/src/attrs.c
index cbcb6a9..4f1c449 100644
--- a/src/attrs.c
+++ b/src/attrs.c
@@ -1475,13 +1475,21 @@ static void CheckLowerCaseAttrValue( TidyDocImpl* doc, Node *node, AttVal *attva
 }

 /* methods for checking value of a specific attribute */
+#ifdef _WIN32
+#define ISUPPER(a) ((a >= 'A') && (a <= 'Z'))
+#define ISLOWER(a) ((a >= 'a') && (a <= 'z'))
+#define ISNUMERIC(a) ((a >= '0') && (a <= '9'))
+#define ISALNUM(a) (ISUPPER(a) || ISLOWER(a) || ISNUMERIC(a))
+#else
+#define ISALNUM(a)  isalnum(a)
+#endif

 static Bool IsURLCodePoint( ctmbstr p, uint *increment )
 {
     uint c;
     *increment = TY_(GetUTF8)( p, &c ) + 1;

-    return isalnum( c ) ||
+    return ISALNUM( c ) ||
         c == '%' ||    /* not a valid codepoint, but an escape sequence */
         c == '#' ||    /* not a valid codepoint, but a delimiter */
         c == '!' ||
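
A possible refinement, not part of the patch above: parenthesize the macro arguments so the work-around stays safe if it is ever passed an expression, and perhaps use it on all platforms, since the portable range test costs nothing.

    /* Hypothetical hardened variant of the same macros. */
    #define ISUPPER(a)   (((a) >= 'A') && ((a) <= 'Z'))
    #define ISLOWER(a)   (((a) >= 'a') && ((a) <= 'z'))
    #define ISNUMERIC(a) (((a) >= '0') && ((a) <= '9'))
    #define ISALNUM(a)   (ISUPPER(a) || ISLOWER(a) || ISNUMERIC(a))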

Could push this to the branch, for one platform or for all... i.e. no switch... What do you think?

As stated, I would love the message <a> illegal characters found in URI to actually show at least the first illegal character, but could not see an easy solution yet... so will keep searching...

Still trying to work on a UTF-8 URL test... but otherwise it is all looking good...

Wow, OK, found a little free hex editor (for Windows - http://www.techsupportalert.com/best-free-hex-editor.htm - hxd.exe) and was able to manually change the 3 instances of 0x80 to the correct UTF-8 bytes - 0xE2 0x82 0xAC - and everything seemed to work fine in initial testing... i.e. no noise about invalid character encoding...

In default mode, the 3 UTF-8 bytes are escaped and I get the following messages (excluding the two expected illegal-character warnings) -

line 8 column 2 - Warning: <a> escaping malformed URI reference
line 10 column 3 - Warning: <a> escaping malformed URI reference

Then if I added the option --fix-uri no, I got -

line 8 column 2 - Warning: <a> improperly escaped URI reference
line 10 column 3 - Warning: <a> improperly escaped URI reference

As usual, pushed in_352-4.html to my repo, with correct UTF-8 encoding...

And as expected, the W3C validator only barked about the invalid <a href=":|"> and <a href="http://[">... fully accepting the two unescaped UTF-8 URIs...

Of course it also barked about your <IMG SRC=..blah..>, but that has nothing to do with this, and I wonder why it was included in your sample? Must get around to removing it...

Now to ponder about this a little more... maybe sleep on it... but this seems very good... thanks...

balthisar commented 7 years ago

@geoffmcl, thanks for taking a look. Yeah, the Euro symbol is a problem with different code pages. I should have pointed out that my source is UTF-8, and I chose that symbol just so that I'd definitely have a three-byte character for testing. Strictly speaking, I should probably add a two-byte character, too.
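
For reference, a few ready-made byte sequences for crafting such UTF-8 test input (standard Unicode encodings; the Euro bytes are the ones quoted above, and the variable names are just illustrative):

    /* U+00E9 -> 2 bytes, U+20AC (Euro) -> 3 bytes, U+1F600 -> 4 bytes */
    static const char two_byte[]   = "\xC3\xA9";         /* é */
    static const char three_byte[] = "\xE2\x82\xAC";     /* € */
    static const char four_byte[]  = "\xF0\x9F\x98\x80"; /* 😀 */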

One thing that I hesitated about for this fix was definitely the character encoding, which a library usually handles for me. I know that at some point in Tidy, we're 100% UTF-8 internally, so I think that if the file is read with the correct encoding, then my logic works universally. Please correct me if I'm wrong (and where, exactly, is our document text stabilized in any case?).

Yeah, I'm glad you noticed the <a href="%E2%82%AC">, which is what I originally used.

In any case, all of the UTF-8 stuff is in the sample case because I obviously can't count on a char being a whole character, so maybe it distracts from the case that we're correcting. If I'd not put any multi-byte UTF-8 in there, I would have naively thought everything was okay with just the samples in this issue! They're also tangentially related to #378, which we might address at the same time as this issue.

I'm fine with the macro, or else we can drop ISALNUM completely and just use the comparisons from the macro. It might be convenient to have the macro in tidyplatform.h so that it's available elsewhere should we need it, too.

The img thing was copied from one of the initially failing test cases (because of the #, which strictly speaking isn't a legal code point, but a delimiter). Anything with a # should be okay.

As stated, would love that the message <a> illegal characters found in URI actually showed, at least the first, illegal character, but could not see an easy solution yet...

Agreed. There's no simple solution. Maybe in the short term we leave it. As a continuation of my previous work on the messaging system, I plan to eventually make things more flexible, such as being able to pass arbitrary information to the message routines. I thought I'd try to get to some of our oldest bugs first!

zcorpan commented 7 years ago

https://github.com/w3c/web-platform-tests/blob/master/url/urltestdata.json has some good tests. The ones with "failure": true mean the input should fail to parse as a URL (so all should get a warning or error, though it might not be possible to do any fixup).

geoffmcl commented 7 years ago

@zcorpan wow thanks for the giant json list... it contains some 470+ samples...

In a perl script, I was able to decode the json and generate a list to test... of course, without reading all the specifications on what this list means, it seems that if the "input": "http://f:21/ b ? d # e " produces an "href": "http://f:21/%20b%20?%20d%20# e", it is valid... that is, it does not have "failure": true... But that specific case puzzles me: why have they not escaped the final # e? Tidy will, so we get output ending in #%20e, which seems right... I think...

And as @balthisar wrote somewhere, without getting deeply into parsing each component of the href, Tidy has no way to fail 375:i: <a href="http://10000000000">failed</a><br>, as they do. And of course it follows that they pass 376:h: <a href="http://10000000000.com/">valid</a><br>. The leading number here is just the approximate order in the json list...

And you can see more of that deep, deep parsing with 469:h: <a href="non-special://[1:2::3]:80/">valid</a> being valid, while 470:i: <a href="non-special://[:80/">failed</a> failed... presumably because it could not correctly parse a "host"... there are so many more like that...

FWIW, I have added my generated list as in_352-5.html to my repo so others can get into testing and understanding...

So I think what @balthisar has done here is a reasonable start, but really should libtidy get deep into href parsing? I think not, at least not at this stage...

@balthisar, I did not quite understand what you are saying about UTF-8 characters, also as in #378, but from the validator tests it seems they are valid, escaped or unescaped... I need to read and understand more on this...

And I do tend to agree that if a knowledgeable user adds --fix-uri no, maybe we should not output some warnings... but again, need to investigate more...

I will push my macro soonest, and experiment with the list a little more, in the hope we can fix and/or accept some more hrefs... thanks...

zcorpan commented 7 years ago

The tests without "failure": true may or may not be valid; these tests are intended to test conformance for parsers, not for validators or fixup-ers. (But it would be possible to add a new field to carry data on validation error positions.)

Technically it's not necessary to parse a URL into a URL record to determine if it's valid. https://url.spec.whatwg.org/#url-writing has rules for what constitutes a valid URL -- "URL code points", which the above PR kinda follows, is but one part of it.

geoffmcl commented 7 years ago

@zcorpan thanks for the clarification on what that json list represents, especially that tests without failure may or may not be valid...

And thanks for the link to the url-writing rules, but wow, there is a lot to take in there... I still need to read it again, and again, and all its links ;=)) It would be a tricky business to validate and fix up everything possible...

@balthisar I eventually pushed the macro as I had first written it, after reading that isalnum(c) does not seem to have the same problem on Unix... but anyway it's there now if we do need it later, or for something else...

At this point I can see no problem merging #547 as is... this is an improvement... we can take up further issues as they arise... thanks...

balthisar commented 7 years ago

Thanks, I'll try to merge it tonight, and possibly add a test case with some more meaningful tests to testbase.

balthisar commented 7 years ago

Oops, should have closed this after merging #547.