microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing
14 stars 6 forks source link

"return the normalized absolute URL" for invalid URLs? #9

Open sknebel opened 7 years ago

sknebel commented 7 years ago

http://microformats.org/wiki/microformats2-parsing#parsing_a_u-_property

if there is a gotten value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs

I looked at this in the context of https://github.com/tommorris/mf2py/issues/79, which is a crash due to the attempt to normalize the invalid URL http://www.southside.de]. Obviously, crashing the parser is not good behavior. Feedback in IRC can be summarized as "if it is not a valid URL, just pass the raw value through". Given that further steps in the parsing allow for arbitrary values to be returned and the consumer thus has to be prepared to handle any of them anyways this seems acceptable, but I'd still like to see it clarified in the parsing documentation. (An alternative would be dropping the value entirely, but I'm not sure if this is not more surprising and as far as I know isn't done in any other case of mf2 parsing)

(Some background reading regarding URL parsing and normalization: RFC3986 - Uniform Resource Identifier (URI): Generic Syntax and the WHATWG URL Standard both clearly describe the URL as invalid. The WHATWG spec explicitly describes parsing to "return failure" for invalid URLs)

gRegorLove commented 7 years ago

I think the parsers shouldn't (or don't need to) validate URLs. From https://en.wikipedia.org/wiki/URL_normalization#Normalizations_that_preserve_semantics, I think converting the scheme and domain to lower case could be a nice-to-have. I don't think the other normalizations in that section are necessary, though. Thoughts?

sknebel commented 7 years ago

In the Python implementation, the Python stdlib functions used because they handle resolving relative URLs throw an exception when they encounter an invalid URL. It does not happen due to any extra normalization attempt (although it might be another issue to clarify what "normalization" means, or if it should be removed).

I'd side with the quoted advice: Invalid URLs can not receive any URL-specific handling and thus will be returned as-is. This could be added to the spec as something like (please suggest better wording!)

if there is a gotten value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs

  • if a valid URL can not be formed from the value, return the value as-is instead

In mf2py, this would then be implemented by catching any such exceptions and returning the original value instead.

tantek commented 6 years ago

Can you provide an actual markup example that demonstrates the question?

For absolute URL values, the parser should just pass the value through, no processing or validation required.

For relative URL values, presumably "document's language's rules for resolving relative URLs" already handles making sure the base URL is valid.

Thus resolving a relative URL value could at most require URL escaping, which we could consider adding. However I’d rather first see an actual example markup that demonstrates how you could pass an "invalid [relative] URL" to the mf2 parser before adding another processing requirement (URL escaping) to the spec. Otherwise I'd prefer to close this issue with no changes needed.

(Originally published at: http://tantek.com/2018/107/t1/)

sknebel commented 6 years ago

Copy-Paste from the page mentioned, which crashed (and actually still does) the mf2 parser:

<a href="http://www.southside.de]" rel="nofollow">http://www.southside.de]</a>

or if you prefer it in an mf2 thing:

<a href="http://www.southside.de]" class="h-card">http://www.southside.de]</a>

Normalization without further specification is not just applicable to relative URLs, but not clear for things that are not valid URLs.

Zegnat commented 6 years ago

After reminding myself about what this was about again, there actually seem to be two things mentioned in this one issue. These may need to be addressed separately:

  1. What does the mf2 spec consider to be “normalized”? RFC 3986 section 6.2 lists several normalisations. Those normalisations are not limited to relative URLs.

    • If the mf2 parser should “just pass the value through” for absolute URLs, as @tantek writes, it may be an idea to just drop the word “normalized”. Then the parsers only need to worry about resolving relative URLs to absolute ones.

    • If the mf2 parser should do normalisation (because some of it may be “nice-to-have”), the specification should specify which normalisations it expects a parser to do.

  2. What is the “absolute URL” value of something that isn’t a URL at all? This can be caused by typos, like in the example shown by @sknebel, or other author errors. The amount of errors may go up after resolving #10, as the number of cases where parsers get their URLs from places HTML does not expect them (textContent or data[value] are clear examples) may go up.

    Another possible source for more false-positives might come from pages using CSS frameworks with the u- class-prefix. These already show up, but after resolving #10 the chance the parsers must try to resolve these faulty values goes up.

    • If the mf2 parser should ignore errors it encounters when trying to parse and resolve URLs, strings that aren’t URLs at all can just be used as-is. Following the ”just pass the value through” policy.

    • If the mf2 parser is expected to stick to URL standards as soon as the u- switch gets flipped, it does not have any expected behaviour for non-valid URL values. If the standards say resolving something like :/woops! results in nothing at all, returning an empty string value may also be fine.

      We do not often default to empty string values in mf2 output though. dt- through vcp may be the only other place where we do this? Because that has some sort of built in validation as well, which may be the wanted behaviour for u- too.

My first thought is to drop normalisation references in the spec. Normalisation means something specific in the RFC for URLs, and I do not think mf2 parsers need to get into that. The same goes for non-URLs: rather than putting the onus on the mf2 parsers to do all sorts of URL parsing, lets just assume that an unrecognised URL is valid for whatever the author is intending.

Thus I would propose something like:

  • if the gotten value is a relative-URL string, then return the result of resolving it to an absolute URL following the containing document’s language’s rules (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).
  • else return the gotten value as is.

With this, parsers only need to recognise when a gotten value is something they can resolve. Which sounds like a much clearer expectation to me.

(The term “relative-URL string” is being borrowed from the WHATWG URL spec here.)

[Edited to mark the <base> tag as code, because GitHub was stripping it.]

snarfed commented 6 years ago

(fwiw mf2py no longer crashes on invalid URLs like these.)

jgarber623 commented 4 years ago

This issue came up in chat (via @gRegorLove) and is related to microformats/tests#112:

I don't think the only-domain URLs are supposed to be normalized adding the trailing slash like that.

@Zegnat noted in a reply:

It is hard to say what normalisation is supposed or not supposed to happen. I think we still do not define what a “normalized absolute URL” is, as the spec always calls it. I think it is not weird for libraries to require some value for path, which at minimum is the /. (Ie. it is not a “trailing slash”, it is “root path”.) I think the IndieAuth specification even specifically mentions adding the last /, because of this reason: cannot have an empty path component to a URL.

Links, Prior Art, etc.

Converting an empty path to a / path is listed under the Normalizations that preserve semantics page linked from @gRegorLove's comment above. That section on Wikipedia links to RFC 3986 section 6.2.3 (oh, hey…), Scheme-Based Normalization, with the note:

In general, a URI that uses the generic syntax for authority with an empty path should be normalized to a path of "/".

In the IndieAuth spec, Section 3.3 URL Canonicalization addresses normalization/canonicalization (emphasis added):

Since IndieAuth uses https/http URLs which fall under what [URL] calls "Special URLs", a string with no path component is not a valid [URL]. As such, if a URL with no path component is ever encountered, it MUST be treated as if it had the path /. For example, if a user enters https://example.com as their profile URL, the client MUST transform it to https://example.com/ when using it and comparing it.

JavaScript's native URL interface will normalize an empty path to /:

new URL('http://example.com').pathname
// returns "/"

Ruby's native URI class behaves similarly:

irb(main)> URI.parse('http://example.com').normalize
=> #<URI::HTTP http://example.com/>

The popular Addressable Ruby gem also behaves similarly:

irb(main)> Addressable::URI.parse('http://example.com').normalize
=> #<Addressable::URI URI:http://example.com/>

Those are the two languages I'm most familiar with so I'm curious to learn what other languages (Python, Go, etc.) do.

Questions

On the last question, my vote is to normalize empty paths to / based on the evidence cited above in the Links, Prior Art, etc. section of this comment.

Thanks for reading and considering this proposal!

Edit:

¹ Never make assumptions about standards, via @tantek

gRegorLove commented 4 years ago

An updated proposal:

  • if the gotten value is a relative-URL string, let url be the result of resolving it to an absolute URL following the containing document’s language’s rules (e.g. in HTML, use the current URL context as determined by the page, and first element, if any).
  • else let url be the gotten value as is
  • if the url has no path component, set it to /
  • return the url
jgarber623 commented 4 years ago

+1 for @gRegorLove’s proposal!

Zegnat commented 4 years ago

if the url has no path component, set it to /

I know I am always overly sensitive to spec language, but I would like to define what we mean by “path” if we are going to refer to parts of a URL. Too many different specifications have come and gone renaming parts of URLs. Probably best illustrated by @tantek back in 2011: URL parts as (re)named over the years.

I would also like to clarify that we will be foregoing any and all forms of normalisation with this proposal?

To get around issues as the original reason for this to have been filed (an errant ]) could we maybe add a fail-clause somewhere?

My over-the-top-specificity counter proposal, this would cover failure states as well as give us some nice-to-haves like lowercased schemes & hostnames, and non-empty paths:

  • Parse the gotten value following WHATWG’s URL specification’s URL parsing with the base URL value set in accordance to the containing document’s language’s rules (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).
    • If parsing results in validation errors or was impossible to complete: return the original gotten value as is.
    • Else if parsing succeeded: return the serialized parsing results following URL serializing from the same specification.

I would love to fall somewhere between what @gRegorLove and I are proposing if we can iterate on something that is a) easy to understand for implementers, and b) leads to a useful output for consumers.

(Edited 2020-05-02 to add lowercasing of hostnames, noticed this when testing the Live URL Viewer.)

Zegnat commented 4 years ago

Per https://github.com/microformats/tests/pull/112#pullrequestreview-404489613: should we split out normalisation to a different ticket and focus only on invalid URLs here?

It sounds like everyone’s preferred behaviour would be to pass on the value as-is if the parser for any reasons fails to find a “normalized absolute URL of the gotten value”. So the following change to the specification would clarify this:

  • Create the normalized absolute URL of the gotten value, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first element, if any).
    • If successful, return the normalized absolute URL,
    • else return the gotten value as-is.

This change is kept minimal as it is not meant to clarify what is meant by “normalized absolute URL”, it only clarifies that if a parser cannot obtain such a value (as the Python parser previously displayed when crashing on http://www.southside.de]) the raw string without any processing should be returned.

jgarber623 commented 4 years ago

@Zegnat:

should we split out normalisation to a different ticket and focus only on invalid URLs here?

I think opening a new issue is reasonable given the discussion happening over at microformats/test#112. "What is a normalized URL" is a distinct issue from handling invalid URLs. Apologies for conflating the two here.

So the following change to the specification would clarify this: […]

+1 approval for your specific change regarding handling of invalid URLs.

gRegorLove commented 4 years ago

A separate issue sounds good. The primary reason for my proposal was because https://github.com/microformats/tests/pull/112 appeared like it would be merged soon and I wanted the spec to be updated beforehand. I'm definitely in favor of an incremental spec change to that end. I don't have strong opinions about referencing WHATWG url parsing. That might make sense in the larger normalization conversation.

jalcine commented 2 years ago

Going to bump this as I'm helping implementing a Rust parser and this URL normalization is something I can work around but would like to know if I have to 😉

jalcine commented 2 years ago

For context, the Rust parser will add a trailing path (as per WHATWG conventions).

gRegorLove commented 2 years ago

I realized we never split off the separate issue for normalization, so I've done that with https://github.com/microformats/microformats2-parsing/issues/58.

For this issue of invalid URLs, I think @Zegnat's suggestion in https://github.com/microformats/microformats2-parsing/issues/9#issuecomment-622887901 is solid. Repeating it here with a small fix so the <base> shows up:

  • Create the normalized absolute URL of the gotten value, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).
    • If successful, return the normalized absolute URL,
    • else return the gotten value as-is.

I'm +1 for this and confirmed php-mf2 does this currently.

Can we revisit this, get some votes and any objections/feedback?

jgarber623 commented 2 years ago

I realized we never split off the separate issue for normalization, so I've done that with #58.

Thanks! 🙌🏻

Can we revisit this, get some votes and any objections/feedback?

For clarity's sake, can we get one or two examples of input and expected output? I think the proposed verbiage changes makes sense, but want to make sure we have some examples we can quickly spin into tests for the tests repo.