microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing

Define "normalized absolute URL" #58

Open gRegorLove opened 2 years ago

gRegorLove commented 2 years ago

This issue is split from https://github.com/microformats/microformats2-parsing/issues/9 and is intended to focus only on the process of normalizing URLs when parsing u-*.

Current language:

return the normalized absolute URL of the gotten value, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).

One of the simplest things, which https://github.com/microformats/tests/pull/112 is waiting on, is whether to normalize an empty URL path component to "/". @jgarber623 detailed some specs and software that include this normalization, so I think this would be pretty agreeable among implementers.

@Zegnat raised the concern of defining what we mean by "path component" since parts of URLs have been renamed over the years. The IndieAuth spec includes a normative reference to WHATWG's URL standard and explains "path component" with a simple example instead of a spec definition:

As such, if a URL with no path component is ever encountered, it MUST be treated as if it had the path /. For example, if a user provides https://example.com for Discovery, the client MUST transform it to https://example.com/ when using it and comparing it.

So perhaps that would be sufficient for the microformats parsing spec, too?
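As a rough illustration of the empty-path rule, here is a minimal Python sketch. The function name `normalize_empty_path` is hypothetical, and limiting the rule to http and https URLs that have a host is an assumption taken from the IndieAuth example, not something the parsing spec currently says.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_empty_path(url: str) -> str:
    """Return url with an empty path replaced by "/" (http/https only).

    Assumption: only http(s) URLs with a host are touched; every other
    input is returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme in ("http", "https") and parts.netloc and parts.path == "":
        # Rebuild the URL with "/" as its path; query and fragment are kept.
        return urlunsplit(parts._replace(path="/"))
    return url


print(normalize_empty_path("https://example.com"))
print(normalize_empty_path("https://example.com?q=1"))
print(normalize_empty_path("https://example.com/already"))
```

URLs that already have a path, and URLs in other schemes, pass through this sketch untouched.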

RFC 3986 lists some additional normalizations that could be nice-to-have but I'm not sure if they are strictly necessary for parsers:

RFC 3986 also describes remove_dot_segments to normalize "." and ".." path segments. From a quick check, it appears at least php-mf2, mf2py, and Ruby parsers are all doing this, which makes sense since it's necessary to correctly handle <base href>.
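To show what dot-segment removal looks like while resolving a relative URL against a base, here is a small sketch using `urljoin` from Python's standard library. This is only the standard library's behavior, not necessarily how any of the parsers named above implement it, and the helper name `resolve` and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve(base_url: str, value: str) -> str:
    """Resolve a possibly relative u-* value against the document's base URL."""
    # urljoin merges the relative reference with the base path and removes
    # "." and ".." segments from the merged result.
    return urljoin(base_url, value)


# base_url stands in for the page URL or its <base href>, if present.
print(resolve("https://example.com/blog/2020/", "../photo.jpg"))
print(resolve("https://example.com/a/b/c", "./d"))
```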

Questions:

  1. Is the correct term for this process "normalization" or "canonicalization"?
  2. What are the simplest steps for this process such that it results in "a) easy for implementers to understand and b) leads to a useful output for consumers" (to quote @Zegnat :))

Zegnat commented 2 years ago

Is the correct term for this process "normalization" or "canonicalization"?

I strongly feel like it is normali[sz]ation. Just like how RFC 3986 refers to it. Canonicali[sz]ation to me refers to what rel-canonical is used for, matching the definition from Wikipedia:

A canonical URL is a URL for defining the single source of truth for duplicate content.

There is no way for a parser like the mf2 parser to figure out that value, since it only has the string to work on. (I would be very much opposed to requiring mf2 parsers to fetch resources, look for rel-canonicals, etc.)

jgarber623 commented 2 years ago

Is the correct term for this process "normalization" or "canonicalization"?

"Normali[sz]ation" for the reasons @Zegnat noted above.

What are the simplest steps for this process such that it results in "a) easy for implementers to understand and b) leads to a useful output for consumers"

Maybe something like:

A URL's "path" is defined here as zero or more characters immediately following the host (and optional port) continuing until the end of the URL or the first question mark ? or hash #, whichever comes first. If the gotten value is zero characters in length, the normalized path is /.

Zegnat commented 2 years ago

Are paths always / if not empty? Even for non-HTTP URLs? IndieAuth is able to short-cut this somewhat as all URLs (except redirect URLs in special cases) are Special URLs, that is, HTTP(S) URLs.
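To make the scheme concern concrete, this sketch prints how Python's `urlsplit` breaks down URLs from a few schemes. The helper name `describe` and the sample URLs are made up for illustration. The point is only that non-HTTP URLs can have a non-empty path with no leading slash and no host at all.

```python
from urllib.parse import urlsplit


def describe(url: str) -> tuple:
    """Return (scheme, host-and-port, path) as split by the standard library."""
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc, parts.path)


for sample in (
    "https://example.com",
    "mailto:user@example.com",
    "urn:isbn:0451450523",
):
    print(sample, "->", describe(sample))
```

Only the first sample has a host followed by an empty path, so a "replace empty path with /" rule would apply to it alone.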

gRegorLove commented 2 years ago

👍 on using "normalization."

And good catch -- we should differentiate schemes as part of the steps.

Loose ideas (not in spec language yet):

gRegorLove commented 2 years ago

While we're updating this section of text, I think we should include text to cover https://github.com/microformats/microformats2-parsing/issues/48#issuecomment-627094477 and https://github.com/microformats/php-mf2/issues/186.

snarfed commented 1 year ago

Is this the root cause of https://github.com/microformats/mf2py/issues/177#issuecomment-1404097654? i.e., is it undefined whether normalizing https://tantek.com/? should drop the trailing ? and result in https://tantek.com/ ?

gRegorLove commented 1 year ago

@snarfed I think that's a good question to clarify for this issue, but with php-mf2 I think it's more a side effect than an explicit choice.

The "Component Recomposition" section of RFC 3986 (section 5.3) seems to indicate that the "?" should be preserved, based on this pseudocode and the note that follows it:

      if defined(query) then
         append "?" to result;
         append query to result;
      endif;

Note that we are careful to preserve the distinction between a component that is undefined, meaning that its separator was not present in the reference, and a component that is empty, meaning that the separator was present and was immediately followed by the next component separator or the end of the reference.

https://tantek.com/? seems like it's the correct normalization in that case.
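Here is a sketch of that recomposition logic in Python. It splits with the regular expression from RFC 3986 Appendix B so that an undefined component (`None`) stays distinct from an empty one (`""`). The function names `split_uri` and `recompose` are hypothetical, and this is not how php-mf2 or mf2py implement it.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri: str) -> tuple:
    """Split into (scheme, authority, path, query, fragment).

    A component whose separator is absent is None; a component whose
    separator is present but followed by nothing is "".
    """
    m = URI_RE.match(uri)
    return (m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))


def recompose(scheme, authority, path, query, fragment) -> str:
    """Recompose components following the pseudocode in RFC 3986 section 5.3."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:  # defined, possibly empty: keep the "?"
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result


print(recompose(*split_uri("https://tantek.com/?")))
print(recompose(*split_uri("https://tantek.com/")))
```

With this approach the empty query in https://tantek.com/? survives the round trip, while https://tantek.com/ does not gain a "?".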