Open willnorris opened 6 years ago
The lack of the <base> element in the parsed result, as well as the fact that some parsers already do this, makes me lean towards adding it to the spec as well.
I agree. My first reaction was to keep the HTML as close to the original as possible (including relative URLs), but the embedding use-case won me over! The relative URLs cannot stand on their own in the HTML fragment, as we would have lost meaning. Resolving them makes a lot of sense!
Because these issues tend to grow stale if there is no concrete proposal, here is one.
I propose replacing this line in the parsing specification for e-*:
html: the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm, with leading/trailing whitespace removed.
With:
html: the results of running the HTML fragment serialization algorithm from the HTML specification on the element, with:
- relative URLs in HTML attributes that take URLs as values replaced with matching normalized absolute URLs, following the containing document’s language’s rules for resolving relative URLs;
- leading/trailing whitespace removed;
The attributes table in the HTML specification is one of the few resources that tells us where URLs are, without leaving parser implementers guessing about which strings in the HTML fragment constitute relative URLs. This came out of a discussion with regard to the Webmention spec (w3c/webmention#91).
The “… following the containing document’s language’s rules …” phrasing is taken straight from other places where we resolve URLs.
I have kept the “with” phrasing because I am not sure how to best sum up the actions that need to be taken by the parser. Most implementers will probably first resolve the URLs, then run the serialisation algorithm, and then remove whitespace. That’s a slightly different order from what this proposal suggests. Let me know if there are better ways to phrase this.
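To make the order of operations concrete, here is a rough sketch in Python of what "resolve, serialise, strip" could look like. It is not taken from any existing parser. The function name `resolve_e_html`, the `URL_ATTRS` set, and the example URLs are made up for illustration, and the re-serialisation below is a simple stand-in built on the standard library's `html.parser`, not the HTML specification's fragment serialization algorithm. The attribute set is only a subset of the attributes table and ignores which element an attribute sits on; multi-URL attributes such as `srcset` are not handled.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a small, illustrative subset of URL-valued attributes.
# The attributes table in the HTML specification is the complete source.
URL_ATTRS = {"href", "src", "cite", "action", "poster", "formaction"}


class _Resolver(HTMLParser):
    """Re-emits the fragment while rewriting URL-valued attributes."""

    def __init__(self, base_url):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _start(self, tag, attrs):
        parts = []
        for name, value in attrs:
            if value is None:  # boolean attribute, e.g. <video controls>
                parts.append(f" {name}")
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value.strip())
            parts.append(f' {name}="{escape(value, quote=True)}"')
        return f"<{tag}{''.join(parts)}>"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._start(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # <img .../> is emitted as a plain start tag, without </img>.
        self.out.append(self._start(tag, attrs))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def resolve_e_html(fragment, base_url):
    """Resolve URLs while re-serialising, then strip outer whitespace."""
    resolver = _Resolver(base_url)
    resolver.feed(fragment)
    resolver.close()
    return "".join(resolver.out).strip()


fragment = '  <p>See <a href="/notes/1">this</a> <img src="pic.png" alt="x"></p>  '
print(resolve_e_html(fragment, "https://example.com/blog/post"))
```

Whether the URLs are rewritten before or during serialisation should not change the output, so the "with" phrasing and this order ought to be equivalent in practice.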
Looks good.
Can anyone think of use-cases that would prefer URLs not to be normalized? Theoretical example I can come up with: recovery of content from a site with mf2 markup, where relative URLs to pictures, other posts, ... might be preferred.
If that's an issue, the parser should at least return the found base URL for the page, so later steps can resolve if they want. In the majority of cases, the HTML from the e-* properties has to be postprocessed anyways (filtering safe tags, replacing images with proxied versions, ...), and resolving would then add "just" another step to that.
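For illustration, a minimal sketch of how a consumer (or the parser itself) might determine that base URL, assuming the usual HTML rule that the first <base> element with an href wins and is itself resolved against the document URL. The helper name `effective_base_url` and the URLs are hypothetical and not part of any parser's API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _BaseFinder(HTMLParser):
    """Remembers the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.href is None:
            href = dict(attrs).get("href")
            if href is not None:
                self.href = href


def effective_base_url(document_html, document_url):
    """Return the URL that relative URLs in the document resolve against."""
    finder = _BaseFinder()
    finder.feed(document_html)
    finder.close()
    if finder.href is None:
        return document_url
    # A relative <base href> is itself resolved against the document URL.
    return urljoin(document_url, finder.href.strip())


doc = (
    '<html><head><base href="/archive/"></head>'
    '<body><div class="e-content"><a href="a.html">a</a></div></body></html>'
)
base = effective_base_url(doc, "https://example.com/2023/post")
print(base)
print(urljoin(base, "a.html"))
```

With the base URL returned alongside the parsed result, a later postprocessing step could resolve the relative URLs in the e-* HTML on its own.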
Sounds good. I'm +1 on @Zegnat's proposal.
@Zegnat's proposal has been implemented in the go parser, though it is not currently enabled.
This came up recently in https://github.com/snarfed/bridgy-fed/issues/390#issuecomment-1408879534. php-mf2 currently resolves relative URLs in e-*, mf2py doesn't. Shall we try again to get @Zegnat's proposal ^ into the spec? Also, mf2py and other parser contributors, any thoughts? @sknebel @tommorris @angelogladding
MicroMicro resolves relative URLs within e-* properties. I don’t remember building it that way for any other reason than to pass the microformats2 test suite.
mf2py has this ready to go (and thus I guess votes in favor of this change)
This is supported by the Rust Microformats parser and is demonstrated in the tested documentation, as this is done generically with plain text.
(Originally published at: https://jacky.wtf/2023/11/u1y4)
I'm also in favor of this change!
(Originally published at: https://jacky.wtf/2023/11/JeNH)
The parsing spec does not currently include any special handling of URLs in the html value for e-* microformats. From http://microformats.org/wiki/microformats2-parsing#parsing_an_e-_property:
However, some of the microformats tests do resolve relative URLs. See for example:
The major libraries are somewhat split on this. PHP and Ruby do resolve relative URLs. Go, Python, and Node do not resolve relative URLs.
Recent discussion in #microformats was inconclusive (though we didn't explore it too deeply)
At the very least, we need to synchronize the spec and the test cases. Personally, I'm leaning toward updating the spec to resolve relative URLs, since otherwise they are useless in any kind of embedding use-case, and they may not actually be able to be resolved, since you no longer have the <base> element.