microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing

resolve relative URLs in e-* html #38

Open willnorris opened 6 years ago

willnorris commented 6 years ago

The parsing spec does not currently include any special handling of URLs in the html value for e-* microformats. From http://microformats.org/wiki/microformats2-parsing#parsing_an_e-_property:

html: the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm, with leading/trailing whitespace removed.

However, some of the microformats tests do resolve relative URLs. See for example:

The major libraries are somewhat split on this. PHP and Ruby do resolve relative URLs. Go, Python, and Node do not resolve relative URLs.

Recent discussion in #microformats was inconclusive (though we didn't explore it too deeply).

At the very least, we need to synchronize the spec and the test cases. Personally, I'm leaning toward updating the spec to resolve relative URLs. Otherwise they are useless in any kind of embedding use-case, and a consumer may not be able to resolve them at all, since the <base> element is no longer available.
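To illustrate the <base> concern, here is a minimal sketch using Python's standard `urllib.parse.urljoin`. The page URL, the `<base href>` value, and the image path are all made-up values for illustration, not taken from any test case.

```python
from urllib.parse import urljoin

# Hypothetical values, chosen only to illustrate the problem.
page_url = "https://example.com/posts/2018/hello"  # URL the document was fetched from
base_href = "/static/"                              # value of <base href> in the page's <head>

# The effective base URL is the <base href> resolved against the document URL.
base_url = urljoin(page_url, base_href)

# A relative URL as it might appear inside an e-* property's html value.
relative_src = "img/photo.jpg"

# What the author intended (the parser still knows about <base>).
with_base = urljoin(base_url, relative_src)

# What a consumer gets if it only knows the page URL and the fragment.
without_base = urljoin(page_url, relative_src)

print(with_base)
print(without_base)
```

The two results differ, so a consumer that receives only the html fragment and the page URL would point the image at the wrong location.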

aaronpk commented 6 years ago

The lack of the <base> element in the parsed result, as well as the fact that some parsers already do this, makes me lean towards adding it to the spec as well.

Zegnat commented 6 years ago

I agree. My first reaction was to keep the HTML as close to the original as possible (including relative URLs), but the embedding use-case won me over! The relative URLs cannot stand on their own in the HTML fragment, as we would have lost meaning. Resolving them makes a lot of sense!

Zegnat commented 6 years ago

Because these issues tend to grow stale if there is no concrete proposal, here is one.

I propose replacing this line in the parsing specification for e-*:

With:

The attributes table in the HTML specification is one of the few resources that tells us where URLs are, without having to leave the parser implementer guessing about what strings in the HTML fragment constitute relative URLs. This came out of a discussion with regards to the Webmention spec (w3c/webmention#91).

The “… following the containing document’s language’s rules …” phrasing is taken straight from other places where we resolve URLs.

I have kept the “with” phrasing because I am not sure how to best sum up the actions that need to be taken by the parser. Most implementers will probably first resolve the URLs, then run the serialisation algorithm, and then remove whitespace. That’s a slightly different order from what this proposal suggests. Let me know if there are better ways to phrase this.
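The order most implementers would probably follow (resolve, serialise, trim) can be sketched roughly as below. This is only an illustration under several assumptions: it uses Python's standard `html.parser` rather than a real DOM, its output is a simplification and not the HTML spec's fragment serialisation algorithm, and `URL_ATTRIBUTES` is a small hand-picked subset of the attributes table (multi-URL attributes such as `srcset` are left out). The names `UrlResolvingSerializer` and `resolve_fragment_urls` are hypothetical and do not come from any existing parser.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: an incomplete subset of the HTML attributes table,
# keyed by element name. A real parser would cover the full table.
URL_ATTRIBUTES = {
    "a": {"href"},
    "area": {"href"},
    "audio": {"src"},
    "blockquote": {"cite"},
    "img": {"src"},
    "q": {"cite"},
    "source": {"src"},
    "video": {"src", "poster"},
}

# Void elements never get a closing tag when serialised.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class UrlResolvingSerializer(HTMLParser):
    """Re-emits a fragment, resolving known URL attributes on the way.

    Simplification: comments, doctypes and processing instructions are dropped.
    """

    def __init__(self, base_url):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # boolean attribute such as "controls"
                parts.append(name)
                continue
            # Step 1: resolve only attributes the table says hold a URL.
            if name in URL_ATTRIBUTES.get(tag, ()):
                value = urljoin(self.base_url, value)
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_startendtag(self, tag, attrs):
        # Treat "<img ... />" like "<img ...>".
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def resolve_fragment_urls(fragment, base_url):
    parser = UrlResolvingSerializer(base_url)
    parser.feed(fragment)
    parser.close()
    # Step 2: serialise; step 3: remove leading/trailing whitespace.
    return "".join(parser.out).strip()


fragment = '  <p>See <a href="../about">about</a> &amp; <img src="pic.png" alt="x"></p>\n'
result = resolve_fragment_urls(fragment, "https://example.com/blog/post/")
print(result)
```

Keying the table by element name means `href` on an `a` is resolved while an unrelated attribute of the same name on another element is left alone, which is the kind of guessing the attributes table is meant to avoid.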

sknebel commented 6 years ago

Looks good.

Can anyone think of use-cases that would prefer URLs not to be normalized? Theoretical example I can come up with: recovery of content from a site with mf2 markup, where relative URLs to pictures, other posts, ... might be preferred.

If that's an issue, the parser should at least return the found base URL for the page, so later steps can resolve if they want. In the majority of cases, the HTML from the e-* properties has to be postprocessed anyways (filtering safe tags, replacing images with proxied versions, ...), and resolving would then add "just" another step to that.

gRegorLove commented 6 years ago

Sounds good. I'm +1 on @Zegnat's proposal.

willnorris commented 6 years ago

@Zegnat's proposal has been implemented in the go parser, though it is not currently enabled.

snarfed commented 1 year ago

This came up recently in https://github.com/snarfed/bridgy-fed/issues/390#issuecomment-1408879534. php-mf2 currently resolves relative URLs in e-*; mf2py doesn't. Shall we try again to get @Zegnat's proposal ^ into the spec? Also, mf2py and other parser contributors, any thoughts? @sknebel @tommorris @angelogladding

jgarber623 commented 1 year ago

MicroMicro resolves relative URLs within e-* properties. I don’t remember building it that way for any other reason than to pass the microformats2 test suite.

sknebel commented 1 year ago

mf2py has this ready to go (and thus I guess votes in favor of this change)

jalcine commented 1 year ago

This is supported by the Rust Microformats parser and is demonstrated in the tested documentation, as this is done generically with plain text.

(Originally published at: https://jacky.wtf/2023/11/u1y4)

jalcine commented 1 year ago

I'm also in favor of this change!

(Originally published at: https://jacky.wtf/2023/11/JeNH)