microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing
14 stars 6 forks source link

proposal: drop img src (and alt) from implied p-name #35

Closed snarfed closed 6 years ago

snarfed commented 6 years ago

right now, img src and alt values often get included in implied p-names (spec; added in #17).

names are usually expected to be relatively short and human readable, but image URLs are often long, not very human-readable, and add little to no value. aaronpk/XRay#53 is related and has details. also, as a downstream user myself, this recently surprised me when it was added to mf2py 1.1.0, and is undesirable enough that i can't realistically upgrade bridgy or granary to it with this behavior.

alt text is usually human readable, but often fairly disconnected and out of context from the rest of the implied name, which results in similarly poor quality text. #16 is related and has details.

a brief straw poll of #microformats this morning found many people who agree, and no one who wanted to keep src URLs or alt values in implied p-name. should we consider dropping that part of the spec?

as an example, this html:

<p class="h-card">
  My Name
  <img src="http://xyz" />
</p>

currently gets parsed to:

{
  "type": ["h-card"],
  "properties": {
    "name": ["My Name\n   http://xyz"],
    "photo": ["http://xyz"],
  }
}
sknebel commented 6 years ago

Agreed. I believe it was a mistake on my part to mention it in #17 (presumably I just looked at the name parsing), and nobody caught that this expanded the scope of the issue.

Zegnat commented 6 years ago

To summarise the previous change: we normalised text parsing to use the exact same wording for p-*, e-*’s value, and implied name. Originally to make sure <script> and <style> tags would get correctly dropped everywhere, but in the end also included mf2’s <img> replacement step. Parsers interpret @snarfed’s example HTML the same when it does and does not include a p-name class wrapping the whole thing:

<p class="h-card"><span class="p-name">
  My Name
  <img src="http://xyz" />
</span></p>

I previously thought this was fine and would limit the amount of “surprises” for people when expanding from implied to explicit parsing, etc. But seeing how there now seem to be multiple objects in the wild that introduce breakage specifically on implied names, it will probably correct to change this.

As @snarfed said, it doesn’t look like anyone is against this. I would propose one of the following spec changes:

  1. Remove the line that says to replace <img> elements completely from the textContent sub-step on implied names.
  2. Change the line to still allow replacing it with the matching alt attribute, but never its URL. Leaving only:

    replacing any nested <img> elements with their alt attribute, if present;

Zegnat commented 6 years ago

So I wasn’t convinced about dropping alt completely. Also to keep support for something like:

<a class="h-card" href="/">
  <img src="photo.jpg" alt="Zegnat">
</a>

But it turns out that the alt will always be used if the <img> is the only content of an element. This leaves only cases where part of the name is expressed by an image while another part is expressed by text. And I am not seeing the use-case for that.

Thanks @sknebel for walking my thoughts through this. 10 minutes of IRC chat log, starting here. I am now also fully on board for dropping the line about replacing <img> completely.

snarfed commented 6 years ago

there now seem to be multiple objects in the wild that introduce breakage specifically on implied names

to clarify, i'm against including src (and probably alt) in both explicit and implicit p-name. i don't want and couldn't use an mf2 parser that includes the URL in the parsed name for your example HTML with explicit p-name either.

if that means i should change the issue title, or otherwise changes the request, let me know. happy to discuss more!

Zegnat commented 6 years ago

to clarify, i'm against including src (and probably alt) in both explicit and implicit p-name.

For some context, reversing <img> replacement in implied names changes a spec addition made earlier this year and only removes a part of it that could be labelled accidental. I have no issues with doing that right now.

Dropping <img> from all explicit p-* parsing reverses a May 2013 spec revision. I am really surprised that this behaviour wasn’t already fully implemented in all parsers.

For explicit mark-up, I agree with @sknebel on IRC, spin that to another separate discussion. To quote:

sknebel commented 6 years ago

Apparently this was discussed at the IndieWebSummit. Could someone please look up the conclusion and document it here?

(@Zegnat has volunteered to listen to the session recording and do this)

gRegorLove commented 6 years ago

We agreed to @Zegnat's proposal from May 31: https://github.com/microformats/microformats2-parsing/issues/35#issuecomment-393615508

2+ implementor votes and mf2py implements this now, so I think we're good to update the spec.

https://youtu.be/0OUy7jjUN5k?t=2h3m35s

sknebel commented 6 years ago

Martijn clearly proposed two solutions, and from the change in mf2py the second one, which only partially reverts the change @snarfed has an issue with, got picked. Why that and not remove the entire line? Did people find a reason why keeping the alt case was important?

I'm happy to accept it if it is consensus, but I'd like to understand why this one was picked, and in the youtube video it seems only to be said "Accept Martijn's proposal", not which one.

gRegorLove commented 6 years ago

Oops, I should have linked the video earlier. I've transcribed some of it below for easier reference. We discussed and agreed on the second proposal.

Topic starts https://youtu.be/0OUy7jjUN5k?t=2h1m20s, then at 2:02:23:

Tantek: so we're just going to use the alt attribute then and not the source Ryan: for implied Tantek: only for implied? Ryan: Martijn's proposal was only for implied [Brief tangential discussion about implied p-name] Tantek: I think that proposed resolution is fine. It's a conservative change, so if we want to open a separate issue for getting rid of src in general, but let's solve the particular implied problem immediately

So I believe the agreed update is:

Under implied p-name parsing rules, replace:

replacing any nested <img> elements with their alt attribute, if present; otherwise their src attribute, if present, adding a space at the beginning and end, resolving the URL if it’s relative;

With:

replacing any nested <img> elements with their alt attribute, if present;

gRegorLove commented 6 years ago

Updated in spec with this revision: http://microformats.org/wiki/index.php?title=microformats2-parsing&oldid=66871

This issue can be closed now.