microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing
14 stars 6 forks source link

reduce instances when p-name is implied #6

Closed tantek closed 6 years ago

tantek commented 7 years ago

This is a split-off of part of http://microformats.org/wiki/microformats2-parsing-issues#implied_properties_when_an_explicit_class_is_provided that was left unresolved without consensus (since that issue was resolved with just dealing with u-url and nothing more).

Experience has shown there are a number of instances where implied p-name produces something unhelpful. Typically this happens with otherwise large microformats which for some reason omitted the name (e.g. an h-feed, or h-entry that has no author supplied name etc.).

The use-case for which implied p-name was designed was for small microformats, e.g. just a hyperlink with h-* class on it, or maybe just a simple set of nested elements (without siblings). It has some similar use-cases too. But by the point you have elements with multiple explicit properties specified inside, rarely does the current implied p-name rule produce anything useful.

One thing that would help is links to specific examples where excessive or otherwise "useless" p-names are being implied, where no p-name would actually be preferable (from a consuming code point of view).

tantek commented 7 years ago

From the end of the wiki discussion, one straw proposal was:

"any explicit p-* property on an element stops implied p-name"

(this sounds a bit ambiguous and could be reworded, but I think the general intent / principle is workable)

aaronpk commented 7 years ago

I'm not clear on what that proposal actually means. Does "stop" mean no implied name is generated at all? Does it mean that property with p-* is excluded from the implied name?

tantek commented 7 years ago

Either possibility could be pursued. Let's start documenting examples of existing failures / misbehaviors so we can look at specifics.

aaronpk commented 7 years ago

Here is an example that came up today: https://huffduffer.com/tags/indieweb

This boils down to a structure like below:

<ol class="h-feed">
  <li class="h-entry">
    <h3 class="p-name">Episode Name</h3>
    <div class="e-content">episode description...</div>
    <audio class="u-audio" src="..."></audio>
    <p class="p-author h-card">...</p>
  </li>
</ol>

The result of the current implied p-name rule is all the contents of the individual h-entrys end up being in the implied p-name of the h-feed.

mblaney commented 7 years ago

Would it make sense for implied p-name to skip content from h-* children? It would mean in the above case that the implied p-name for the above h-feed is empty.

tantek commented 6 years ago

Restating for clarity, and adding children as another way to stop implying p-name

"p-name MUST NOT be implied if there are any explicit p-* properties or any nested microformats"

Need consensus positive feedback from parser developers and one proof of implementation to proceed (and no objections obviously). Yes you can thumbs-up to indicate positive feedback 😃

gRegorLove commented 6 years ago

Here's a Bridgy example: https://brid-gy.appspot.com/comment/twitter/miklb/948601132397588481/949306079623766016

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "author": [
                    {
                        "type": [
                            "h-card"
                        ],
                        "properties": {
                            "uid": [
                                "tag:twitter.com,2013:deaaqua"
                            ],
                            "numeric-id": [
                                "26367081"
                            ],
                            "name": [
                                "Amanda Rocksalot"
                            ],
                            "nickname": [
                                "deaaqua"
                            ],
                            "url": [
                                "https://www.instagram.com/deaaqua/",
                                "https://twitter.com/deaaqua"
                            ],
                            "photo": [
                                "https://pbs.twimg.com/profile_images/1811120958/Buttercups_and_Books_by_pinkparis1233.jpg"
                            ]
                        },
                        "value": "Amanda Rocksalot"
                    }
                ],
                "category": [
                    {
                        "type": [
                            "h-card"
                        ],
                        "properties": {
                            "uid": [
                                "tag:twitter.com,2013:GatoIshwary"
                            ],
                            "name": [
                                "Gato Ishwary"
                            ],
                            "url": [
                                "https://twitter.com/GatoIshwary"
                            ]
                        },
                        "value": "https://twitter.com/GatoIshwary"
                    },
                    {
                        "type": [
                            "h-card"
                        ],
                        "properties": {
                            "uid": [
                                "tag:twitter.com,2013:miklb"
                            ],
                            "name": [
                                "Michael Bishop"
                            ],
                            "url": [
                                "https://twitter.com/miklb",
                                "https://miklb.com/"
                            ]
                        },
                        "value": "https://twitter.com/miklb"
                    }
                ],
                "uid": [
                    "tag:twitter.com,2013:949306079623766016"
                ],
                "url": [
                    "https://twitter.com/deaaqua/status/949306079623766016"
                ],
                "video": [
                    "https://video.twimg.com/tweet_video/DSyc1j3VQAswKgR.mp4"
                ],
                "in-reply-to": [
                    "https://twitter.com/GatoIshwary/status/949250988946395137",
                    "https://miklb.com/2018/01/3012/"
                ],
                "published": [
                    "2018-01-05T15:46:21+00:00"
                ],
                "name": [
                    "tag:twitter.com,2013:949306079623766016\n  \n  2018-01-05T15:46:21+00:00\n      Amanda Rocksalot\n\n    deaaqua\n    \n\n  https://twitter.com/deaaqua/status/949306079623766016\n  \n  \n  \n   \nYour browser does not support the video tag. Click here to view directly. \n     Gato Ishwary\n    \n    \n  \n\n  \n     Michael Bishop"
                ]
            }
        }
    ],
    "rels": {},
    "debug": {
        "package": "https://packagist.org/packages/mf2/mf2",
        "version": "v0.3.2",
        "note": [
            "This output was generated from the php-mf2 library available at https://github.com/indieweb/php-mf2",
            "Please file any issues with the parser at https://github.com/indieweb/php-mf2/issues"
        ]
    }
}
aaronpk commented 6 years ago

I encountered a similar case today working on https://github.com/aaronpk/XRay/issues/52 where a photo property seemed to come out of nowhere.

In this example, I would not expect the photo property to be implied:

<div class="h-entry"><p class="e-content p-name">Hello World <img src="example.jpg"></p></div>

From the authoring perspective, I would expect the act of defining the e-content class should prevent the img tag inside from turning into an implied property on the parent h-entry.

Zegnat commented 6 years ago

The implied photo in @aaronpk’s example is done through the rule:

  • else if .h-x>:only-child:not[.h-*]>img[src]:only-of-type:not[.h-*], then use that img’s src for photo

This rule looks pretty complex. It seems to exist mostly in case of some wrapper element being inserted between the root h-x and the data we want to use for extracting implied values. That’s what I assume to be the original use case because the in-between element has to be the only element within the root h-x.

Either the existence of explicit properties should stop the implied parsing, or the rules should incorporate some form of limitation in case of the wrapper element. Adding :not[.p-*]:not[.u-*]:not[.dt-*]:not[.e-*] to the :only-child in-between element selector would be an option here, but I personally don’t like making these implied values rules even bigger and more convoluted. They are already a source of many questions.

Zegnat commented 6 years ago

Just a thought I just had in chat when I realised e-content wasn’t going to stop a name property to be implied per @tantek’s proposed new limitation:

"p-name MUST NOT be implied if there are any explicit p-* properties or any nested microformats"

Is there any reason why we limit this to p- properties only and not all explicit properties? I still would not be able to remove the redundant p-name from my blog posts. Here is my HTML:

    <article class="h-entry">
      <main class="e-content p-name">
        <p>Post</p>
      </main>
      <footer>
        <p>Published on <time class="dt-published" datetime="2017-11-07T13:27:49+01:00">7<sup>th</sup> November 2017 13:27:49</time>.</p>
                <p class="editorlink">[<a rel="edit" href="#">edit</a>] [<a href="#" class="u-url u-uid">permalink</a>]</p>
      </footer>
    </article>

This h-entry includes an e- property, a dt- property, and two u- properties. So without p-name there are no other p- properties to stop the name from being implied.

Zegnat commented 6 years ago

Tantek’s home page is another example of h-entry markup that will need to keep explicitly setting p-name because it would get implied wrong otherwise:

<li class="h-entry hentry as-note">
<p class="p-name entry-title e-content entry-content article">Post</p>
<span class="info footer"><a href="#" class="dt-published published dt-updated updated u-url u-uid"><time class="value" datetime="05:11-0800">05:11</time> on <time class="value">2018-01-14</time></a></span>
</li>

It (like mine) has several other properties defined for a post but no additional p- properties.

Zegnat commented 6 years ago

Today I spotted another live example of a post where the parsed name will not be fixed by the proposed change. Hat tip to @gRegorLove, who linked to Upon 2020 first. See chat. The post [he linked to]() includes a p-in-reply-to and will indeed be fixed by the proposed change. But…

@jernst’s website uses the following HTML for note posts (edited to remove non-mf2 classes and emty unrelated attributes for easier reading):

<li id="" class="h-entry">
  <a href="" title="Posts by jernst ( @jernst )" class="">
    <img alt='' src='' srcset='' class='' height='48' width='48' />
  </a>
  <h4>
    <a href="" title="Posts by jernst ( @jernst )">jernst</a>
    <span class="">
      <abbr class="dt-published" title="2018-01-24T16:56:18Z">
        08:56 <em>on</em> January 24, 2018
      </abbr>
      <span class="">
        <a href="" class="u-url" title="Permalink">Permalink</a>
        <a rel="nofollow" class="" href="">Log in to leave a Comment</a>
      </span>
      <span class="">&nbsp;</span>
    </span>
  </h4>
  <div id="" class="e-content">
    <p>
      My &#8220;invited guest post&#8221; (is that a thing?) on opportunities
      of the #IndieWeb for small businesses went live on GoDaddy&#8217;s site:
      <a href="" rel="nofollow">https://www.godaddy.com/garage/indieweb-facebook-opportunities/</a>
    </p>
  </div>
  ⋮
</li>

This has a u-url, dt-published, and e-content. Even under the current proposal, as there is no other p- property, the name will be implied and include meta data. (As shown by Loqi).

Also note that if the reply post linked by @gRegorLove had used u-in-reply-to it wouldn’t be fixed either. This is a great example of why I do not think we should limit it to p- properties. That post would not be expressing any difference in intent between u-in-reply-to and p-in-reply-to but the parsers will assign it an implied name in one case and not for the other. I think this will lead to extra confusion.

kevinmarks commented 6 years ago

Currently all mf2 items have a name property, because of implied name. If we change this, some code using the parsers could fail if it assumes a name is present. Not sure if this a key issue.

Zegnat commented 6 years ago

Documenting from yesterday’s chat, because nobody could remember this and I can’t find it elsewhere:

So I think the (as of now) latest proposed spec change would be:

Currently all mf2 items have a name property, because of implied name. If we change this, some code using the parsers could fail if it assumes a name is present.

Here too I will just document an answer given in chat, this time by @aaronpk:

may just mean it has to be released as a major version number [of the parsers]

Assuming something like semver is being used, any major version bump should signify possible API changes to the user. I too don’t think that would be an issue.

There might be an issue if someone is using parsers-as-a-service, e.g. always getting their mf2 parser output from php.microformats.io. But I don’t think anyone ever advertised their online parsers as a service?

aaronpk commented 6 years ago

I (and likely others) use xray.p3k.io as a service, so I will have to consider what to do in that case. It doesn't return the Microformats JSON, it converts it to its jf2 format first. I may just return an empty string for name if there is no mf2 name property.

Zegnat commented 6 years ago

It doesn’t return the Microformats JSON, it converts it to its jf2 format first.

It might be worth opening an issue on jf2 to see if they want to keep an explicit name property. Their “author” syntax says it “consists of a name” and more, but that’s not marked as a MUST. For a jf2feed on the other hand name is a SHOULD.

The real question is, do you see any reasons for postponing this change because of your use of a mf2 parser as a service? I think not?

aaronpk commented 6 years ago

Nope, wasn't intending to hold things up, just wanted to put that there for the record.

I agree with the current proposal of having p-* and e-* and h-* stop the implied name.

kartikprabhu commented 6 years ago

Is there a consensus on this issue? If yes, I can look into adding this to mf2py.

Zegnat commented 6 years ago

At @kartikprabhu’s request here are some super simplified examples of HTML where unwanted metadata (a footer of an article) would be swept up in the name even though I only want content or like-of to show. Before and afters have been included. Output taken from the PHP parser.

Case with e-content and no p-name

HTML

<article class="h-entry">
  <div class="e-content">
    <p>Wanted content.</p>
  </div>
  <footer>
    <p>Footer to be ignored.</p>
  </footer>
</article>

Current output

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "content": [
                    {
                        "html": "\r\n    <p>Wanted content.</p>\r\n  ",
                        "value": "Wanted content."
                    }
                ],
                "name": [
                    "Wanted content. \r\n   \r\n  Footer to be ignored."
                ]
            }
        }
    ],
    "rels": {}
}

Expected output

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "content": [
                    {
                        "html": "\r\n    <p>Wanted content.</p>\r\n  ",
                        "value": "Wanted content."
                    }
                ]
            }
        }
    ],
    "rels": {}
}

Case with p-content and no p-name

HTML

<article class="h-entry">
  <div class="p-content">
    <p>Wanted content.</p>
  </div>
  <footer>
    <p>Footer to be ignored.</p>
  </footer>
</article>

Current output

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "content": [
                    "Wanted content."
                ],
                "name": [
                    "Wanted content. \r\n   \r\n  Footer to be ignored."
                ]
            }
        }
    ],
    "rels": {}
}

Expected output

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "content": [
                    "Wanted content."
                ]
            }
        }
    ],
    "rels": {}
}

Case with nested microformat and no p-name

HTML

<article class="h-entry">
  <div class="u-like-of h-cite">
    <p>I really like <a class="p-name u-url" href="http://microformats.org/">Microformats</a></p>
  </div>
  <footer>
    <p>Footer to be ignored.</p>
  </footer>
</article>

Current output

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "like-of": [
                    {
                        "type": [
                            "h-cite"
                        ],
                        "properties": {
                            "name": [
                                "Microformats"
                            ],
                            "url": [
                                "http://microformats.org/"
                            ]
                        },
                        "value": "http://microformats.org/"
                    }
                ],
                "name": [
                    "I really like Microformats \r\n   \r\n  Footer to be ignored."
                ]
            }
        }
    ],
    "rels": {}
}

Expected output

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "like-of": [
                    {
                        "type": [
                            "h-cite"
                        ],
                        "properties": {
                            "name": [
                                "Microformats"
                            ],
                            "url": [
                                "http://microformats.org/"
                            ]
                        },
                        "value": "http://microformats.org/"
                    }
                ]
            }
        }
    ],
    "rels": {}
}
kartikprabhu commented 6 years ago

Implemented in mf2py here https://github.com/kartikprabhu/mf2py/tree/implied-name-fix Will push to the main version if this gets consensus and makes it to the spec.

Zegnat commented 6 years ago

With an implementation for mf2py by @kartikprabhu and an intent to implement in mf2php by @gRegorLove, is that enough to update the spec? @tantek?

kartikprabhu commented 6 years ago

it seems the output from mf2py depends on which internal HTML parser is being used ( see: https://github.com/kartikprabhu/mf2py/issues/58#issuecomment-366533803 ), but this should count as an internal bug for mf2py to be fixed.

So I am +1 for this being included in the spec.

kartikprabhu commented 6 years ago

implied-name-stopping has been implemented in an experimental version of mf2py.

so +1

Zegnat commented 6 years ago

This has been added to the spec, at revision 20:27, 4 March 2018. Resolved.