microformats / php-mf2

php-mf2 is a pure, generic microformats-2 parser for PHP. It makes HTML as easy to consume as JSON.
Creative Commons Zero v1.0 Universal

Experimental language parsing #96

Open voxpelli opened 8 years ago

voxpelli commented 8 years ago

It would be valuable to get a working proof of concept of language parsing built for one of the mf2 parsers. The php-mf2 library and the JavaScript one are two good candidates for that.

The discussion around language parsing is happening here: http://microformats.org/wiki/microformats2-parsing-brainstorming#Parse_language_information

There's a similar issue for the JavaScript mf2 parser here: https://github.com/glennjones/microformat-shiv/issues/22. The original PR that created a proof of concept for an old version of the JavaScript mf2 parser can be found here: https://github.com/glennjones/microformat-node/pull/23

To implement language parsing in php-mf2, one can probably use the fact that a DOMNode has a parentNode property (see docs) and traverse the document tree upwards until one reaches the first lang= attribute or the top of the tree. That gives the language of a node (apart from defaults that may have been specified elsewhere, e.g. in the HTTP response; see the HTML5 docs), and from that one can decide whether to add the language attribute to the output.
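A minimal sketch of that upward traversal, assuming PHP's built-in DOM extension. The helper name nearestLang() is hypothetical and not part of php-mf2, and the sketch ignores defaults set outside the markup, such as an HTTP header:

```php
<?php
// Hypothetical helper (not php-mf2 API): return the value of the closest
// lang attribute on the node itself or any of its ancestors, or null if none.
function nearestLang(DOMNode $node): ?string
{
    for ($current = $node; $current !== null; $current = $current->parentNode) {
        // Only elements carry attributes; text nodes and the document do not.
        if ($current instanceof DOMElement && $current->hasAttribute('lang')) {
            return $current->getAttribute('lang');
        }
    }
    return null; // reached the top of the tree without finding lang=
}

$html = '<div class="h-entry" lang="sv">'
      . '<h1 class="p-name">En svensk titel</h1>'
      . '<div class="e-content" lang="en">With an <em>english</em> summary</div>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$title = $xpath->query('//h1')->item(0); // inherits lang from the h-entry
$em    = $xpath->query('//em')->item(0); // inherits lang from the e-content

echo nearestLang($title), "\n"; // sv
echo nearestLang($em), "\n";    // en
```

Returning null rather than a default leaves it to the caller to decide whether to omit the language from the output or fall back to something else.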

Update: As @gRegorLove pointed out on IRC, it may be hard to add the proposed output without breaking backwards compatibility. The new output would therefore have to be introduced either in a new major version or, probably preferably, as an opt-in feature flag for now. Those who want to use language data right away could enable the flag, while those who prefer to wait for a future major version before supporting the new output could do so.

gRegorLove commented 8 years ago

I'm interested in working on this as I'm trying to add mf2 parsing to https://github.com/fguillot/picoFeed and it currently supports language detection for XML feeds.

Recent conversation: https://indiewebcamp.com/irc/2016-05-07#t1462646589527

A tricky scenario that @voxpelli raised with nested p-* and languages specific to them: https://indiewebcamp.com/irc/2016-05-07#t1462651125104

aaronpk commented 7 years ago

@gRegorLove I'm looking at the parsed result and it looks like it's including an html-lang property in the wrong place.

<div class="h-entry" lang="sv" id="postfrag123">
  <h1 class="p-name">En svensk titel</h1>
  <div class="e-content" lang="en">With an <em>english</em> summary</div>
  <div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>

{
    "type": [
        "h-entry"
    ],
    "properties": {
        "name": [
            "En svensk titel"
        ],
        "content": [
            {
                "html": "With an <em>english<\/em> summary",
                "value": "With an english summary",
                "html-lang": "en"
            },
            {
                "html": "Och <em>svensk<\/em> huvudtext",
                "value": "Och svensk huvudtext",
                "html-lang": "sv"
            }
        ],
        "html-lang": "sv"
    }
}

The html-lang property in the content is correct, but there's also an html-lang property inside properties which isn't what's described on the brainstorming page.

jkphl commented 7 years ago

Yeah ... had to solve this locally as well yesterday (it kept breaking iteration over the properties, because the value isn't an array).
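A minimal sketch of the kind of local guard this implies, assuming a consumer that expects every entry in properties to be a list of values. The helper countPropertyValues() is hypothetical and the JSON is a shortened version of the output above:

```php
<?php
// Shortened parser output containing the stray string-valued "html-lang".
$json = '{"type":["h-entry"],"properties":'
      . '{"name":["En svensk titel"],"html-lang":"sv"}}';
$item = json_decode($json, true);

// Hypothetical consumer helper: count the values of each property,
// skipping any entry that is not an array of values.
function countPropertyValues(array $properties): array
{
    $counts = [];
    foreach ($properties as $name => $values) {
        if (!is_array($values)) {
            continue; // defensive guard against the misplaced "html-lang"
        }
        $counts[$name] = count($values);
    }
    return $counts;
}

$counts = countPropertyValues($item['properties']);
print_r($counts);
```

Without the is_array() check, code that loops over each property's values would hit a plain string where it expects an array.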

aaronpk commented 7 years ago

I am moving the language parsing behind a feature flag until this is sorted out. That way you can opt in to have the language parsing happen, but must be aware that it's still experimental.

jkphl commented 7 years ago

Ok. I'm generally interested as other formats support languages as well. Still working on implementing it though.

gRegorLove commented 7 years ago

Oops. I'll add some explicit tests for that and work on the fix.

aaronpk commented 7 years ago

Fixed in #124!

I'll push out a new release with this change once #112 is done too!

gRegorLove commented 7 years ago

@aaronpk Before you push out a new release, will need to switch back to "html-lang" per https://chat.indieweb.org/microformats/2017-05-30/1496166813294000

Edit: disregard. Per later conversation, "lang" doesn't appear at the same level as any mf properties in the parsed results, so shouldn't cause conflicts.