microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing

`rel-urls` Parsing Issues #50

Open jgarber623 opened 4 years ago

jgarber623 commented 4 years ago

Section 1.4 of the microformats2 parsing specification outlines how to parse link elements (<a>, <link>, etc.) for rel values and defines the JSON output structure.

The rels structure is reasonably straightforward and maps one-to-one with matched elements:

<a rel="author" href="http://example.com/a">author a</a>
<a rel="author" href="http://example.com/b">author b</a>
<a rel="in-reply-to" href="http://example.com/1">post 1</a>
<a rel="in-reply-to" href="http://example.com/2">post 2</a>
<a rel="alternate home"
   href="http://example.com/fr"
   media="handheld"
   hreflang="fr">French mobile homepage</a>

…results in…

{
  "rels": { 
    "author": [ "http://example.com/a", "http://example.com/b" ],
    "in-reply-to": [ "http://example.com/1", "http://example.com/2" ],
    "alternate": [ "http://example.com/fr" ],
    "home": [ "http://example.com/fr" ]
  }
}
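As a rough illustration of that one-to-one mapping, here is a minimal Python sketch that collects `rels` using only the standard library. The `RelCollector` class and `collect_rels` function are hypothetical names made up for this example, and the set of elements inspected (`a`, `area`, `link`) is an assumption about what "etc." covers above. It is not how any of the parsers discussed here is implemented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelCollector(HTMLParser):
    """Hypothetical helper: gathers rel values from hyperlink elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rels = {}

    def handle_starttag(self, tag, attrs):
        # Assumption: only these elements carry rel values of interest.
        if tag not in ("a", "area", "link"):
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = attributes.get("rel")
        if href is None or not rel:
            return
        # Resolve the href against the document URL to get an absolute URL.
        url = urljoin(self.base_url, href)
        # A rel attribute is a space-separated list; each value gets its own key.
        for value in rel.split():
            urls = self.rels.setdefault(value, [])
            if url not in urls:
                urls.append(url)


def collect_rels(html, base_url):
    """Return a dict mapping each rel value to the list of URLs that use it."""
    collector = RelCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.rels
```

Feeding the five-link markup above through `collect_rels` with `http://example.com/` as the base URL yields the four keys shown in the JSON, with `alternate` and `home` both pointing at the same URL.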

The parsing rules break down slightly when compiling results for the rel-urls structure. For each unique URL, the resulting JSON hash should include a key rels whose value is an array of the rel values found across all matched link elements for that URL. The spec also defines rules for parsing various attributes (hreflang, media, title, and type) and the node's text value. These extended attributes are specified as strings (not arrays), so when several elements share a URL, only one value per attribute survives. The result is data loss and a seemingly inconsistent parsing pattern.

Parser Results

Parser developers have implemented this feature with differing results.

Given the markup:

<link rel="me" href="https://sixtwothree.org">

<a rel="me" href="https://sixtwothree.org">Jason Garber</a>
<a rel="home" href="https://sixtwothree.org">Go back home</a>

…the parsers provide differing result JSON.

Go

{
  "items": [],
  "rels": {
    "home": ["https://sixtwothree.org"],
    "me": ["https://sixtwothree.org"]
  },
  "rel-urls": {
    "https://sixtwothree.org": {
      "rels": ["me"]
    }
  }
}

PHP

{
  "items": [],
  "rels": {
    "me": ["https://sixtwothree.org"],
    "home": ["https://sixtwothree.org"]
  },
  "rel-urls": {
    "https://sixtwothree.org": {
      "text": "Jason Garber",
      "rels": ["home", "me"]
    }
  }
}

Python

{
  "items": [],
  "rels": {
    "me": ["https://sixtwothree.org"],
    "home": ["https://sixtwothree.org"]
  },
  "rel-urls": {
    "https://sixtwothree.org": {
      "text": "",
      "rels": ["home", "me"]
    }
  }
}

Ruby

{
  "items": [],
  "rels": {
    "me": ["https://sixtwothree.org"],
    "home": ["https://sixtwothree.org"]
  },
  "rel-urls": {
    "https://sixtwothree.org": {
      "rels": ["home"],
      "text": "Jason Garber"
    }
  }
}

Note: The Node parser on microformats.io appears to be offline.

So…

The test suite's rel tests appear to conform to the spec as it's written today. What I'd like help sorting out is what seems like an arbitrary (or at least undocumented) decision to only aggregate rel attribute values in the rel-urls result structure. The extended attributes are, per the spec, worth capturing, but not worth capturing as arrays. That seems strange.

Can someone shed some light on the subject and/or can we update the spec to be more clear or to change behavior?

Edit 1: #39 is tangentially related to this, as well.

Edit 2: #32 is also related to this.

gRegorLove commented 4 years ago

Here's the previous discussion and resolution.

My reading of the current spec is that this is correct for rel-urls:

{
  "items": [],
  "rels": {
    "me": ["https://sixtwothree.org"],
    "home": ["https://sixtwothree.org"]
  },
  "rel-urls": {
    "https://sixtwothree.org": {
      "text": "Jason Garber",
      "rels": ["home", "me"]
    }
  }
}

Parsing the first link adds me to the rels; parsing the second adds the text property; parsing the third adds home to the rels.
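The merge described in that reading can be sketched in Python as follows. This is only an illustration of a "first value wins" strategy. `build_rel_urls` and the input shape (a list of dicts with `url`, `rels`, and optional extended attributes) are hypothetical. Skipping empty values, so that the text-less `<link>` does not claim the `text` key, is an assumption made to match the walk-through above.

```python
EXTENDED_KEYS = ("hreflang", "media", "title", "type", "text")


def build_rel_urls(links):
    """Merge link records per URL; rels accumulate, other keys keep the first value."""
    rel_urls = {}
    for link in links:
        entry = rel_urls.setdefault(link["url"], {"rels": []})
        # rel values are aggregated across every element sharing the URL.
        for rel in link["rels"]:
            if rel not in entry["rels"]:
                entry["rels"].append(rel)
        for key in EXTENDED_KEYS:
            value = link.get(key)
            # Assumption: empty values are ignored; the first non-empty value wins
            # and later values for the same key are discarded.
            if value and key not in entry:
                entry[key] = value
    return rel_urls


links = [
    {"url": "https://sixtwothree.org", "rels": ["me"], "text": ""},
    {"url": "https://sixtwothree.org", "rels": ["me"], "text": "Jason Garber"},
    {"url": "https://sixtwothree.org", "rels": ["home"], "text": "Go back home"},
]
result = build_rel_urls(links)
```

Here `result` holds a single entry whose `text` is "Jason Garber", and "Go back home" is dropped. The sketch keeps `rels` in insertion order and makes no claim about how a conforming parser should order them.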

Edit: Just noticed that this does lose the text value of the third link, since that's already set by the second one. Hm.

jgarber623 commented 4 years ago

Tagging @kevinmarks and @sknebel on this one.

Building on something Kevin mentioned in chat, say you're viewing a blog post in a Web browser and the page advertises alternate versions available at the same URL but with responses dictated by the incoming request's Accept header:

<link rel="alternate" href="https://sixtwothree.org/posts/877-days" type="application/json">
<link rel="alternate" href="https://sixtwothree.org/posts/877-days" type="text/markdown">

The above example is a modified version of some markup I have on my own website. Both versions are curl-able by issuing the following commands:

curl -H 'Accept: application/json' https://sixtwothree.org/posts/877-days
curl -H 'Accept: text/markdown' https://sixtwothree.org/posts/877-days

With the aforementioned parsers on microformats.io, you'd miss out on the text/markdown alternate version because the type key in the rel-urls structure is a simple string, not an aggregate array of matched values.

The same would be true of hreflang, media, etc., but the use case for that data is a little less obvious to me.
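To make the difference concrete, here is a small Python sketch of what aggregating the extended attributes into arrays could look like. The output shape is hypothetical and is not something the spec defines. `build_rel_urls_aggregated` and the input records are made up for this example.

```python
EXTENDED_KEYS = ("hreflang", "media", "title", "type", "text")


def build_rel_urls_aggregated(links):
    """Hypothetical variant: every extended attribute becomes a list of unique values."""
    rel_urls = {}
    for link in links:
        entry = rel_urls.setdefault(link["url"], {"rels": []})
        for rel in link["rels"]:
            if rel not in entry["rels"]:
                entry["rels"].append(rel)
        for key in EXTENDED_KEYS:
            value = link.get(key)
            # Assumption: empty or missing values are skipped rather than recorded.
            if value:
                values = entry.setdefault(key, [])
                if value not in values:
                    values.append(value)
    return rel_urls


url = "https://sixtwothree.org/posts/877-days"
alternates = build_rel_urls_aggregated([
    {"url": url, "rels": ["alternate"], "type": "application/json"},
    {"url": url, "rels": ["alternate"], "type": "text/markdown"},
])
```

With this shape, `alternates[url]["type"]` contains both media types, so a consumer could discover the text/markdown version. The trade-off is that the arrays no longer record which attribute values appeared together on the same element.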

aimee-gm commented 4 years ago

@jgarber623 thanks for raising this. I too found this ambiguous while implementing a parser.

The output of https://aimee-gm.github.io/microformats-parser/ (a JavaScript parser) is:

{
  "rels": {
    "me": [
      "https://sixtwothree.org"
    ],
    "home": [
      "https://sixtwothree.org"
    ]
  },
  "rel-urls": {
    "https://sixtwothree.org": {
      "rels": [
        "me",
        "home"
      ],
      "text": ""
    }
  },
  "items": []
}

I also agree with @gRegorLove that this should have a non-empty string text value.