microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing

Is there a definitive version of a type array? #22

Closed Zegnat closed 6 years ago

Zegnat commented 6 years ago

Say we have this HTML snippet:

<div class="h-entry h-cite h-entry"></div>

What would you expect the following step in the parsing spec to return?

  • type: [array of microformat "h-*" type(s) on the element],

The PHP parser gives us the array in alphabetical order:

"type": [
  "h-cite",
  "h-entry",
  "h-entry"
]

While the Go and Python parsers stick to the order as given:

"type": [
  "h-entry",
  "h-cite",
  "h-entry"
]

In addition, I would like to ask whether people expect this array to contain only unique classes. Is there any use in returning [ "h-entry", "h-entry" ]?

Or maybe none of this needs to be defined and the answer to the question in the topic is just “an unordered list of classes starting in h-”.
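To make the candidate answers concrete, here is a minimal sketch that derives all three readings from the class attribute of the snippet above. The helper name `hClasses` is hypothetical and not taken from any parser, and the simple `h-` prefix check is an assumption that ignores any stricter class name rules a real parser may apply.

```javascript
// Hypothetical helper: collect tokens starting with "h-" from a
// class attribute value, in source order, duplicates included.
function hClasses(classAttr) {
  return classAttr
    .split(/[\t\n\f\r ]+/) // split on ASCII whitespace
    .filter((token) => token.startsWith('h-'))
}

const sourceOrder = hClasses('h-entry h-cite h-entry')
const alphabetical = [...sourceOrder].sort() // sort a copy, keep duplicates
const uniqueSourceOrder = [...new Set(sourceOrder)] // Set keeps first-seen order

console.log(sourceOrder) // [ 'h-entry', 'h-cite', 'h-entry' ]
console.log(alphabetical) // [ 'h-cite', 'h-entry', 'h-entry' ]
console.log(uniqueSourceOrder) // [ 'h-entry', 'h-cite' ]
```

The first result matches what the Go and Python parsers return, the second matches the PHP parser, and the third is what a unique, source-ordered reading would give.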

kartikprabhu commented 6 years ago

I don't see any use for this to be defined with this precision. I always have thought of it as “an unordered list of classes starting in h-”

Zegnat commented 6 years ago

> I always have thought of it as “an unordered list of classes starting in h-”

Maybe that would be a better description then? The way it specifies “"h-*" type(s)” rather than e.g. classes made me think it meant unique values, but I may be the only one who read it that way. (Which in itself was a reason for me to open this issue.)

There is also the question of what you mean by “classes”. Do you mean anything in the class attribute on the HTML element, or everything in the DOM classList?

The DOM classList property lists only unique classes, in source order, and the spec is very specific about that. This is because classList returns a DOMTokenList, which in turn is an ordered set created by parsing the element’s class attribute. (My investment in the DOM spec may be another reason why I thought unique items would make sense.)

Here is a quick comparison between using the DOM method or doing your own string manipulation on the class attribute:

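// DOM method: classList is an ordered set, so duplicates are dropped.
let fromClassList = []
for (const value of element.classList) {
  if (value.startsWith('h-')) {
    fromClassList.push(value)
  }
}
// fromClassList is now [ "h-entry", "h-cite" ]

// String manipulation: split the raw attribute on ASCII whitespace; duplicates survive.
let fromAttribute = []
for (const value of element.getAttribute('class').split(/[\x09\x0A\x0C\x0D\x20]+/)) {
  if (value.startsWith('h-')) {
    fromAttribute.push(value)
  }
}
// fromAttribute is now [ "h-entry", "h-cite", "h-entry" ]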

I feel like following the DOM specification and having the HTML parser handle parsing the attribute into a token list would be a good move for microformats. You are going to have to dive into the specification to figure out how to split the class attribute value anyway.

Following the spec also gives us ordered lists that should be the same between implementations. WHATWG specifically calls out interoperability as a reason for using ordered lists as often as possible (from the ordered set link above):

Almost all cases on the web platform require an ordered set, instead of an unordered one, since interoperability requires that any developer-exposed enumeration of the set’s contents be consistent between browsers. In those cases where order is not important, we still use ordered sets; implementations can optimize based on the fact that the order is not observable.

kartikprabhu commented 6 years ago

Again not really sure this is relevant in practice.

Zegnat commented 6 years ago

> Again not really sure this is relevant in practice.

It probably isn’t relevant for parsers. But it is relevant for tests and things like a JSON schema for validating microformats in JSON (e.g. for Micropub). If the spec does not define what the type collection looks like, how are you going to know whether it is valid in the first place?

kartikprabhu commented 6 years ago

Maybe people who understand validating mf2 can give more input. I am not sure that validating mf2 outputs from parsers makes much sense.

Zegnat commented 6 years ago

As came up today, per the latest JSON spec RFC 8259:

An array is an ordered sequence of zero or more values.

As opposed to:

An object is an unordered collection of zero or more name/value pairs, […]

If we were to strictly compare the JSON output of different parsers, keeping array order intact, the outputs would conflict with each other.
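As a rough illustration of that conflict, the sketch below compares the two type arrays quoted at the top of this issue, first with order intact and then ignoring order. The helper names `sameArray` and `sameMembers` are hypothetical, the kind of thing a test suite might define, and are not part of any parser.

```javascript
// Outputs quoted earlier in this issue for
// <div class="h-entry h-cite h-entry"></div>
const phpOutput = { type: ['h-cite', 'h-entry', 'h-entry'] }
const goOutput = { type: ['h-entry', 'h-cite', 'h-entry'] }

// Order-sensitive comparison, which is what a JSON array implies.
function sameArray(a, b) {
  return a.length === b.length && a.every((value, index) => value === b[index])
}

// Order-insensitive comparison: compare sorted copies of both arrays.
function sameMembers(a, b) {
  return sameArray([...a].sort(), [...b].sort())
}

console.log(sameArray(phpOutput.type, goOutput.type)) // false
console.log(sameMembers(phpOutput.type, goOutput.type)) // true
```

A test suite that compares parser output strictly would flag these two parsers as disagreeing, even though both list the same classes.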

Zegnat commented 6 years ago

This issue is mostly superseded by #29 and #30. The first addresses order, the latter duplicates. Unlike this issue, they are about all microformats arrays rather than just type.

Zegnat commented 6 years ago

#29 and #30 have been closed and now define exactly how a type array should be returned: unique items, sorted alphabetically.
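A minimal sketch of that outcome, under some assumptions: `typeArray` and `isNormalizedTypeArray` are hypothetical names, the `h-` prefix check is simplified, and JavaScript's default sort (by code unit) stands in for "alphabetically", which should be equivalent for lowercase ASCII class names.

```javascript
// Hypothetical helper: unique "h-" classes, sorted alphabetically.
function typeArray(classAttr) {
  const tokens = classAttr
    .split(/[\t\n\f\r ]+/) // split on ASCII whitespace
    .filter((token) => token.startsWith('h-'))
  return [...new Set(tokens)].sort() // dedupe first, then sort
}

// Hypothetical check for tests: is an array already unique and sorted?
function isNormalizedTypeArray(types) {
  const normalized = [...new Set(types)].sort()
  return (
    types.length === normalized.length &&
    types.every((value, index) => value === normalized[index])
  )
}

console.log(typeArray('h-entry h-cite h-entry')) // [ 'h-cite', 'h-entry' ]
console.log(isNormalizedTypeArray(['h-cite', 'h-entry'])) // true
console.log(isNormalizedTypeArray(['h-entry', 'h-cite', 'h-entry'])) // false
```

With a single defined shape, the strict array comparison discussed earlier in the thread becomes meaningful across parsers.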