Duplicate properties

JKingweb commented 1 year ago

Currently, if a single element carries the same property two or more times, all major parsers add the property multiple times:

<div class="h-test">
  <div class="p-name p-name u-name">W</div>
</div>

{
  "items": [{
    "type": ["h-test"],
    "properties": {
      "name": [
        "W",
        "W",
        "http://example.com/W"
      ]
    }
  }],
  "rels": {},
  "rel-urls": {}
}

While it's good that implementations are consistent with each other, this is unfortunately inconsistent with other aspects of parsing:

Microformat types are deduplicated
Backcompat property collisions only result in one value (a rule implied by tests e.g. https://github.com/microformats/tests/blob/458c4c9ddf7321f6c02fd1db85622c26d3215ce9/tests/microformats-v1/hresume/work.html#L10 where both vcard fn and vevent summary map to p-name)

Do we want to deduplicate properties in v2 processing? If yes, how would collisions involving different prefixes be resolved? In my implementation I had implemented a simple ranking system with e- winning over u- winning over dt- winning over p-, before I realized other implementations did nothing at all to resolve the duplication.
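As a rough illustration of the ranking idea described above, here is a minimal Python sketch. It assumes the ordering given in the comment (e- over u- over dt- over p-). The function name `resolve_property_classes` and the rank table are hypothetical and are not taken from any existing parser.

```python
# Hypothetical sketch: the rank table mirrors the ordering described above
# (e- over u- over dt- over p-); all names here are illustrative only.
PREFIX_RANK = {"p": 0, "dt": 1, "u": 2, "e": 3}


def resolve_property_classes(class_names):
    """Keep one class per property name, preferring the highest-ranked prefix."""
    winners = {}  # property name -> winning prefix, in first-seen order
    for class_name in class_names:
        prefix, sep, name = class_name.partition("-")
        if not sep or not name or prefix not in PREFIX_RANK:
            continue  # not a property class (e.g. "h-test" or a styling class)
        current = winners.get(name)
        if current is None or PREFIX_RANK[prefix] > PREFIX_RANK[current]:
            winners[name] = prefix
    return [f"{prefix}-{name}" for name, prefix in winners.items()]


print(resolve_property_classes(["p-name", "p-name", "u-name"]))  # ['u-name']
```

Under this scheme the example at the top of the thread would yield a single "name" value, parsed as a u- property. Whether that is the desired outcome is exactly the open question.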

Either way I intend to write tests to cover this.

gRegorLove commented 1 year ago

Interesting find. I think this is somewhat intentional since properties can be multi-valued, allowing publishers to put the same property on different elements, e.g. multi-photo posts:

<div class="h-entry">
  <img src="/photo1.jpg" class="u-photo">
  <img src="/photo2.jpg" class="u-photo">
</div>

I think a case could be made for de-duplicating class names on an individual element before following the parsing rules. A case could also be made that keeping the parsed duplicate will help publishers find likely mistakes in their markup. I don't have a strong opinion currently.

De-duplication in the spec could look something like:

"parse a child element class for property class name(s) "p-*,u-*,dt-*,e-*". If any are found, normalize the list of classes, then continue parsing the element"

It would need to be precise about the normalization process:

split the classList by space character
trim whitespace and newlines around each class name
compare class names in a case-sensitive manner and remove duplicates
use the normalized classList to continue parsing properties, e.g. "parsing a p- property," etc.

That's just off the top of my head. There's probably an HTML spec to reference for more precision on this.

This would change your example into this before parsing the p-*:

<div class="h-test">
  <div class="p-name u-name">W</div>
</div>
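The normalization steps proposed above could be sketched roughly as follows in Python. This is a literal reading of those steps, not spec text, and the function name `normalize_class_list` is an assumption made for illustration.

```python
def normalize_class_list(class_attr):
    """Split, trim, and de-duplicate class names, keeping first-seen order."""
    seen = set()
    normalized = []
    for raw in class_attr.split(" "):   # split the class list by space character
        name = raw.strip()              # trim whitespace and newlines
        if not name or name in seen:    # case-sensitive duplicate check
            continue
        seen.add(name)
        normalized.append(name)
    return normalized                   # property parsing would continue from here


print(normalize_class_list("p-name p-name u-name"))  # ['p-name', 'u-name']
```

Because the comparison is case-sensitive, "p-name" and "P-name" would both survive. Note also that, read literally, class names separated only by a tab or newline would not be split apart by the first step.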

JKingweb commented 1 year ago

> I think this is somewhat intentional since properties can be multi-valued, allowing publishers to put the same property on different elements, e.g. multi-photo posts:

Yes, I meant specifically the same property repeated on the same element, not the same property repeated within one microformat (where the same property on multiple different elements is absolutely expected).

> There's probably an HTML spec to reference for more precision on this.

There is, yes. It's not complicated, though it performs no deduplication itself; the salient detail is really the reference to splitting on whitespace, which in turn references the definition of whitespace. Note that the same logic should be used in evaluating link relations (both are DOMTokenLists).
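For reference, the whitespace splitting mentioned here can be sketched in Python as below. The HTML standard defines ASCII whitespace as TAB, LF, FF, CR, and SPACE. The helper name `split_on_ascii_whitespace` is hypothetical, and the sketch deliberately keeps duplicates, since the split step described above performs no deduplication.

```python
import re

# ASCII whitespace as defined by the HTML standard: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = re.compile(r"[\t\n\f\r ]+")


def split_on_ascii_whitespace(value):
    """Return the tokens of a class or rel attribute value, duplicates kept."""
    return [token for token in ASCII_WHITESPACE.split(value) if token]


print(split_on_ascii_whitespace(" p-name\tp-name\n u-name "))
# ['p-name', 'p-name', 'u-name']

# A no-break space is not ASCII whitespace, so this stays a single token.
print(split_on_ascii_whitespace("me\u00a0authn"))
# ['me\xa0authn']
```

Python's argument-less `str.split()` is not a drop-in substitute, because it also splits on non-ASCII whitespace such as the no-break space.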

> I don't have a strong opinion currently.

I don't, either, for what it's worth. I do feel it needs to be specified one way or the other, though.

microformats / microformats2-parsing

Duplicate properties #61