microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing

Parser output for elements with property class and root class names? #51

Open jgarber623 opened 4 years ago

jgarber623 commented 4 years ago

Following up on a conversation I started in chat today, I'd like to clarify a section in the parsing spec related to generating output for parsed elements containing both property class and root class names.

The wording from section 1.2 of the parsing spec (emphasis added):

  • parse a child element for microformats (recurse)
    • if that child element itself has a microformat ("h-*" or backcompat roots) and is a property element, add it into the array of values for that property as a { } structure, add to that { } structure:
    • value:
      • if it's a p-* property element, use the first p-name of the h-* child
      • **else if it's an e-* property element, re-use its { } structure with existing value: inside.**
      • else if it's a u-* property element and the h-* child has a u-url, use the first such u-url
      • else use the parsed property value per p-*, u-*, dt-* parsing respectively
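For illustration, the value rules quoted above can be sketched in Python. This is a minimal sketch, not taken from any parser: `nested_value`, its parameters, and the input shapes are hypothetical. The e-* branch implements only one reading of the disputed rule (html and value placed directly in the structure), and falling through to the plain parse when a p-* child has no name is an assumption.

```python
def nested_value(prefix, child, parsed):
    """Pick the keys to add to the { } structure of a property element
    that is also an h-* root.

    prefix: "p", "u", "e", or "dt" (hypothetical encoding of the prefix)
    child:  the parsed h-* item, e.g. {"type": [...], "properties": {...}}
    parsed: the plain property parse of the same element; a string, or
            {"html": ..., "value": ...} for e-*
    """
    props = child.get("properties", {})
    if prefix == "p" and props.get("name"):
        # p-*: use the first p-name of the h-* child
        return {"value": props["name"][0]}
    if prefix == "e":
        # e-*: one reading of "re-use its { } structure with existing value:"
        return {"html": parsed["html"], "value": parsed["value"]}
    if prefix == "u" and props.get("url"):
        # u-*: use the first u-url of the h-* child
        return {"value": props["url"][0]}
    # otherwise: the parsed property value per p-*, u-*, dt-* parsing
    return {"value": parsed}
```

The returned keys would be merged into the child item, next to its type and properties.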

The test suite includes test cases for p-* and u-* properties (see microformats-v2/h-entry/impliedvalue-nested.html, for instance), but I couldn't find a test case for an e-* property whose element also has a root class name.

I interpret "re-use its { } structure with existing value:" to mean that the nested item's value should be set to the hash structure. That would result in something like:

"value": {
  "html": "…",
  "value": "…"
}

Current Behavior

Using a contrived markup example like:

<div class="h-entry">
  <div class="e-content h-card">
    <p class="p-name">Jason Garber</p>
  </div>
</div>

…parsers currently output results like:

{
  "items": [
    {
      "type": ["h-entry"],
      "properties": {
        "content": [
          {
            "type": ["h-card"],
            "properties": {
              "name": ["Jason Garber"]
            },
            "html": "<p class=\"p-name\">Jason Garber</p>",
            "value": "Jason Garber"
          }
        ]
      }
    }
  ]
}

Expected Behavior

Using the same markup example, and by my interpretation of the specification, I'd expect output like:

{
  "items": [
    {
      "type": ["h-entry"],
      "properties": {
        "content": [
          {
            "type": ["h-card"],
            "properties": {
              "name": ["Jason Garber"]
            },
            "value": {
              "html": "<p class=\"p-name\">Jason Garber</p>",
              "value": "Jason Garber"
            }
          }
        ]
      }
    }
  ]
}

Proposals?

Which of the above is the correct interpretation of the spec? Existing evidence from parsers and the non-authoritative microformats2-json wiki page points to the current behavior being the correct interpretation, despite the unclear wording in the spec.

Is that the consensus of the community? If so, we should find a way to re-word the spec. If not, we should find a way to re-word the spec.

Thanks for reading! Looking forward to feedback.

gRegorLove commented 4 years ago

I think the current behavior listed above results in a more consistent result for consumers, with html and value appearing in a consistent location and value always being a string.
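To make the consistency point concrete, here is a rough sketch of what a consumer would need in order to read the plain text under both shapes from the first comment. `text_value` is a hypothetical helper, not part of any parser API, and the two sample dicts are copied from the outputs above.

```python
def text_value(entry):
    """Return the plain-text value of one parsed property entry."""
    if isinstance(entry, str):
        return entry
    value = entry.get("value")
    if isinstance(value, dict):
        # Only needed if value may be an object, as in the "expected" output
        return value.get("value")
    return value


current = {
    "type": ["h-card"],
    "properties": {"name": ["Jason Garber"]},
    "html": "<p class=\"p-name\">Jason Garber</p>",
    "value": "Jason Garber",
}
expected = {
    "type": ["h-card"],
    "properties": {"name": ["Jason Garber"]},
    "value": {
        "html": "<p class=\"p-name\">Jason Garber</p>",
        "value": "Jason Garber",
    },
}

print(text_value(current), text_value(expected))
```

If value is always a string, the `isinstance(value, dict)` branch can be dropped.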

aimee-gm commented 4 years ago

So, it turns out that because this isn't included in the test suite, I managed to skip that line in the specification.

I don't want to get too involved in what the values should be (I would like to know though!), but a couple of comments:

Take the markup:

<div class="h-entry">
  <img class="u-photo h-card" alt="My name" src="/photo.jpg">
</div>

Looking at the specification:

else if it's a u-* property element and the h-* child has a u-url, use the first such u-url

The photo above doesn't have a url property, so it falls back to the photo property from:

else use the parsed property value per p-*, u-*, dt-* parsing respectively

As it has no nested u-photo, it becomes an implied photo, whose value comes from:

if img.h-x[src], then use the result of "parse an img element for src and alt" (see Sec.1.5) for photo

Which means it should be: { value: "...", alt: "..." }. This then becomes the complete value of the h-card based on the above specification.

Expected output

{
  "type": ["h-entry"],
  "properties": {
    "photo": [
      {
        "type": ["h-card"],
        "properties": {
          "name": ["My name"],
          "photo": [
            { "alt": "My name", "value": "http://example.com/photo.jpg" }
          ]
        },
        "value": {
          "alt": "My name",
          "value": "http://example.com/photo.jpg"
        }
      }
    ]
  }
}

The PHP parser at microformats.io doesn't parse the alt at any level here, I believe incorrectly, so I've omitted its output.

Again, the contents of value would no longer be a string. How should these be handled?

The way I've decided to interpret this is to take the value out of the nested property.
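A rough sketch of that interpretation for the u-* case, assuming "take the value out" means unwrapping a { value, alt } object down to its string. `u_value_for_nested` and its arguments are hypothetical names, not code from any parser.

```python
def u_value_for_nested(child, parsed):
    """Pick a string value for a u-* property element that is also an h-* root.

    child:  the parsed h-* item
    parsed: the plain u-* parse of the element; a string, or
            {"value": ..., "alt": ...} for an img with alt
    """
    urls = child.get("properties", {}).get("url", [])
    # First u-url if the h-* child has one, else the parsed u-* value
    chosen = urls[0] if urls else parsed
    if isinstance(chosen, dict):
        # Assumed reading: keep only the string, so value stays a string
        return chosen["value"]
    return chosen


photo = {"alt": "My name", "value": "http://example.com/photo.jpg"}
card = {"type": ["h-card"], "properties": {"name": ["My name"], "photo": [photo]}}
print(u_value_for_nested(card, photo))
```

With this reading the alt text is still available under the nested photo property, but not in the top-level value.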

gRegorLove commented 4 years ago

"The root element will now have a html property - this is described no-where in the specification so cannot be expected to be there."

I'm not sure I understand this part. What do you mean by root element? I would expect the parsed content property to have an html property in both cases.

In the common e-content example:

<div class="h-entry">
<div class="e-content"><p>This is the content</p></div>
</div>

The parsed result is:

"items": [
    {
        "type": [
            "h-entry"
        ],
        "properties": {
            "content": [
                {
                    "html": "<p>This is the content</p>",
                    "value": "This is the content"
                }
            ]
        }
    }
]

Adding a nested h-card:

<div class="h-entry">
<div class="e-content h-card"><p>This is the content</p></div>
</div>

I would expect the parse to be:

"items": [
    {
        "type": [
            "h-entry"
        ],
        "properties": {
            "content": [
                {
                    "type": [
                        "h-card"
                    ],
                    "properties": {
                        "name": [
                            "This is the content"
                        ]
                    },
                    "html": "<p>This is the content</p>",
                    "value": "This is the content"
                }
            ]
        }
    }
]

Images are a special case where if there's an alt, the parsed result will be an object, otherwise a string. (alt parsing is in php-mf2 master branch and hopefully will be in a new release soon.)
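Since an image value can be either a string or an object, a consumer might normalize it before use. A minimal sketch, with `photo_parts` as a hypothetical helper name and the object keys taken from the { value, alt } shape discussed earlier in the thread:

```python
def photo_parts(photo):
    """Return (url, alt) for a parsed image value.

    photo is a plain URL string when the img has no alt, or
    {"value": url, "alt": text} when it does.
    """
    if isinstance(photo, dict):
        return photo.get("value"), photo.get("alt")
    return photo, None


print(photo_parts("http://example.com/photo.jpg"))
print(photo_parts({"value": "http://example.com/photo.jpg", "alt": "My name"}))
```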