microformats / php-mf2

php-mf2 is a pure, generic microformats-2 parser for PHP. It makes HTML as easy to consume as JSON.
Creative Commons Zero v1.0 Universal
193 stars 38 forks source link

properties with invalid names are not properly ignored #249

Closed sknebel closed 8 months ago

sknebel commented 1 year ago

in v0.5.0, the following example (distilled down from https://catgirlin.space/ as of right now)

<div class=" h-feed pt-2">
  <div class="p-4 h-entry">
    <a class="u-url" href="https://example.com/bar">bar </a>
  </div>
  <div class="p-4 h-entry">
    <a class="u-url" href="https://example.com/foo">foo</a> 
  </div>
</div>

parses as

{
    "items": [
        {
            "type": [ "h-feed"],
            "properties": [
                [
                    {
                        "type": ["h-entry" ],
                        "properties": {
                            "url": [ "https://example.com/bar"],
                            "name": [ "bar" ]
                        },
                        "value": "bar"
                    },
                    {
                        "type": [ "h-entry"],
                        "properties": {
                            "url": ["https://example.com/foo" ],
                            "name": ["foo"]
                        },
                        "value": "foo"
                    }
                ]
            ]
        }
    ],
…

Note how properties of the h-feed has been turned into two nested lists somehow, which is completely invalid output.

Potentially related, when nesting the two h-entries, somehow a property named 0 gets generated, which is also badly wrong (note that no 0 appears in the input)

<div class=" h-feed pt-2">
  <div class="p-4 h-entry">
    <a class="u-url" href="https://example.com/bar">bar </a>
    <div class="p-4 h-entry">
      <a class="u-url" href="https://example.com/foo">foo</a>   
    </div>
  <div>
</div>

parsed as

{
    "items": [
        {
            "type": [ "h-feed"],
            "properties": [
                [
                    {
                        "type": ["h-entry"],
                        "properties": {
                            "url": ["https://example.com/bar"],
                            "0": [
                                {
                                    "type": ["h-entry"],
                                    "properties": {
                                        "url": ["https://example.com/foo" ],
                                        "name": [ "foo" ]
                                    },
                                    "value": "foo"
                                }
                            ]
                        },
                        "value": "bar foo"
                    }
                ]
            ]
        }
    ],

Maybe related to #246

aaronpk commented 1 year ago

I don't think we have any examples of microformats vocabularies where the value is only a number, and I thought that actually wasn't allowed in the mf2 spec, but now I can't confirm that. In any case, maybe we can update the spec to ignore p-* where * is only [0-9]? I guess this is an issue for the parsing spec not this library, but it might be a simpler way to "fix" the library.

gRegorLove commented 1 year ago

I believe it's already ignoring class names with numbers. I'll have to take a deeper look soon, but my guess is it's parsing the nested h-entry and trying to add it as a property like: ['property-name' => 'nested-microformat'], but since the property name is ignored it results in a weird structure like [null => 'nested-microformat'], which I think PHP would coerce into [0 => 'nested-microformat'] -- instead of adding it to the parsed children

sknebel commented 1 year ago

@aaronpk

I guess this is an issue for the parsing spec not this library,

The parsing spec already clearly makes such names invalid, hence it is a parser issue to not silently ignore such classes

aaronpk commented 1 year ago

Oh I misread this: https://microformats.org/wiki/microformats2-parsing#parse_an_element_for_class_microformats

The "*" for root (and property) class names consists of an optional vendor prefix (series of 1+ number or lowercase a-z characters i.e. [0-9a-z]+, followed by '-'), then one or more '-' separated lowercase a-z words.

So a vendor prefix can be a number, so p-1-a would be valid, but p-1 is not.

gRegorLove commented 1 year ago

I think I have a working fix for this. Running some more tests and will submit a PR.