microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing

Explicitly remove surrounding spaces in parsed `u-*` values #48

Open aaronpk opened 4 years ago

aaronpk commented 4 years ago

The PHP, Ruby, and Python parsers currently handle spaces around `u-*` values inconsistently. The PHP and Ruby parsers remove surrounding spaces from the value returned for `u-*` properties, but the Python parser does not.

Given this HTML:

    <div class="h-card">
      <a href="  https://example.com/  " class="u-url p-name">Test</a>
    </div>

PHP:

    {
      "type": [
        "h-card"
      ],
      "properties": {
        "name": [
          "Test"
        ],
        "url": [
          "https://example.com/"
        ]
      }
    }

Ruby:

    {
      "type": [
        "h-card"
      ],
      "properties": {
        "url": [
          "https://example.com/"
        ],
        "name": [
          "Test"
        ]
      }
    }

Python:

    {
      "type": [
        "h-card"
      ],
      "properties": {
        "name": [
          "Test"
        ],
        "url": [
          "  https://example.com/  "
        ]
      }
    }

The HTML spec says:

The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.

Since the Microformats parser is trying to return a URL value, it seems like removing the spaces is the correct behavior, even though that is not currently in the Microformats spec, which just says:

if a.u-x[href] or area.u-x[href] or link.u-x[href], then get the href attribute

http://microformats.org/wiki/microformats2-parsing#parsing_a_u-_property

I would like to propose a spec change to make it explicit that the parser should remove any surrounding spaces from the href attribute.

if a.u-x[href] or area.u-x[href] or link.u-x[href], then get the href attribute after removing all leading/trailing space characters
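
To illustrate the proposed wording, here is a minimal Python sketch of the trimming step. It is not taken from any of the parsers mentioned above. The function name `get_u_href` and the attribute dict are hypothetical, and it assumes "space characters" means the ASCII whitespace set (tab, LF, FF, CR, space).

```python
# Assumption: "space characters" = ASCII whitespace (tab, LF, FF, CR, space).
ASCII_WHITESPACE = "\t\n\f\r "


def get_u_href(attrs):
    """Return the href value with surrounding spaces removed.

    `attrs` is a hypothetical dict of attribute names to raw values.
    Returns None when there is no href attribute.
    """
    href = attrs.get("href")
    if href is None:
        return None
    # Strip only leading/trailing characters; inner characters are untouched.
    return href.strip(ASCII_WHITESPACE)


print(get_u_href({"href": "  https://example.com/  ", "class": "u-url p-name"}))
```

Passing an explicit character set to `strip()` is a deliberate choice in this sketch: a bare `str.strip()` would also remove other Unicode whitespace, which may be broader than what the spec change intends.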

sknebel commented 4 years ago

Same applies to <img src=, <video src=, …
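
A rough sketch of how the same trimming could cover more than `href`, assuming a simple lookup from element name to URL attribute. The table only lists the elements named in this thread and is not meant to be complete. `URL_ATTRIBUTES` and `get_u_value` are hypothetical names, and the whitespace set is the same assumption as above (tab, LF, FF, CR, space).

```python
# Assumption: "space characters" = ASCII whitespace (tab, LF, FF, CR, space).
ASCII_WHITESPACE = "\t\n\f\r "

# Only the element/attribute pairs mentioned in this thread; not exhaustive.
URL_ATTRIBUTES = {
    "a": "href",
    "area": "href",
    "link": "href",
    "img": "src",
    "video": "src",
}


def get_u_value(tag, attrs):
    """Return the trimmed URL attribute for `tag`, or None if not applicable.

    `tag` is a lowercase element name, `attrs` a dict of raw attribute values.
    """
    attr_name = URL_ATTRIBUTES.get(tag)
    if attr_name is None or attr_name not in attrs:
        return None
    return attrs[attr_name].strip(ASCII_WHITESPACE)


print(get_u_value("img", {"src": " /photo.jpg ", "class": "u-photo"}))
```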

(Originally published at: https://www.svenknebel.de/posts/2020/3/2/)

gRegorLove commented 4 years ago

Related discussion about what the mf2 spec means by "normalized": https://github.com/microformats/microformats2-parsing/issues/9

I'm +1 for trimming the whitespace, though the spec change might need to be in the last bullet point ("return the normalized absolute URL...") to ensure it applies to all cases.

willnorris commented 4 years ago

+1 from me. I don't recall what the Go library does in this regard, but I'm happy to update it to match this spec change.

jgarber623 commented 4 years ago

+1 to @gRegorLove's note. I think the last bullet in the "parsing a u-* property" section should be updated:

return the normalized absolute URL of the gotten value, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).

…and/or whitespace stripping is implied in the existing text? I'd rather we be explicit, though.
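
For what it's worth, here is a small Python sketch of what being explicit in that last bullet could look like: trim first, then resolve against the base URL. This is only an illustration under assumptions. `finalize_u_value` is a hypothetical name, `base_url` is assumed to have already been determined from the page URL and any `<base>` element, and "spaces" is assumed to mean ASCII whitespace.

```python
from urllib.parse import urljoin

# Assumption: "space characters" = ASCII whitespace (tab, LF, FF, CR, space).
ASCII_WHITESPACE = "\t\n\f\r "


def finalize_u_value(value, base_url):
    """Hypothetical final step for any gotten u-* value.

    Trims surrounding spaces, then resolves the result against `base_url`,
    so the trimming applies no matter which earlier rule produced `value`.
    """
    trimmed = value.strip(ASCII_WHITESPACE)
    return urljoin(base_url, trimmed)


print(finalize_u_value("  https://example.com/  ", "https://example.org/page"))
print(finalize_u_value("  /about  ", "https://example.com/blog/post"))
```

Doing the trim in one place at the end means each element-specific rule (`href`, `src`, and so on) does not need its own wording about spaces.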