microformats / microformats-parser

A JavaScript microformats parser for the browser and node.js
https://microformats.github.io/microformats-parser/
MIT License
55 stars 11 forks source link

Collapse space when implying `value` from `textContents` #51

Closed njkleiner closed 4 years ago

njkleiner commented 4 years ago

Describe the bug

When implying the value property for a nested microformat (e.g., h-adr inside h-entry) from the HTML textContents, multiple successive whitespace characters should be collapsed to a single space character.

To Reproduce

HTML input:

<div class="h-entry">
    <span class="p-location h-adr">
        <span class="p-locality">Berlin</span>,
        <span class="p-region">Berlin</span>,
        <span class="p-country-name">DE</span>
        <data class="p-latitude" value="52.518606"></data>
        <data class="p-longitude" value="13.376127"></data>
    </span>
</div>

Expected behavior

Correct JSON output:

{
    "items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "location": [
                    {
                        "type": [
                            "h-adr"
                        ],
                        "properties": {
                            "locality": [
                                "Berlin"
                            ],
                            "region": [
                                "Berlin"
                            ],
                            "country-name": [
                                "DE"
                            ],
                            "latitude": [
                                "52.518606"
                            ],
                            "longitude": [
                                "13.376127"
                            ]
                        },
                        "value": "Berlin, Berlin, DE"
                    }
                ]
            }
        }
    ],
    "rels": {},
    "rel-urls": {},
}

Actual JSON output:

{
  "rels": {},
  "rel-urls": {},
  "items": [
    {
      "type": [
        "h-entry"
      ],
      "properties": {
        "location": [
          {
            "type": [
              "h-adr"
            ],
            "properties": {
              "locality": [
                "Berlin"
              ],
              "region": [
                "Berlin"
              ],
              "country-name": [
                "DE"
              ],
              "latitude": [
                "52.518606"
              ],
              "longitude": [
                "13.376127"
              ]
            },
            "value": "Berlin,\n        Berlin,\n        DE"
          }
        ]
      }
    }
  ]
}

Note the difference Berlin, Berlin, DE vs. Berlin,\n Berlin,\n DE.

Additional context

From what I can tell, this is not actually part of the specification, it seems to be commonly accepted though, as both the PHP parser and the Python parser do this.

aimee-gm commented 4 years ago

Implementing this change causes several failures with the microformats test suite (https://github.com/microformats/tests/tree/master/tests/microformats-v2).

These tests also fail using the PHP parser.

I would like to use the microformats test suite as a "source of truth" for how a parser should work. I don't think this is a bug, but a behaviour change. It could be implemented behind an experimental toggle?

To change this parsing behaviour without an experimental toggle, this will either need to change the parsing specification or the test suite.

aimee-gm commented 4 years ago

The PHP parser has an open PR to use the test suite, I would be interested in what changes are being made with the way they parse these test scenarios, if they do?

njkleiner commented 4 years ago

Implementing this change causes several failures with the microformats test suite (microformats/tests:tests/microformats-v2@master).

I didn't realize that was the case. That's interesting considering that this behavior seems to be somewhat common.

I would like to use the microformats test suite as a "source of truth" for how a parser should work.

Definitely makes sense.

To change this parsing behaviour without an experimental toggle, this will either need to change the parsing specification or the test suite.

I've gone ahead and created a quick overview of how it's implemented across some projects. I think the pattern there is quite interesting.

Name Uses test suite Implements collapse whitespace
php-mf2 ✔️ (default)
mf2py ✔️ (default)
microformat-shiv ✔️ (feature)
micromicro ✔️
microformats-parser ✔️ ?

Given that it's the default behavior in some parsers and it's arguably useful, I think we should implement it (behind a feature flag, for now).

Also, I think there's definitely a discussion to be had about how this fits in with the official specification.

Zegnat commented 4 years ago

Background: we found that user expectations did not really match the parsing spec in all cases (e.g. when it comes to consequtive whitespace). This is being discussed as a spec issue.

PHP and Python both implement a version of an algorithm I wrote out. I am saying “a version of” as I am not actually completely sure on the details anymore and would not want to claim they match completely. (For even more complexity, there has been a try to find out what is needed to match browser specs more closely. Again in PHP and Python.)

The PHP parser has an open PR to use the test suite, I would be interested in what changes are being made with the way they parse these test scenarios, if they do?

I am cheating.

When running the tests from the test suite I default to the text logic that we had before the new whitespace patch landed. See commit https://github.com/microformats/php-mf2/pull/163/commits/4d46586af1dda763dc067f1d6e2c2b650615f674. This basically reverts a commit made in March 2018, but only for the purpose of running the tests.

I hope that clears up some questions!

aimee-gm commented 4 years ago

@njkleiner thank you for the comprehensive comparison of parsers for this issue :slightly_smiling_face: it's very helpful!

I have opened a draft pull (#52) request to add an experimental option to enable this.

At present, it only collapses whitespace in properties and values (it does not apply to rels, but I haven't though about if it should handle these yet), and does not do any of the whitespace algorithm described by @Zegnat - although I think this would be the way to go with this experimental option.

aimee-gm commented 4 years ago

@njkleiner with v1.4.0 there's now support for the textContent experimental flag that implements the improved text content handling.

I am considering how we can enable some of these experimental options, perhaps by default at some point.