Closed njkleiner closed 4 years ago
Implementing this change causes several failures with the microformats test suite (https://github.com/microformats/tests/tree/master/tests/microformats-v2).
These tests also fail using the PHP parser.
I would like to use the microformats test suite as a "source of truth" for how a parser should work. I don't think this is a bug, but a behaviour change. It could be implemented behind an experimental toggle?
To change this parsing behaviour without an experimental toggle, this will either need to change the parsing specification or the test suite.
The PHP parser has an open PR to use the test suite, I would be interested in what changes are being made with the way they parse these test scenarios, if they do?
> Implementing this change causes several failures with the microformats test suite (microformats/tests:tests/microformats-v2@master).
I didn't realize that was the case. That's interesting considering that this behavior seems to be somewhat common.
> I would like to use the microformats test suite as a "source of truth" for how a parser should work.
Definitely makes sense.
> To change this parsing behaviour without an experimental toggle, this will either need to change the parsing specification or the test suite.
I've gone ahead and created a quick overview of how it's implemented across some projects. I think the pattern there is quite interesting.
Name | Uses test suite | Implements collapse whitespace |
---|---|---|
php-mf2 | ❌ | ✔️ (default) |
mf2py | ❌ | ✔️ (default) |
microformat-shiv | ❌ | ✔️ (feature) |
micromicro | ✔️ | ❌ |
microformats-parser | ✔️ | ? |
Given that it's the default behavior in some parsers and it's arguably useful, I think we should implement it (behind a feature flag, for now).
Also, I think there's definitely a discussion to be had about how this fits in with the official specification.
Background: we found that user expectations did not really match the parsing spec in all cases (e.g. when it comes to consecutive whitespace). This is being discussed as a spec issue.
PHP and Python both implement a version of an algorithm I wrote out. I am saying “a version of” as I am no longer completely sure of the details and would not want to claim they match completely. (For even more complexity, there has been an attempt to find out what is needed to match browser specs more closely. Again in PHP and Python.)
> The PHP parser has an open PR to use the test suite, I would be interested in what changes are being made with the way they parse these test scenarios, if they do?
I am cheating.
When running the tests from the test suite I default to the text logic that we had before the new whitespace patch landed. See commit https://github.com/microformats/php-mf2/pull/163/commits/4d46586af1dda763dc067f1d6e2c2b650615f674. This basically reverts a commit made in March 2018, but only for the purpose of running the tests.
I hope that clears up some questions!
@njkleiner thank you for the comprehensive comparison of parsers for this issue :slightly_smiling_face: it's very helpful!
I have opened a draft pull request (#52) to add an experimental option to enable this.
At present, it only collapses whitespace in properties and values (it does not apply to rels, but I haven't thought about whether it should handle these yet), and does not implement any of the whitespace algorithm described by @Zegnat, although I think this would be the way to go with this experimental option.
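For illustration, an opt-in toggle of this kind could look roughly like the TypeScript sketch below. The option name `collapseWhitespace`, the `TextOptions` shape, and the `textValue` helper are all hypothetical and are not taken from the pull request. The sketch also assumes the default path only trims the text.

```typescript
// Hypothetical option shape; the names here are illustrative only
// and do not reflect the parser's actual API.
interface TextOptions {
  experimental?: { collapseWhitespace?: boolean };
}

// Returns the text value for a property.
// Assumption: without the flag, the text is only trimmed (existing behaviour).
function textValue(raw: string, options: TextOptions = {}): string {
  const trimmed = raw.trim();
  if (options.experimental?.collapseWhitespace !== true) {
    return trimmed;
  }
  // Opt-in path: collapse every run of whitespace into a single space.
  return trimmed.replace(/\s+/g, " ");
}
```

Keeping the collapse behind an explicit flag means the default output stays compatible with the test suite, while users who want the collapsed form can enable it.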
@njkleiner with v1.4.0 there's now support for the `textContent` experimental flag that implements the improved text content handling.
I am considering how we can enable some of these experimental options, perhaps by default at some point.
Describe the bug
When implying the `value` property for a nested microformat (e.g., `h-adr` inside `h-entry`) from the HTML `textContents`, multiple successive whitespace characters should be collapsed to a single space character.
To Reproduce
HTML input:
Expected behavior
Correct JSON output:
Actual JSON output:
Note the difference `Berlin, Berlin, DE` vs. `Berlin,\n Berlin,\n DE`.
Additional context
From what I can tell, this is not actually part of the specification. It seems to be commonly accepted, though, as both the PHP parser and the Python parser do this.
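As a minimal sketch of the expected behaviour, the collapse can be approximated with a single regular expression. The helper name `collapseWhitespace` is hypothetical, and this is a simple approximation rather than the algorithm the PHP and Python parsers implement.

```typescript
// Hypothetical helper, not part of any parser's API.
// Replaces each run of whitespace (spaces, tabs, newlines) with one space
// and trims leading and trailing whitespace.
function collapseWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// The implied value from the example above, with real newlines and indentation.
const implied = collapseWhitespace("Berlin,\n      Berlin,\n      DE");
console.log(JSON.stringify(implied)); // "Berlin, Berlin, DE"
```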