microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing

should dt-* parsing do date and time parsing for all values? #12

Open tantek opened 7 years ago

tantek commented 7 years ago

Currently in http://microformats.org/wiki/microformats2-parsing#parsing_a_dt-_property special date and time parsing is only done as part of step one for VCP handling.

The proposal is to move the "date and time parsing rules" mentioned in step 1 (extracting them from VCP and inlining them into mf2 parsing) so that they run after all value retrieval is done, just before a value is returned.

This would be a larger fix that should also incorporate accepting the proposals in issues #4 and #8.

I don't have a specific real world example for this particular proposal, thus the issue title is a question. All feedback welcome, and especially real world examples that would be helped by this beyond the smaller fixes noted in #4 and #8.

Feedback explicitly requested from: @sknebel @gRegorLove @Zegnat. Thanks!

tantek commented 7 years ago

We can also leave this open longer, and just move forward with #4 and/or #8 until we have more evidence or consensus one way or the other.

Zegnat commented 7 years ago

My answer to the question in the title would be Yes.

I feel like dt-* handling should describe how a string gets turned into a datetime stamp, no matter where the string is coming from (textContent, attribute, VCP, …). I also think this would give parsers an easier job.

As I wrote in #8 and on IRC (emphasis added that is implicit to this issue):

is there some way we can generalise a vcp-to-string algo for dt- and then generalise a string-to-valid-timestamp algo that works on the string value of the dt-, then it no longer matters if that string value was obtained through regular parsing or through vcp.
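
A minimal sketch of the "string-to-valid-timestamp" half of that idea, assuming a lenient grammar of optional date, optional time and optional UTC offset. The names `to_timestamp` and `_offset`, the regular expression, and the choice of a space separator with a colon-free offset are all assumptions for illustration; none of them come from the parsing spec.

```python
import re

# Lenient, hypothetical grammar: optional date, optional time, optional offset.
_DT = re.compile(
    r"^\s*(?P<date>\d{4}-\d{2}-\d{2})?"
    r"[T\s]*"
    r"(?P<time>\d{1,2}:\d{2}(?::\d{2})?)?"
    r"\s*(?P<tz>Z|[+-]\d{1,2}(?::?\d{2})?)?\s*$",
    re.IGNORECASE,
)


def _offset(tz: str) -> str:
    """Turn Z, -5:00, -05:00 or -0500 into a fixed ±HHMM form."""
    if tz.upper() == "Z":
        return "+0000"
    sign, hours, minutes = re.match(r"([+-])(\d{1,2}):?(\d{2})?$", tz).groups()
    return f"{sign}{int(hours):02d}{minutes or '00'}"


def to_timestamp(value: str) -> str:
    """Normalize a dt-* string, whatever source it was retrieved from."""
    m = _DT.match(value)
    if not m or not (m["date"] or m["time"]):
        return value  # unrecognised: hand the string back untouched
    if m["tz"] and not m["time"]:
        return value  # an offset without a time is not handled in this sketch
    out = m["date"] or ""
    if m["time"]:
        hour, rest = m["time"].split(":", 1)
        out = f"{out} {int(hour):02d}:{rest}".strip()
        if m["tz"]:
            out += _offset(m["tz"])
    return out
```

Because the function only sees a string, the same call would serve a `datetime` attribute value, plain text content, or a string assembled by VCP.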

gRegorLove commented 6 years ago

Here is a real-world example we ran into today on http://indieweb.org/events.

<span class="h-event vevent">
    <span class="dt-start dtstart">
        <span class="value" title="August 1, 2018">2018-08-01</span>
        <span class="value" title="20:30">20:30<span style="display: none;">-5:00</span></span>
    </span>–
    <span class="dt-end dtend">22:00<span style="display: none;">-5:00</span></span> (-5:00 <abbr>UTC</abbr>):
<span class="p-content">An informal online get together for people new to blogging, building websites, or using IndieWeb plugins on WordPress.</span>
</span>

php-mf2 parse:

{
    "items": [
        {
            "type": [
                "h-event"
            ],
            "properties": {
                "content": [
                    "An informal online get together for people new to blogging, building websites, or using IndieWeb plugins on WordPress."
                ],
                "start": [
                    "2018-08-01 20:30-0500"
                ],
                "end": [
                    "22:00-5:00-0500"
                ]
            }
        }
    ],
    "rels": {},
    "rel-urls": {},
    "debug": {
        "package": "https://packagist.org/packages/mf2/mf2",
        "source": "https://github.com/indieweb/php-mf2",
        "version": "v0.4.5",
        "note": [
            "This output was generated from the php-mf2 library available at https://github.com/indieweb/php-mf2",
            "Please file any issues with the parser at https://github.com/indieweb/php-mf2/issues",
            "Using the Masterminds HTML5 parser"
        ]
    }
}

mf2py parse:

{
    "rels": {}, 
    "items": [
        {
            "type": [
                "h-event"
            ], 
            "properties": {
                "content": [
                    "An informal online get together for people new to blogging, building websites, or using IndieWeb plugins on WordPress."
                ], 
                "start": [
                    "2018-08-01"
                ], 
                "end": [
                    "22:00-5:00"
                ]
            }
        }
    ], 
    "rel-urls": {}, 
    "debug": {
        "source": "https://github.com/microformats/mf2py", 
        "version": "1.1.1", 
        "markup parser": "html5lib", 
        "description": "mf2py - microformats2 parser for python"
    }
}
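
To make the difference between the two results easier to see, here is a small sketch that checks each emitted value against an assumed "normalized" shape. The accepted shapes (date only, date plus time, or time only, with an optional colon-free ±HHMM offset) and the name `is_normalized` are assumptions for illustration, not something the spec currently defines.

```python
import re

# Assumed normalized shapes (hypothetical, not taken from the spec):
#   YYYY-MM-DD
#   YYYY-MM-DD HH:MM[:SS][±HHMM]
#   HH:MM[:SS][±HHMM]
_TIME = r"\d{2}:\d{2}(?::\d{2})?(?:[+-]\d{4})?"
_NORMALIZED = re.compile(rf"(?:\d{{4}}-\d{{2}}-\d{{2}}(?: {_TIME})?|{_TIME})")


def is_normalized(value: str) -> bool:
    """True if the whole string matches one of the assumed shapes."""
    return _NORMALIZED.fullmatch(value) is not None


# The four dt-* values from the two parses above.
outputs = {
    "php-mf2 start": "2018-08-01 20:30-0500",
    "php-mf2 end": "22:00-5:00-0500",
    "mf2py start": "2018-08-01",
    "mf2py end": "22:00-5:00",
}
report = {label: is_normalized(value) for label, value in outputs.items()}
```

Under these assumptions both `end` values fail the check. The mf2py `start` value passes only because a bare date is well-formed, even though the time from the second "value" element is missing from it.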

sknebel commented 5 years ago

I guess this makes sense. VCP and the HTML rules for the datetime attribute of the <time> element are probably good starting points of syntax to accept, with the latter maybe being the output format too?

jalcine commented 2 years ago

After having https://github.com/microformats/tests/issues/29 confirmed and resolved, the lack of this being in the standard is the only thing preventing the Rust parser from being fully compliant, thus enabling this: https://github.com/microformats/microformats2-parsing/issues/12#issuecomment-331626987

(Originally published at: https://jacky.wtf/2022/6/yy8Z)

gRegorLove commented 2 years ago

I found some more edge cases that this spec update should cover:

  • if the value has a specific ISO8601 date, time, and timezone, use those and stop looking for "value" elements.

<div class="h-event">
  <span class="dt-start">
    <span class="value">2022-07-05T17:30-08:00</span>
  </span>
</div>

This "value" is used as-is, no normalization to remove "T" or the colon in timezone offset:

{
 "items": [
  {
   "type": [
    "h-event"
   ], 
   "properties": {
    "start": [
     "2022-07-05T17:30-08:00"
    ], 
    "name": [
     "2022-07-05T17:30-08:00"
    ]
   }
  }
 ]
}

Similarly for:

  • if the value has both a specific ISO8601 date and time, use those

<div class="h-event">
  <span class="dt-start">
    <span class="value">2022-07-05T17:30</span>
  </span>
</div>

{
 "items": [
  {
   "type": [
    "h-event"
   ], 
   "properties": {
    "start": [
     "2022-07-05T17:30"
    ], 
    "name": [
     "2022-07-05T17:30"
    ]
   }
  }
 ]
}
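
The two edge cases above could be sketched as an early exit while scanning "value" elements. This is a simplified illustration, not the VCP algorithm itself: the name `collect_vcp` and the patterns are assumptions, and separate timezone fragments and am/pm times are left out.

```python
import re

_OFFSET = r"(?:Z|[+-]\d{1,2}(?::?\d{2})?)?"
_CLOCK = r"\d{1,2}:\d{2}(?::\d{2})?"
# A complete value: date and time, with or without an offset.
_FULL = re.compile(rf"^\d{{4}}-\d{{2}}-\d{{2}}[T ]{_CLOCK}{_OFFSET}$")
_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
_TIME = re.compile(rf"^{_CLOCK}{_OFFSET}$")


def collect_vcp(fragments):
    """Combine the text of "value" elements into one dt-* string."""
    date = time = None
    for raw in fragments:
        text = raw.strip()
        if _FULL.match(text):
            return text  # complete value: use it and stop looking
        if date is None and _DATE.match(text):
            date = text
        elif time is None and _TIME.match(text):
            time = text
    return " ".join(part for part in (date, time) if part)
```

The string returned here is not normalized. Whether the "T" and the colon in the offset are rewritten would be decided by a later step applied to every dt-* value.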

jalcine commented 11 months ago

I mentioned before that this was an upstream blocker to getting the Rust library fully compatible. That has changed, but normalization would simplify parsing (and testing) date values, so I'm throwing my vote in favor of it, and I'm curious to hear if anyone else is in favor as well.

(Originally published at: https://jacky.wtf/2023/10/evyZ)

JKingweb commented 11 months ago

I'm also in favour of normalizing date values everywhere, be it VCP or not. Parsers already have to perform normalization sometimes, so it adds no appreciable complexity to parsers, while simplifying things for consumers of the output of parsers.

My own parser already does this by default, for what it's worth.

JKingweb commented 11 months ago

While we're at it, it might be worthwhile to drop : from time zones and transform Z to +0000 so that downstream consumers only have to deal with five formats in the JSON:

It's a pretty straightforward application of Postel's law, with no information lost, and no new formats added.
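
A minimal sketch of that offset rewrite, assuming it is applied to an already assembled timestamp string. The name `normalize_offset` and the rule that an offset only counts when it directly follows a minutes or seconds field are assumptions for illustration.

```python
import re

# A trailing Z or ±HH[:]MM, but only directly after ":MM" or ":SS",
# so the "-08-01" tail of a bare date is never mistaken for an offset.
_TZ_SUFFIX = re.compile(r"(?<=:\d\d)(?:(Z)|([+-])(\d\d):?(\d\d))$")


def normalize_offset(timestamp: str) -> str:
    """Rewrite a trailing Z as +0000 and ±HH:MM as ±HHMM; leave the rest alone."""
    def repl(match):
        if match.group(1):  # the literal Z
            return "+0000"
        return f"{match.group(2)}{match.group(3)}{match.group(4)}"

    return _TZ_SUFFIX.sub(repl, timestamp, count=1)
```

Strings without a time, or without an offset, pass through unchanged, so no information is lost.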