Pattern string formats for parsing dates/numbers/durations

JeniT commented 9 years ago

What pattern string formats should we use. There are pattern string formats defined in Unicode TR35.

JeniT commented 9 years ago

I'd like to propose that we drop the requirement to define pattern strings for parsing values; the pattern strings in Unicode TR35 (eg for dates) are extremely complicated and it's a large implementation burden to support the full set of locales.

danbri commented 9 years ago

I support the proposal to drop this as a requirement. We could include an informational reference to the relevant standards, as a way to encourage toolmakers to add non-standard but useful extras.

JeniT commented 9 years ago

To help resolve this, @6a6d74 will look at ISO8601 format strings to see if they're simpler than TR35. I will look at what it would take to properly support parsing numbers.

6a6d74 commented 9 years ago

As suspected, ISO 8601:2004 (the most recent version) does not cater for variations in formatting the way we need things. It's strict about the way the date-time is formatted (albeit allowing variations like 'week number', 'day in year' etc. alongside the normal year-month-day representation).

TR35 looks more complex than I think is warranted.

However, the [“XPath and XQuery Functions and Operators 3.0” 9.8 Formatting dates and times][http://www.w3.org/TR/xpath-functions-30/#formatting-dates-and-times] looks to cover what we need whilst not being super complex. Of course - we're interested in converting from the locale-dependent number strings to xsd date-time (and gYear, gMonth etc.), which is the opposite direction to the functions in XPath. That said, the rules are very clearly defined & should be reversible.

(Also see [4.7 Formatting numbers][http://www.w3.org/TR/xpath-functions-30/#formatting-numbers])

I think that it would be possible to express a testable set of functions for dealing with picture strings as defined for xpath 3.0 - especially if we assert things like "The ISO (ISO 8601) calendar must be supported. Support for other calendars may be provided, in which case conversion between different calendars is implementation defined".

There are a few other challenges to deal with ... I'll try to write some testable rules.

gkellogg commented 9 years ago

This relates to a recent thread in public-vocals which I commented on [1]. The Microdata to RDF spec looks for specific XSD patterns, but ISO 8601 allows a wider range of formats, but still not as free formed as most spreadsheet content.

For those of strong stomach, here's the (ruby extended) regular expression I use to match many 8601 patterns:

ISO_8601 =  %r(^
  # Year
  ([\+-]?\d{4}(?!\d{2}\b))
  # Month
  ((-?)((0[1-9]|1[0-2])
        (\3([12]\d|0[1-9]|3[01]))?
      | W([0-4]\d|5[0-2])(-?[1-7])?
      | (00[1-9]|0[1-9]\d|[12]\d{2}|3([0-5]\d|6[1-6])))
      ([T\s]((([01]\d|2[0-3])((:?)[0-5]\d)?|24\:?00)
             ([\.,]\d+(?!:))?)?
            (\17[0-5]\d([\.,]\d+)?)?
            ([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?
      )?
  )?
$)x.freeze

gkellogg commented 9 years ago

Aside: I often consider regular expressions a "write only" syntax, as you can create them, but trying to understand one by reading it is a job best left to computers! :)

danbri commented 9 years ago

discussion: http://www.w3.org/2014/11/19-csvw-irc

danbri commented 9 years ago

My proposal based on chat w/ jtandy + ivan is:

a hash structure of shortnames / values where shortname is identifier of a picture string language.
we document a convention that encourages publishers of non-8601-based datetime columns to also add a parallel column with the dates converted into simple 8601 representation. And metadata structure to make this clear.

gkellogg commented 9 years ago

+1

iherman commented 9 years ago

To complement these, what we thought of proposing is:

the hash structure for picture strings MAY be added to the metadata where the keys identify a programming environment (Java, Python, Ruby, JavaScript, etc.) in the absence of it the literal is just copied to the output verbatim (meaning that if it is not in correct ISO date format, the RDF will be invalid)
implementations MAY implement some form of formatting based on the picture string, provided it is available for that particular implementation's environment

Ivan

On 19 Nov 2014, at 16:57 , danbri notifications@github.com wrote:

My proposal based on chat w/ jtandy + ivan is:

a hash structure of shortnames / values where shortname is identifier of a picture string language.

we document a convention that encourages publishers of non-8601-based datetime columns to also add a parallel column with the dates converted into simple 8601 representation. And metadata structure to make this clear.

— Reply to this email directly or view it on GitHub.

Ivan Herman, W3C Digital Publishing Activity Lead Home: http://www.w3.org/People/Ivan/ mobile: +31-641044153 ORCID ID: http://orcid.org/0000-0003-0782-2704

danbri commented 9 years ago

Re "copied verbatim", ... I assume you have the mapping to RDF/JSON use cases most heavily in mind. But are the picture strings also useful just as metadata to document the meaning of fields? even if not doing a bulk conversion...

iherman commented 9 years ago

That is probably correct.

Ivan

On 19 Nov 2014, at 18:37, danbri notifications@github.com wrote:

Re "copied verbatim", ... I assume you have the mapping to RDF/JSON use cases most heavily in mind. But are the picture strings also useful just as metadata to document the meaning of fields? even if not doing a bulk conversion...

— Reply to this email directly or view it on GitHub.

6a6d74 commented 9 years ago

Here's a summary of the key points from the teleconf discussion of Wed-19-Nov-2014

lots of data will be provided with ISO 8601 formatted date-times; but
lots of data will be provided with "local" date-time formats (such as the US MM-DD-YYYY convention)
it seems unreasonable to restrict the CSVW spec to only ISO 8601 formatted date-times
JSON and RDF formats will expect date-times in ISO 8601 format (e.g. xsd:datetime)
there are lots of date-time formatting specs and implementations "out there" - but there's no standardisation; XPath 3.0 functions, Java simpledateformat, JavaScript momentJS and simpledateformat from jaxson, python datetime, ruby and many more
each of these date-time formatting specs has its own picture string syntax (sigh)
Calendars and Eras are (important) edge cases ... but none of the use cases deal with data where date-times are not gregorian / common era, so conversions between calendars or eras is a nice to have
names, numbers expressed as words or ordinals etc. also add complexity when considering locale/language ... again, none of the use cases cover this
we almost decided that the Rec should cater for: gregorian calendar + common era + numeric values (no names) + picture strings to describe the syntax
(noting that one might be able to deal with this level of 'format parsing' using REGEXP)
... but then this would still require us to specify a picture string syntax and write a bunch of abstract tests (which is a lot of work)
so we decided that the Rec would ...
(i) document a convention that encourages publishers of non-8601-based datetime columns to also add a supplementary column with the dates converted into simple 8601 representation - and metadata structure to make this clear, and
(ii) allow data publishers to include a 'picture-string' for the date format in the metadata following one of the well known syntaxes - the syntax choice would be specified using a well-known hash (that is defined in this Rec)
(iii) if the implementation does not support that picture-string syntax OR there is no picture-string supplied, the literal string would simply be copied verbatim into the JSON or RDF output
obviously, unless the original literal is ISO 8601 format, this will likely cause an error when the JSON or RDF is parsed - but what can you do if people don't play by the rules!

So basically, we're 1) recommending that CSV is published with ISO 8601 date formats 2) noting that implementations may try to use picture-string to interpret 'naturally formatted' date-time representations 3) asserting that in the absence of a supported picture-string, the literal value is copied directly during the mapping (which might cause problems for downstream validation of RDF or JSON)

(I am assuming that dealing with 'naturally formatted' numbers, e.g. with comma ',' as decimal separator, will be treated in the same way)

6a6d74 commented 9 years ago

The proposal above (with an ISO 8601 date-time complementing the 'naturally formatted' date-time) assumes that the naturally formatted date-time column(s) can be suppressed in the output.

Need to ensure that 'column suppression in mapping' is supported.

Also I wonder how easy it would be to use a REGEXP on the naturally formatted date-times to create a 'virtual' column when parsing the data?

(I'll think about that some more!)

iherman commented 9 years ago

On 21 Nov 2014, at 16:42 , Jeremy Tandy notifications@github.com wrote:

The proposal above (with an ISO 8601 date-time complementing the 'naturally formatted' date-time) assumes that the naturally formatted date-time column(s) can be suppressed in the output.

Need to ensure that 'column suppression in mapping' is supported.

Yes.

Also I wonder how easy it would be to use a REGEXP on the naturally formatted date-times to create a 'virtual' column when parsing the data?

I just try to understand what you mean: would the metadata include a regexp pair (from and to) that should be applied on a literal before output? With the 'group' feature of the usual regular expression syntaxes that may work. I am just a little bit afraid of opening up the floodgates of some sort of an extra mapping mechanism for the mapping...

Another problem with this: as Jeremy put it, I believe, regexp is a write only language. Our authors are not necessarily computer techies; asking them to write a regexp (and test it!) rather than a picture string might be a really tall order.

Ivan

(I'll think about that some more!)

— Reply to this email directly or view it on GitHub.

Ivan Herman, W3C Digital Publishing Activity Lead Home: http://www.w3.org/People/Ivan/ mobile: +31-641044153 ORCID ID: http://orcid.org/0000-0003-0782-2704

6a6d74 commented 9 years ago

Good points ... as ever I'm pushing the boundaries (but mindful of our agreement regarding "simple mapping")

Jeremy

rufuspollock commented 9 years ago

+1 on simplicity (e.g. ISO 8601) - I also imagine we would likely have this as SHOULD rather than MUST.

JeniT commented 9 years ago

@6a6d74 can you provide an example of the metadata syntax that you have in mind based on the above discussion? It's not really clear to me.

Regarding suppressing columns, I think it's actually useful to maintain the original human-readable data so I don't think that recommending two columns necessitates having a feature to suppress columns. (Can we have a separate issue for suppressing columns please?)

JeniT commented 9 years ago

Is it something like:

"datatype": "date",
"format": {
  "unicode": "dd MMM yyyy",
  "xpath": "[D01] [MN,*-3] [Y0001]"
}

We would define unicode as meaning the TR35 format patterns and xpath as meaning the XPath patterns and enable people to use other ones if they want to (we'd need to have a registry).

I note that XPath has additional arguments for language, calendar and place when creating these strings. Should we provide mechanisms in the metadata to allow these to be set? TR35 has a whole section about how a calendar is defined (which is then used to parse/format the date).

I'm guessing that support for all of these, and for particular languages and calendars, would be implementation defined for 1.0 ie not something where we would insist on compatibility?

JeniT commented 9 years ago

For numbers, TR35 suggests that you can parse numbers without knowing what format they're actually in and the main ambiguity is over the grouping and decimal separators. So that would suggest having something like:

"datatype": "number",
"pattern": {
  "decimalChar": ","
  "groupingChar": " "
}

and having rules along the lines of:

ignore any prefix up until the first digit or sign character
parse any sign character (see By-Type chart for different possible sign characters)
read digits, ignoring any groupingChar characters up until either the decimalChar or an exponent character (see By-Type chart for different exponent characters)
if there's a decimalChar keep reading digits until an exponent character
read any exponent character
read any sign character
read any exponent digits
read any percent or per-mille character (see By-Type chart for different percent and per-mille characters)

Presumably if there is a percent or per-mille character then the resulting number should be divided by 100/1000 when mapping into a numeric value?

Should people be able to specify a particular format for the number in the schema, eg "#0.0%" to validate that all the values are in that format? I'm guessing that would be useful but it makes it more complicated again, because there are a range of special symbols in those pattern strings (see TR35 again).

iherman commented 9 years ago

For dates we decided not to go down the route of the full unicode standard; instead, the approach was to ask for a standard format, and have a structure whereby the metadata can specify the various possible "picture strings", assigned to various programming languages, that describe the data. This means implementations do not have to implement complex parsers but can rely on the 'standard' tools of their respective environments.

I would prefer to be consistent and choose the same approach for numbers.

Ivan

Ivan Herman Tel:+31 641044153 http://www.ivan-herman.net

(Written on mobile, sorry for brevity and misspellings...)

On 26 Nov 2014, at 18:10, Jeni Tennison notifications@github.com wrote:

For numbers, TR35 suggests that you can parse numbers without knowing what format they're actually in and the main ambiguity is over the grouping and decimal separators. So that would suggest having something like:

"datatype": "number", "pattern": { "decimalChar": "," "groupingChar": " " } and having rules along the lines of:

ignore any prefix up until the first digit or sign character parse any sign character (see By-Type chart for different possible sign characters) read digits, ignoring any groupingChar characters up until either the decimalChar or an exponent character (see By-Type chart for different exponent characters) if there's a decimalChar keep reading digits until an exponent character read any exponent character read any sign character read any exponent digits read any percent or per-mille character (see By-Type chart for different percent and per-mille characters) Presumably if there is a percent or per-mille character then the resulting number should be divided by 100/1000 when mapping into a numeric value?

Should people be able to specify a particular format for the number in the schema, eg "#0.0%" to validate that all the values are in that format? I'm guessing that would be useful but it makes it more complicated again, because there are a range of special symbols in those pattern strings (see TR35 again).

— Reply to this email directly or view it on GitHub.

JeniT commented 9 years ago

I am struggling to understand what this looks like in practice. Can you supply a sample of a metadata document to illustrate?

JeniT commented 9 years ago

(Agreed about aiming for consistency in approach between dates and numbers, though I think in practice there are very different considerations and levels of complexity.)

iherman commented 9 years ago

I am making this up as I write, because we did not work out the details. But it would be something like (I use the date example):

{ "datatype" : "date", "format" { "python" : "%m/%d/%Y", # standard python format "javascript" : "M/D/YYYY", # javascript's moment.js format etc. } }

Not pretty, I am the first one to say. On the other hand, referring to the Unicode standard for picture string that, afaik, nobody really implements is a major drag; it means all implementations, checkers or converters, will have to implement complex parsing for the datatypes, and do not believe this will really happen:-(

Ivan

On 26 Nov 2014, at 20:51 , Jeni Tennison notifications@github.com wrote:

I am struggling to understand what this looks like in practice. Can you supply a sample of a metadata document to illustrate?

— Reply to this email directly or view it on GitHub.

Ivan Herman, W3C Digital Publishing Activity Lead Home: http://www.w3.org/People/Ivan/ mobile: +31-641044153 ORCID ID: http://orcid.org/0000-0003-0782-2704

6a6d74 commented 9 years ago

@JeniT: for numbers I think that your proposal (specifying decimal and grouping chars) covers most of the issues.

However, is there a case where, when validating CSV files, you would wish to check the number of decimal places used? (i.e. where there are mandatory digits in the number string). For example:

#,###.00 requires two decimal places, so 12,345.60 is OK but 12,345.6 is not.

I'm guessing that this is probably beyond what we want/need - certainly the use cases don't cover this. So just specifying the grouping and decimal-separator characters is probably enough.

If we do adopt picture strings for numbers, note that the XPath function for number formatting includes some useful statements for parsing picture strings.

In fact, there are a number of rules spec'd by XPath that might be applied when we parse the number fields themselves ...

must not contain more than one decimal-separator-sign
must not contain more than one percent-sign or per-mille-sign, and it must not contain one of each
must not contain a grouping-separator-sign adjacent to a decimal-separator-sign

(etc. ... and it also talks about repeating patterns for grouping characters too)

6a6d74 commented 9 years ago

Looking at numbers and date-times, I think it's ok to have different approaches (e.g. one requires a full picture string, the other does not).

However, we have pattern used in the example for numbers and format used for date-time. I think that it would be useful to have a single token for both ... don't mind which.

Also, I note that there's a mismatch between numbers and date-times ...

numbers talks about specific formatting chars
date-times jump straight into documenting picture strings for different formats.

Feels a bit discordant ... would suggest something like:

"datatype": "number",
"format": {
  "decimalChar": ","
  "groupingChar": " "
}

"datatype": "date",
"format": {
  "picture-strings": [
      "unicode": "dd MMM yyyy",
      "xpath": "[D01] [MN,*-3] [Y0001]"
  ]
}

You can see that I've added a picture-strings array to the date formatting statement. Therefore we could now add more formatting information to the date element (as a peer to picture-strings) should later versions of the CSVW spec demand it.

6a6d74 commented 9 years ago

@JeniT: added a new issue for column suppression - #64

6a6d74 commented 9 years ago

@JeniT: should the number formatting also include specification of tokens for things like missing values (see R-MissingValueDefinition), NaN, Infinity etc.

6a6d74 commented 9 years ago

@JeniT: and here's another one ... currency token

(noting that the metadata for the column might provide additional metadata about the actual currency so that a value with token $ could be explicitly related to HK Dollar rather than assuming USD)

See ISO 4217 "Codes for the representation of currencies and funds" for authoritative currency codes

6a6d74 commented 9 years ago

There is also the habit of expressing negative numbers in parentheses ... often used in accounting. So we'd want to be able to parse (12,522) as -12522

JeniT commented 9 years ago

discussed 14/1/2015 and decided to support a fixed set of known/popular date-time formats, which we will list, probably using Unicode picture string formats for the names. Editors to propose a list and work out whether this needs to be supplemented with the names/abbreviations for months.

JeniT commented 9 years ago

Note that this should be handled in the metadata document, and conversion documents do not need to repeat the logic.

JeniT commented 9 years ago

This can be closed when both #171 and #172 are closed.

w3c / csvw

Pattern string formats for parsing dates/numbers/durations #54