inbo / whip

✅ Human and machine-readable syntax to express specifications for data
MIT License
7 stars 0 forks source link

When does a value need to be quoted? #15

Closed peterdesmet closed 7 years ago

peterdesmet commented 7 years ago

We should probably provide a recommendation when values should be quoted:

license:
  allowed: http://creativecommons.org/publicdomain/zero/1.0/ # is this valid or does it need quotes?
informationWithheld:
  allowed: see metadata # is this valid or does this need quotes?
stijnvanhoey commented 7 years ago

Good comment.

Other relevant examples to take into account are:

length:
    numberformat: .3  # interpreted as float 0.3 by pyyaml
code_id:
    regex: [D-G]  # should be `[D-G]`, otherwise interpreted as list by pyyaml

YAML specs info

tl;dr: applications such as pyyaml will implicitly extract the data type of the input, resulting in float, int, list,.. data types in the interpreted yaml file...

We should check this with respect to the general YAML specifications. I'll checked the docs as try to bring some short summary here. Central are the nodes

A YAML node represents a single native data structure. Such nodes have content of one of three kinds: scalar, sequence, or mapping.

When checking the information on [native data structures](http://yaml.org/spec/1.2/spec.html#native data structure//):

YAML represents any native data structure using three node kinds: sequence - an ordered series of entries; mapping - an unordered association of unique keys to values; and scalar - any datum with opaque structure presentable as a series of Unicode characters. Combined, these primitives generate directed graph structures. These primitives were chosen because they are both powerful and familiar: the sequence corresponds to a Perl array and a Python list, the mapping corresponds to a Perl hash table and a Python dictionary. The scalar represents strings, integers, dates, and other atomic data types.

Hence, these scalar representations are essential, as these define our data specifications itself (e.g. see metadata in the example). The interpretation of the data types can be handled by tags.

YAML represents type information of native data structures with a simple identifier, called a tag

However, as we probably not want to let users provide the appropriate tags in each field themselve, this is subjected to implicit tag definition:

In YAML, untagged nodes are given a type depending on the application.

For example, when checking the documentation of the pyyaml package:

Plain scalars without explicitly defined tags are subject to implicit tag resolution. The scalar value is checked against a set of regular expressions and if one of them matches, the corresponding tag is assigned to the scalar. PyYAML allows an application to add custom implicit tag resolvers.

More information on the tag resolution according to the Core Schema is provided here. This implicit tag resolution is according to the examples above.

What now?

We could actually start to adapt the YAML-file resolution within the pyyaml application (e.g. the function add_implicit_resolver can be used to provide custom resolvers). As such, we maybe can resolve all specifications (sections after :) as (unicode) string.

Notice:

I would rather keep as close to general YAML handling (i.e. pyyaml load function results in object that can be used as such) and document the necessity of the quotes for certain specifications in the whip specifications. To my knowledge (and currently tested), quotes are required for the following specifications: regex, dateformat and numberformat.

peterdesmet commented 7 years ago

Agree to document the necessity of the quotes for certain specifications in the whip specifications. That is indeed the case (and documented) for: regex, dateformat and numberformat.