Array brackets as shortcuts conflicts with YAML

DylanVanAssche commented 1 year ago

During the VAIA course we received the following feedback:

it's a well-known shorthand in standard YAML... the use of square brackets here for YARRRML conflicts with that YAML shorthand

Array brackets conflict with standard YAML, where they are the shorthand for arrays. Just noting it here so we can address it later.

bjdmeest commented 1 year ago

For which specific case? Are you talking about, e.g., XPath? I would be hesitant to change the XPath syntax; I would rather add a note to the spec that, if the reference contains special characters such as [ and ], you should put the template in quotes.

DylanVanAssche commented 1 year ago

Everything actually, not a specific reference formulation or anything like that. It is about the syntax for a POM, for example:

po:
  - [ foaf:name, $(name), en~lang ]

The array is not homogeneous in its types: it is actually an object put into an array, where [0] matches the predicates, [1] matches the objects, and [2] (here en~lang) gives the language tag of the literal.

Their comment was mostly that the [ and ] brackets suggest an array, as in YAML, while in practice it is not an 'array' but more like an object. The shortcut therefore felt a bit non-intuitive when you are used to YAML. I kind of see where this is coming from. Not sure how to solve it, just putting it here so we can think about it when standardizing YARRRML.
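
To make the "object in an array" point concrete, here is a minimal Python sketch that expands the already-parsed shortcut list into a structure with named keys. The function name `expand_pom` is hypothetical, and the key names (`p`, `o`, `value`, `language`, `datatype`) are assumptions modelled on the long-form YARRRML syntax; only the language-tag and datatype cases are handled.

```python
def expand_pom(shortcut):
    """Expand a parsed [predicate, object, type] list into named keys.

    Hypothetical helper: the key names below are assumptions, not taken
    from the specification text in this thread.
    """
    if not 2 <= len(shortcut) <= 3:
        raise ValueError("expected [predicate, object] or [predicate, object, type]")

    predicate, obj = shortcut[0], shortcut[1]
    expanded = {"p": predicate, "o": {"value": obj}}

    if len(shortcut) == 3:
        third = shortcut[2]
        if third.endswith("~lang"):
            # "en~lang" marks a language tag rather than a datatype
            expanded["o"]["language"] = third[: -len("~lang")]
        else:
            # anything else is treated here as a datatype such as xsd:string
            expanded["o"]["datatype"] = third
    return expanded


# The list is what a YAML parser returns for [ foaf:name, $(name), en~lang ]
pom = expand_pom(["foaf:name", "$(name)", "en~lang"])
```

Each position in the list has its own fixed meaning, which is what distinguishes it from an ordinary YAML sequence of like-typed items.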

midorna commented 3 weeks ago

This issue has been open for quite some time, and I agree with @DylanVanAssche's observations. Some thoughts on this were already shared with Dylan and Ben via email some months ago, which led to my proposal to use a functional notation, pom(), in a future standard.

Proposal
Use pom(<predicate>, <object> [, <object type>]) as a template for a predicate-object mapping (POM) to avoid the current YAML list notation.
Lists of predicates and lists of objects are allowed, as defined in the current YARRRML specification. The third parameter, which can be used to define the object's type, is optional (default: xsd:string).

Example with current syntax:

...
    po: [[foaf:name, $(name), en~lang]]

Example with proposed syntax:

...
    po: [pom(foaf:name, $(name), en~lang)]

Some further points for a discussion:

Shall we allow a single POM as the value of the field predicateobjects, or is a list of POMs always required (as it is defined now)? With a single POM allowed, the example above would look like
```
...
    po: pom(foaf:name, $(name), en~lang)
```
The tilde can be used as operator in different parts of a mapping definition with different meanings (in the example: identification of a language tag). In POMs, an IRI object must also be defined using the tilde (e.g. in pom(:id, my:$(ID)~iri)).
Shall we avoid this type of overloading, e.g. by writing pom(:id, my:$(ID), iri)?
Shall we add a parameter for conditions?
Note: A shortcut notation for join conditions was proposed already in (Iglesias-Molina et al. 2023), also using functional notation.

bjdmeest commented 2 weeks ago

First, a minor clarification: you currently can do po: [foaf:name, $(name), en~lang], i.e. no nested array is required.

The current YARRRML syntax is a trade-off between (i) user-friendliness and (ii) maximally relying on the existing YAML syntax (and by extension, existing YAML parsers). By using the existing [ ] YAML construct, we can rely on current array parsing functionality, whereas if you parse the suggestion below as YAML, you get the JSON below

po: pom(foaf:name, $(name), en~lang)

{
  "po": "pom(foaf:name, $(name), en~lang)"
}

I.e., this still requires some additional parsing on top of the YAML-to-JSON conversion. Not a huge deal, but it means your YARRRML parser needs to take all kinds of quirks into account. For example, when your reference formulation also uses , in its syntax, you suddenly need to handle escaping yourself. You could choose to stay close to YAML syntax and describe it like below (YAML basically says 'if you want to use the exact string, put it in quotes'):

po: [foaf:name, "$(na,me)", en~lang]

or you could choose to add escaping to the YARRRML language, something like below

po: pom(foaf:name, $(na\,me), en~lang)

So you now have to add support for escaping in your YARRRML parser, and we need to add more documentation to YARRRML instead of relying on existing YAML.
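
As a rough illustration of that extra work, the following Python sketch splits a pom(...) string into its arguments while honouring a backslash-escaped comma. The function name `parse_pom` and the `\,` escape rule are assumptions for illustration; neither is defined by the YARRRML specification, and nested function calls or quoted strings are not handled.

```python
def parse_pom(text):
    """Split 'pom(a, b, c)' into ['a', 'b', 'c'], treating '\\,' as a literal comma.

    Hypothetical sketch: the escape rule is an assumption, not part of YARRRML.
    """
    text = text.strip()
    if not (text.startswith("pom(") and text.endswith(")")):
        raise ValueError("not a pom(...) expression: " + text)

    body = text[len("pom("):-1]
    args, current = [], []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body) and body[i + 1] == ",":
            # escaped comma: keep the comma, drop the backslash
            current.append(",")
            i += 2
            continue
        if ch == ",":
            # unescaped comma ends the current argument
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    args.append("".join(current).strip())
    return args


plain = parse_pom("pom(foaf:name, $(name), en~lang)")
escaped = parse_pom(r"pom(foaf:name, $(na\,me), en~lang)")
```

All of this logic lives outside the YAML parser, so its rules would have to be specified and documented as part of YARRRML itself.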

By using the [ ] notation, you rely on existing YAML syntax and functionality, because adding a , in a list element requires you to encapsulate the element in double quotes, so

po: [foaf:name, "$(na,me)", en~lang]

becomes

{
  "po": [
    "foaf:name",
    "$(na,me)",
    "en~lang"
  ]
}

Honestly, I don't see an improvement there: the po is one big string that actually means a function (so semantically it's a bit similar to the current YARRRML syntax, which is an array that actually means a function), the parser needs to do more work outside of the YAML spec, and the resulting line is in fact more verbose than the initial YARRRML syntax (po: [foaf:name, $(name), en~lang]).

In the end, it's still just a shortcut, so I'd rather remove the shortcut than add a lot of burden to the limited number of YARRRML developers. Hopefully the object-style syntax is clear enough for lay people. We could update our tutorials etc. to focus on the object-style syntax, if that would remove confusion.

Another solution would be to start from a completely functional syntax (e.g. stemming from a functional programming language), but that feels like a big endeavor. Maybe an SDK to programmatically create mappings would make more sense then.

midorna commented 2 weeks ago

Thanks for the feedback, @bjdmeest.

First, a minor clarification: you currently can do po: [foaf:name, $(name), en~lang], i.e. no nested array is required.

That's good to know since the specification shows only lists of POs and all examples I know of show lists, too.

The current YARRRML syntax is a trade-off between (i) user-friendliness and (ii) maximally relying on the existing YAML syntax (and by extension, existing YAML parsers). [..]

Since I share this view of the trade-off, I would like to stress again that POs are not ordinary YAML lists but fixed tuples. Reusing a syntax for a different purpose does not make it more user-friendly. I see a conflict here because those PO lists need specific treatment anyhow. If you run into an issue after the YAML has been parsed, the problem is merely postponed, just as it is when po() is written as a string. String parsing is required to identify prefixes, references, tildes and other types of shortcuts anyhow, right?
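
As an illustration of the string parsing that remains after YAML has done its part, here is a small Python sketch that takes one list element and separates the tilde suffix and the $(...) references. The function name `analyse_term` and the returned keys are hypothetical, and the regular expression is a simplification that does not cover nested parentheses.

```python
import re

# Matches $(...) references; a simplification that ignores nested parentheses.
REFERENCE = re.compile(r"\$\(([^)]*)\)")


def analyse_term(term):
    """Split a term such as 'my:$(ID)~iri' into value, tilde suffix and references."""
    value, sep, suffix = term.rpartition("~")
    if not sep:
        # no tilde present: the whole term is the value
        value, suffix = term, None
    return {
        "value": value,
        "suffix": suffix,
        "references": REFERENCE.findall(value),
    }


iri_term = analyse_term("my:$(ID)~iri")
plain_term = analyse_term("$(name)")
```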

Maybe the main point here is whether functional notation should be integrated into YARRRML or not. I think it helps user-friendliness, since common YAML configurations do not use deeply nested dictionaries the way YARRRML does, e.g. for conditions and nested functions. During mapping development, functional-style notation with opening and closing parentheses and a fixed number of parameters also helps avoid indentation issues. Hence, I think this is even more user-friendly while remaining compatible with YAML syntax.

kg-construct / yarrrml-spec

Array brackets as shortcuts conflicts with YAML #8