kg-construct / rml-core

RML-Core: Main features for RDF generation with RML
https://w3id.org/rml/core/spec
Creative Commons Attribution 4.0 International

Join specification when logical source is the same #74

Open bjdmeest opened 1 year ago

bjdmeest commented 1 year ago

Let's say we have two triple maps that refer to the same logical source (and with same, we really mean same URI, not "same because the descriptions lead to the semantically same logical source").

Sample source (CSV)

id,parent_id
1,2
2,1

Base mapping (YARRRML)

prefixes:
  ex: http://example.com#
sources:
  test: [data.csv]
mappings:
  test1:
    sources: test
    s: ex:$(id)
    po:
      - p: ex:parent
        o:
          mapping: test2
  test2:
    sources: test
    s: ex:$(parent_id)

We have the following use cases that are underspecified in the spec.

The spec currently says: "If the logical source of the child triples map and the logical source of the parent triples map of a referencing object map are not identical, then the referencing object map must have at least one join condition."

  1. If a join condition is specified AND the logical source is not the same: common case, execute join condition between each iteration pair
  2. If a join condition is specified AND the logical source is the same: same as above
  3. If no join condition is specified AND the logical source is not the same: do a full join (i.e., take all iterations into account)
    • example output: ex:1 ex:parent ex:2, ex:1 ex:parent ex:1, ex:2 ex:parent ex:2, ex:2 ex:parent ex:1
  4. If no join condition is specified AND the logical source is the same: don't do a full join, but take the current iteration into account
    • example output: ex:1 ex:parent ex:2, ex:2 ex:parent ex:1
    • this last one is the edge case, but it allows a 'join per iteration'. The question is: should we make this edge case explicit, or should there be a different way to tackle this edge case?
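
To make cases 3 and 4 concrete, here is a minimal Python sketch run against the sample CSV above. It is not how any RML processor is implemented; the function names `per_iteration` and `full_join` are made up for illustration, and the subject templates are hard-coded from the base mapping.

```python
import csv
import io

# Sample source from this issue, inlined so the sketch is self-contained.
CSV_TEXT = "id,parent_id\n1,2\n2,1\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))


def subject_test1(row):
    # Subject template of mapping test1: ex:$(id)
    return f"ex:{row['id']}"


def subject_test2(row):
    # Subject template of mapping test2: ex:$(parent_id)
    return f"ex:{row['parent_id']}"


def per_iteration(source_rows):
    """Case 4 (hypothetical helper): same logical source, no join condition.
    The parent subject is generated from the current iteration."""
    return [(subject_test1(r), "ex:parent", subject_test2(r)) for r in source_rows]


def full_join(child_rows, parent_rows):
    """Case 3 (hypothetical helper): no join condition, different sources.
    Every child iteration is paired with every parent iteration."""
    return [
        (subject_test1(c), "ex:parent", subject_test2(p))
        for c in child_rows
        for p in parent_rows
    ]


print(per_iteration(rows))    # two triples, one per row
print(full_join(rows, rows))  # four triples, the cartesian product
```

The only difference between the two helpers is whether the parent subject is taken from the current row or from every row, which is exactly the distinction the spec would have to pin down.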

pmaria commented 1 year ago

  • If no join condition is specified AND the logical source is the same: don't do a full join, but take the current iteration into account

    • example output: ex:1 ex:parent ex:2, ex:2 ex:parent ex:1
    • this last one is the edge case, but it allows a 'join per iteration'. The question is: should we make this edge case explicit, or should there be a different way to tackle this edge case?

Why is this last one an edge case? IMO this is regular and intended behavior.

R2RML states:

The child query of a referencing object map is the effective SQL query of the logical table of the term map containing the referencing object map.

The parent query of a referencing object map is the effective SQL query of the logical table of its parent triples map.

If the child query and parent query of a referencing object map are not identical, then the referencing object map must have at least one join condition.

The joint SQL query of a referencing object map is:

If the referencing object map has no join condition: SELECT * FROM ({child-query}) AS tmp

[...] The joint SQL query is used when generating RDF triples from referencing object maps.

So if the query of the logical table / iterator + ref. formulation + source of the logical source are equal, the child query / child logical source will be used for the reference object values. Thus the join will be executed per iteration.
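
As a rough illustration of the R2RML rule quoted above, the following Python sketch builds the joint SQL query and runs it with the standard-library `sqlite3` module. The table name `data`, the helper `joint_query`, and the join condition used at the end are assumptions for this example only; the two query shapes follow the R2RML definition of the joint SQL query.

```python
import sqlite3

# In-memory table standing in for the sample CSV (table name is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id TEXT, parent_id TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [("1", "2"), ("2", "1")])

# Child and parent queries are identical, so no join condition is required.
child_query = "SELECT id, parent_id FROM data"
parent_query = "SELECT id, parent_id FROM data"


def joint_query(child, parent, join_conditions=()):
    """Hypothetical helper: build the joint SQL query of a referencing object map.
    join_conditions is a sequence of (child_column, parent_column) pairs."""
    if not join_conditions:
        # No join condition: only the child query is used,
        # so each result row is a single iteration.
        return f"SELECT * FROM ({child}) AS tmp"
    where = " AND ".join(
        f"child.{c} = parent.{p}" for c, p in join_conditions
    )
    return (
        f"SELECT * FROM ({child}) AS child, ({parent}) AS parent WHERE {where}"
    )


# Without a join condition, each row carries both id and parent_id.
print(conn.execute(joint_query(child_query, parent_query)).fetchall())

# With an (illustrative) join condition, child and parent rows are paired explicitly.
with_join = joint_query(child_query, parent_query, [("parent_id", "id")])
print(conn.execute(with_join).fetchall())
```

In the no-condition case the result has one row per child iteration, which is why the join is effectively executed per iteration.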

bjdmeest commented 1 year ago

Ah, so then the current spec doesn't allow the use case "join over all iterations without join condition".

I see 2 potential paths:

  1. if we agree that we don't really need that use case: maybe clarify the spec a bit that both "the same logical source ID is reused in a different mapping" and "a different logical source with the same description (i.e. equal query of the logical table / iterator + ref. formulation + source of the logical source)" are interpreted as "identical logical sources".
    • we can still support this use case by, e.g. adding a join condition that always returns true
  2. if we agree that we do need that use case: one way to resolve that is to make the distinction between reusing the logical source ID vs have the same descriptions in different logical sources.

If there are no alternative opinions, I think I would vote for option 1: fewer edge cases, and potential full joins are made explicit. The trade-off is that for every logical source type we need to specify its 'identifying attributes' (e.g. a CSVW source requires separator, null value specifier, ... all in its 'identifying attributes'), and the edge case is more verbose.

dachafra commented 1 year ago

As it's a join issue, I'm going to move it to its corresponding repo

elsdvlee commented 11 months ago

The ambiguity of 'identical logical sources' came up again during the TF Meeting on join 8/9/2023.

  1. Can we agree to add to the spec that 'identical logical sources' are logical sources with the same identifier (and not "same because the descriptions lead to the semantically same logical source")? This allows a clean definition that cannot be misunderstood. I see no need to accept also 'descriptions that lead to the semantically same logical source' as 'identical sources', as you can always choose that first option when you write a mapping file. Mapping engines can still implement strategies to optimize (some) joins with join conditions over 'semantically same logical sources', however that doesn't have to be described in the spec.
  2. In the new spec draft (https://kg-construct.github.io/rml-core/spec/docs/#joins) I don't find the reference to the joint queries of R2RML nor the requirement that a join condition is needed when logical sources are not identical (see the corresponding description of the old spec below). Can we agree to also add the old description to the new spec? Was there any reason to leave this out? Or is this underspecified because we are waiting for the outcome of the TF Meeting on Joins?

    The joint query is used when generating RDF triples from referencing object maps. If the logical source of the child triples map and the logical source of the parent triples map of a referencing object map are not identical, then the referencing object map must have at least one join condition.

pmaria commented 11 months ago

The ambiguity of 'identical logical sources' came up again during the TF Meeting on join 8/9/2023.

  1. Can we agree to add to the spec that 'identical logical sources' are logical sources with the same identifier (and not "same because the descriptions lead to the semantically same logical source")? This allows a clean definition that cannot be misunderstood. I see no need to accept also 'descriptions that lead to the semantically same logical source' as 'identical sources', as you can always choose that first option when you write a mapping file. Mapping engines can still implement strategies to optimize (some) joins with join conditions over 'semantically same logical sources', however that doesn't have to be described in the spec.

I don't see why we would need this. Nowhere else is this currently needed. It is, however, important to define that descriptions that lead to the same logical source are considered equal. For example, if you describe the same logical source on different triples maps with inline blank nodes. Equality via URI is basically a free gift from RDF.

I would also prefer to not use RDF language specifics to define behavior, since this may impede the introduction of non-RDF language bindings.

elsdvlee commented 7 months ago

We still need an agreement and specification in RML-core on:

  1. the definition of same logical source:
    • option 1.1: same identifier
    • option 1.2: descriptions that lead to the same logical source
  2. the behaviour in case of a referencing object map / referencing term map without join conditions:
    • option 2.1: natural join in case of same logical sources, error in case of different logical sources (this is in line with the R2RML spec and the old RML spec)
    • option 2.2: natural join in case of same logical sources, cartesian product in case of different logical sources

In case of option 1.2: can we also add a clear description to the spec of how to compare the descriptions, e.g. a string-wise comparison of the rml:source and rml:iterator objects? Or should a full comparison of the nested source also be done? What about, e.g., two DCAT descriptions with different identifiers that lead to the same source?

A proposal for the spec, see join types in: https://github.com/elsdvlee/rml-core/blob/main/spec/docs/joinconditions.md Feel free to suggest improvements. This proposal is still to be adapted to any decision from the community on the open questions raised above.

elsdvlee commented 7 months ago

@dachafra I propose to move this issue back to the rml-core repo.

dachafra commented 7 months ago

yep! Transferring it....!

dachafra commented 7 months ago

To continue the discussion of this issue, and considering that there is already a spec written, I would suggest making a PR @elsdvlee so the rest can review it and provide comments!

elsdvlee commented 7 months ago

@dachafra see pull request https://github.com/kg-construct/rml-core/pull/78

dachafra commented 7 months ago

awesome! Please assign @andimou @pmaria @bjdmeest @DylanVanAssche as potential reviewers

pmaria commented 4 months ago

My view on defining equality of logical sources:

Object equality in programming languages is used as the basis for many things, for example uniqueness checks and hashing in data structures such as dictionaries and sets. I strongly believe we should be able to leverage this for logical sources. I think source and logical source equality is something that is very useful to have when building RML processors.

Therefore, I would propose to come up with a definition of equality which can be implemented as such.

My proposal would be to define a logical source or source to be equal to another logical source or source if the RML-defined properties of the description of both are equal.

RML-defined: Those properties that are defined by a specification to have behavior that influences the behavior of an RML processor.

These properties MUST be listed for the rml:LogicalSource specification.

These properties MUST be listed for any rml:Source description.

Doing so will allow RML processors to map these descriptions to standard object equality mechanisms in their respective programming languages to best leverage the language's abilities.
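
As one possible illustration of this proposal, here is a minimal Python sketch using frozen dataclasses. The class and field names, and the choice of which fields count as "RML-defined", are assumptions made for this example; the actual list would have to come from the specifications. Fields marked `compare=False` (here the node identifier and a label) are ignored for both equality and hashing.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Source:
    # Assumed RML-defined properties (illustrative only).
    path: str
    encoding: Optional[str] = None
    # The IRI or blank node label of the description: not part of equality.
    node_id: Optional[str] = field(default=None, compare=False)


@dataclass(frozen=True)
class LogicalSource:
    # Assumed RML-defined properties (illustrative only).
    source: Source
    reference_formulation: str
    iterator: Optional[str] = None
    # Not RML-defined in this sketch, so excluded from equality and hashing.
    node_id: Optional[str] = field(default=None, compare=False)
    label: Optional[str] = field(default=None, compare=False)


# Same description, once with an IRI and once as an inline blank node.
a = LogicalSource(Source("data.csv"), "rml:CSV", node_id="ex:source1")
b = LogicalSource(Source("data.csv"), "rml:CSV", node_id="_:b0", label="inline copy")
c = LogicalSource(Source("other.csv"), "rml:CSV")

print(a == b)          # equal: all compared fields match
print(a == c)          # not equal: the source path differs
print(len({a, b, c}))  # a and b collapse to one entry in a set
```

With this shape, a processor could keep logical sources in a set or use them as dictionary keys, and two triples maps with equal descriptions would resolve to the same entry regardless of how they were identified in the mapping file.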