kg-construct / mapping-challenges

Issues for discussion about limitations of current mapping languages
Apache License 2.0

Iteration over hierarchical documents: need to access fields outside the iteration #20

Open frmichel opened 3 years ago

frmichel commented 3 years ago

A need that was discussed a couple of times: during the iteration over hierarchical documents (e.g. with rml:iterator), it is no longer possible to access the fields above the iteration, or in other words outside of the iterated part. Yet this is sometimes necessary, typically to build unique identifiers using ids at different hierarchical levels.

xR2RML proposes one solution to do that using the "pushDown" property.

herminiogg commented 3 years ago

I leave here a link to an issue about join conditions in the rmlmapper-java implementation (https://github.com/RMLio/rmlmapper-java/issues/28). My last comment there was about this same problem, which I find troublesome, specifically in the case of JSON files. However, in the older RML reference implementation (a.k.a. RML-Mapper) it seems to work the other way round, so it was possible to keep this hierarchical information. Maybe this would be the right place to discuss whether it is better to offer this implicitly or to have the user declare it explicitly.

VladimirAlexiev commented 3 years ago

We need some systematic way to refer to parent and neighbor XML nodes, like what's available in xpath and xquery. To appreciate the complexity, I'll show a bit (5-10%) of our conversion of clinicaltrials.gov XML (CT) to RDF (using a custom ontology cto:).

Here is an outline. On the left is the CT element/attribute hierarchy, and on the right a mapping to props, classes and literals:

*      measure_list+
*        measure                                      cto:measure a cto:Measure
*          title+                                     dc:title
*          description?                               dc:description
*          population?                                cto:population
*          units?                                     cto:units
*          units_analyzed                             cto:unitsAnalyzed
*          param?                                     cto:param
*          dispersion?                                cto:dispersion
*          analyzed_list
*            analyzed: a Analyzed                     cto:analyzed a cto:Analyzed
*          class_list
*            class: a Class                           cto:class a cto:Class
*              title                                  dc:title
*              analyzed_list
*                analyzed: a Analyzed                 cto:analyzed a cto:Analyzed
*              category_list
*                category+: a Category                cto:category a cto:MeasureCategory
*                  title                              dc:title
*                  measurement_list
*                    measurement: a Measurement       cto:measurement a cto:Measurement
*                      xsd:string                     dc:title
*                      @group_id/substring(.,2)       cto:group substring(.,2)
*                      @value                         cto:value^^xsd:decimal
*                      @spread                        cto:spread^^xsd:decimal
*                      @lower_limit                   cto:lowerLimit^^xsd:decimal
*                      @upper_limit                   cto:upperLimit^^xsd:decimal

The "turtle with embedded fields" below shows the nodes and connectivity between them, and xpath to illustrate where the data comes from:

<(nct_id)/baseline/measure/($n)> a cto:Measure;
  puml:label "/clinical_study/clinical_results/baseline/\n  measure_list/measure";
  dc:title "(title)";
  dc:description "(description)";
  cto:population "(population)";
  cto:units "(units)";
  cto:unitsAnalyzed "(units_analyzed)";
  cto:param "(param)";
  cto:dispersion "(dispersion)";
  cto:analyzed <(nct_id)/baseline/measure/($n)/analyzed/($m)>;
  cto:class <(nct_id)/baseline/measure/($n)/class/($m)>.

<(nct_id)/baseline/measure/($n)/class/($m)> a cto:Class;
  puml:label "/clinical_study/clinical_results/baseline/\n  measure_list/measure/class_list/class";
  dc:title "(title)";
  cto:analyzed <(nct_id)/baseline/measure/($n)/class/($m)/analyzed/($p)>;
  cto:category <(nct_id)/baseline/measure/($n)/class/($m)/category/($p)>.

<(nct_id)/baseline/measure/($n)/class/($m)/analyzed/($p)> a cto:Analyzed;
  puml:label "/clinical_study/clinical_results/baseline/\n  measure_list/measure/class_list/class/analyzed_list/analyzed";
  cto:units "(units) # all are 'Participants'";
  cto:scope "(scope)";
  cto:participants <(nct_id)/baseline/measure/($n)/class/($m)/analyzed/($p)/participants/($q)>.

<(nct_id)/baseline/measure/($n)/class/($m)/analyzed/($p)/participants/($q)> a cto:ParticipantsCount;
  puml:label "/clinical_study/clinical_results/baseline/measure_list/measure/\n  class_list/class/analyzed_list/analyzed/count_list/count";
  dc:title "(.) # most often missing";
  cto:group <(nct_id)/group/(@group_id/substring(.,2))>;
  cto:count "(@value)^^xsd:integer".

<(nct_id)/baseline/measure/($n)/class/($m)/category/($p)> a cto:MeasureCategory;
  puml:label "/clinical_study/clinical_results/baseline/measure_list/measure/\n  class_list/class/category_list/category";
  dc:title "(title)";
  cto:measurement <(nct_id)/baseline/measure/($n)/class/($m)/category/($p)/measurement/($q)>.

<(nct_id)/baseline/measure/($n)/class/($m)/category/($p)/measurement/($q)> a cto:Measurement;
  puml:label "/clinical_study/clinical_results/baseline/measure_list/measure/\n  class_list/class/category_list/category/measurement_list/measurement";
  dc:title "(.)";
  cto:group <(nct_id)/group/(@group_id/substring(.,2))>;
  cto:value "(@value)^^xsd:decimal";
  cto:spread "(@spread)^^xsd:decimal";
  cto:lowerLimit "(@lower_limit)^^xsd:decimal";
  cto:upperLimit "(@upper_limit)^^xsd:decimal".

In the URL <(nct_id)/baseline/measure/($n)/class/($m)/category/($p)/measurement/($q)> we use fields from 5 levels of the hierarchy: the root CT URL, and 4 sequence counters (see "at $n" in the XSPARQL code below).
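
To make the URL composition concrete, here is a minimal Python sketch (not part of our conversion) that walks a toy XML fragment with nested position counters, in the spirit of XQuery's "for $i at $n". The sample document, the NCT id "NCT00000000" and the function name measurement_urls are assumptions invented for illustration; only the element names follow the outline above.

```python
import xml.etree.ElementTree as ET

# Toy fragment: element names follow the outline above, values are invented.
SAMPLE = """
<clinical_study>
  <clinical_results><baseline><measure_list>
    <measure>
      <class_list><class>
        <category_list>
          <category>
            <measurement_list>
              <measurement group_id="B1" value="10"/>
              <measurement group_id="B2" value="12"/>
            </measurement_list>
          </category>
        </category_list>
      </class></class_list>
    </measure>
  </measure_list></baseline></clinical_results>
</clinical_study>
"""

def measurement_urls(root, nct_id):
    """Yield one URL per measurement, built from the study id and 4 counters."""
    measures = root.findall("clinical_results/baseline/measure_list/measure")
    for n, measure in enumerate(measures, start=1):
        for m, cls in enumerate(measure.findall("class_list/class"), start=1):
            for p, cat in enumerate(cls.findall("category_list/category"), start=1):
                path = "measurement_list/measurement"
                for q, _ in enumerate(cat.findall(path), start=1):
                    # Every level contributes one segment of the identifier.
                    yield (f"{nct_id}/baseline/measure/{n}/class/{m}"
                           f"/category/{p}/measurement/{q}")

# "NCT00000000" is a hypothetical study id.
urls = list(measurement_urls(ET.fromstring(SAMPLE), "NCT00000000"))
print(urls)
```

Each inner loop still sees the counters of the enclosing loops, which is exactly the access to "outer" fields that an iterator-only mapping loses.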

Let me know if you'd like to see a diagram of the complete model.

We have implemented this conversion with XSPARQL, which is XQuery plus templates to emit triples:

declare function local:rdf_measure ($url as xs:string, $base as xs:string, $meas as xs:string, $measure) {
  let $url := $url
  construct {
    <{$base}> cto:measure <{$meas}>.
    <{$meas}> a cto:Measure;
      dc:title          {$measure/title/text()};
      dc:description    {$measure/description/text()};
      cto:population    {$measure/population/text()};
      cto:units         {$measure/units/text()};
      cto:unitsAnalyzed {$measure/units_analyzed/text()};
      cto:param         {$measure/param/text()};
      cto:dispersion    {$measure/dispersion/text()}.
      {
        for $i at $n in $measure/analyzed_list/analyzed return
          local:rdf_analyzed ($url, $meas, fn:concat($meas,"/analyzed/",$n), $i),
        for $i at $n in $measure/class_list/class return
          local:rdf_class    ($url, $meas, fn:concat($meas,"/class/",$n), $i)
      }
    }
};

declare function local:rdf_class ($url as xs:string, $meas as xs:string, $cls as xs:string, $class) {
  let $url := $url
  construct {
    <{$meas}> cto:class <{$cls}>.
    <{$cls}> a cto:Class;
      dc:title {$class/title/text()}.
      {
        for $i at $n in $class/analyzed_list/analyzed return
          local:rdf_analyzed        ($url, $cls, fn:concat($cls,"/analyzed/",$n), $i),
        for $i at $n in $class/category_list/category return
          local:rdf_measureCategory ($url, $cls, fn:concat($cls,"/category/",$n), $i)
      }
    }
};

declare function local:rdf_measureCategory ($url as xs:string, $meas as xs:string, $cat as xs:string, $category) {
  let $url := $url
  construct {
    <{$meas}> cto:category <{$cat}>.
    <{$cat}> a cto:MeasureCategory;
      dc:title {$category/title/text()}.
      {
        for $i at $n in $category/measurement_list/measurement return
          local:rdf_measurement($url, $cat, fn:concat($cat,"/measurement/",$n), $i)
      }
    }
};

declare function local:rdf_measurement ($url as xs:string, $cat as xs:string, $meas as xs:string, $measurement) {
  let $url := $url
  construct {
    <{$cat}> cto:measurement <{$meas}>.
    <{$meas}> a cto:Measurement;
      cto:group <{fn:concat($url, "/group/", $measurement/@group_id/substring(.,2))}>;
      dc:title        {$measurement/text()};
      cto:value       {func:clean_number($measurement/@value/string())}^^xsd:decimal;
      cto:spread      {$measurement/@spread/string()};
      cto:lowerLimit  {func:clean_number($measurement/@lower_limit/string())}^^xsd:decimal;
      cto:upperLimit  {func:clean_number($measurement/@upper_limit/string())}^^xsd:decimal.
    }
};

How would you express this in RML?

andimou commented 3 years ago

@frmichel

let me quote you

it is not possible to access the fields above the iteration, or in other words outside of the iterated part.

do you think this is an RML issue or an R2RML issue?

I guess that "above" comes from hierarchical data, thus RML, but "outside the iteration" touches R2RML as well. My concern is that trying to solve the "above" issue might be too specific to certain data structures, but trying to solve the "outside" issue seems to me more universal.

frmichel commented 3 years ago

Hi @andimou, indeed the solution we are seeking has to be generic enough to capture various use cases. I think the discussion needs to focus on the concept of "iteration model". In R2RML, the iteration model is very simple: relational table lines. This is the same for any kind of tabular data. So I can't really think of a case where R2RML would require such a feature of accessing data "above" or "outside" the iteration.

But this need arises as soon as we deal with hierarchical data, hence the case of RML with the rml:iterator that modifies the iteration model for the scope of a triples map. Even more so in xR2RML, where an xrr:nestedTermMap can change the iteration model for the scope of a single term map.

Now, wrt. the words we use, above/down or outside/inside, probably the latter is more general indeed. In this sense, the choice of the name xrr:pushDown was pragmatic but probably not so clever.

I'm wondering whether we could have cases where data would be neither tabular nor hierarchical. If we query a graph database (other than an RDF database of course), the iteration could be on some sets of nodes that match a certain query pattern within the graph. Then I'm not sure what the "outside" term would mean here, but it is certainly more generic than the "above" one.

pmaria commented 3 years ago

My concern is that trying to solve the "above" issue might be too specific to certain data structures, but trying to solve the "outside" issue seems to me more universal.

+1

@frmichel I'm trying to understand the xrr:pushDown behavior and how it would be implemented.

Given

{
  "records": [
    {
      "id": "1",
      "enteredBy": "Alice",
      "cars": [
        {
          "make": "Mercedes"
        },
        {
          "make": "Honda"
        }
      ]
    },
    {
      "id": "2",
      "enteredBy": "Bob",
      "cars": [
        {
          "make": "Mercedes"
        },
        {
          "make": "Toyota"
        }
      ]
    }
  ]
}

and logical source:

[] 
  rml:source "some-source.json" ;
  rml:referenceFormulation ql:JSONPath ;
  rml:iterator "records.[*].cars.*" ;
  xrr:pushDown [
    xrr:reference "records.[*].id" ;
    xrr:as "recordId" ;
  ]
.

What would the resulting iteration look like?

Would it be A

[
  {
    "recordId": "1",
    "make": "Mercedes"
  },
  {
    "recordId": "1",
    "make": "Honda"
  },
  {
    "recordId": "2",
    "make": "Mercedes"
  },
  {
    "recordId": "2",
    "make": "Toyota"
  }
]

Or B?

[
  {
    "recordId": [
      "1",
      "2"
    ],
    "make": "Mercedes"
  },
  {
    "recordId": [
      "1",
      "2"
    ],
    "make": "Honda"
  },
  {
    "recordId": [
      "1",
      "2"
    ],
    "make": "Mercedes"
  },
  {
    "recordId": [
      "1",
      "2"
    ],
    "make": "Toyota"
  }
]

I can see how B can be implemented generically and trivially; however, I don't believe B is as useful as A. Actually, I struggle to think of a use case for B.

A, however, is in my opinion not trivial to implement. How do I know which value of the push down reference should be merged with which value of the iterator?

andimou commented 3 years ago

I think the discussion needs to focus on the concept of "iteration model". In R2RML, the iteration model is very simple: relational table lines. This is the same for any kind of tabular data. So I can't really think of a case where R2RML would require such a feature of accessing data "above" or "outside" the iteration.

I think this can occur in the case of tabular data too, e.g. when you want to refer to the line 'above' in a CSV.
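
One possible reading of that tabular case, as a small Python sketch: a made-up CSV in which an empty "country" cell means "same as the line above". The data, the column names and the helper fill_from_previous are all invented for illustration and are not taken from any mapping language.

```python
import csv
import io

# Hypothetical input: an empty country cell inherits the value of the previous line.
CSV_TEXT = """country,city
France,Paris
,Lyon
Spain,Madrid
,Sevilla
"""

def fill_from_previous(text, column):
    """Yield rows, replacing an empty `column` with the value from the line above."""
    previous = None
    for row in csv.DictReader(io.StringIO(text)):
        if not row[column] and previous is not None:
            row[column] = previous[column]
        previous = row  # remember the (possibly completed) row for the next line
        yield row

rows = list(fill_from_previous(CSV_TEXT, "country"))
print([(r["country"], r["city"]) for r in rows])
```

Here the iteration is still row by row, but producing a term for one row needs data that lives outside that row.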

frmichel commented 3 years ago

Hi @pmaria,

The case you describe is interesting because it involves two iterations, not just one: one iteration on records, and one on cars of records.

The way the pushDown feature works is quite simple: at each iteration, it evaluates the xrr:reference expression and creates additional fields with the result of this evaluation. So in your example:

  xrr:pushDown [
    xrr:reference "records.[*].id" ;
    xrr:as "recordId" ;
  ]

records.[*].id will evaluate to all the ids in records, and there are two of them. Hence the result will be case B. And I agree, this one seems pretty useless because it mixes up everything, which is precisely what we usually want to avoid.
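
As a rough illustration of why this yields case B (a Python sketch, not the xR2RML implementation): if the pushed-down reference is evaluated against the whole document at each iteration, every car receives the full list of ids. The function name push_down_from_root and the hand-written traversal standing in for the JSONPath expressions are assumptions.

```python
doc = {
    "records": [
        {"id": "1", "enteredBy": "Alice",
         "cars": [{"make": "Mercedes"}, {"make": "Honda"}]},
        {"id": "2", "enteredBy": "Bob",
         "cars": [{"make": "Mercedes"}, {"make": "Toyota"}]},
    ]
}

def push_down_from_root(document):
    """Iterate over all cars; attach the result of records.[*].id to each one."""
    # Stand-in for evaluating the reference "records.[*].id" on the document.
    all_ids = [record["id"] for record in document["records"]]
    rows = []
    # Stand-in for the iterator "records.[*].cars.*".
    for record in document["records"]:
        for car in record["cars"]:
            # The reference does not know which record the car came from,
            # so the same list of ids is attached to every car.
            rows.append({"recordId": list(all_ids), "make": car["make"]})
    return rows

case_b = push_down_from_root(doc)
print(case_b[0])
```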

But what you want to do here is to push the id field two iteration levels down. In xR2RML you can do that with an iterator and a nested term map. First:

[] 
  rml:iterator "$.records.*" ;

That basically iterates on each individual record:

{
  "id": "1",
  "enteredBy": "Alice",
  "cars": [
    { "make": "Mercedes" },
    { "make": "Honda" }
]}
{
  "id": "2",
  "enteredBy": "Bob",
  "cars": [
    { "make": "Mercedes" },
    { "make": "Toyota" }
]}

Then, in your predicate-object maps you can iterate on cars:

rr:predicateObjectMap [
    rr:predicate ...
    rr:objectMap [
        xrr:reference "$.cars.*";
        xrr:pushDown [
            xrr:reference "$.id" ;
            xrr:as "recordId" ;
        ];
        xrr:nestedTermMap [
            rr:template "{$.recordId} {$.make}";
            rr:termType rr:Literal
        ];
    ];    
];

This takes you to case A and creates the object terms "1 Mercedes", "1 Honda", "2 Mercedes" and "2 Toyota".
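
The same two-level logic can be sketched in plain Python (again only an illustration, not how xR2RML is implemented): the outer loop plays the role of the rml:iterator on records, and the inner loop plays the role of the nested term map on cars, with the id copied down before the inner iteration starts. The function name nested_push_down is an assumption.

```python
doc = {
    "records": [
        {"id": "1", "enteredBy": "Alice",
         "cars": [{"make": "Mercedes"}, {"make": "Honda"}]},
        {"id": "2", "enteredBy": "Bob",
         "cars": [{"make": "Mercedes"}, {"make": "Toyota"}]},
    ]
}

def nested_push_down(document):
    """Outer iteration on records, inner iteration on cars with the id pushed down."""
    terms = []
    for record in document["records"]:       # rml:iterator "$.records.*"
        pushed = {"recordId": record["id"]}  # xrr:pushDown: "$.id" as "recordId"
        for car in record["cars"]:           # xrr:reference "$.cars.*"
            row = {**pushed, **car}          # car fields plus the pushed-down field
            # rr:template "{$.recordId} {$.make}"
            terms.append(f"{row['recordId']} {row['make']}")
    return terms

terms = nested_push_down(doc)
print(terms)
```

Because the reference "$.id" is evaluated relative to the current record rather than the document root, each car is paired with exactly one id.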

pmaria commented 3 years ago

@frmichel, Ah ✔️ , of course.