kg-construct / rml-lv

Specification repository for logical views in RML.
https://kg-construct.github.io/rml-lv/dev.html

index in case of (nested) csv #20

Closed elsdvlee closed 3 months ago

elsdvlee commented 8 months ago
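How should we calculate the index in the case of CSV? Until now I assumed that we keep the index of the row:

| # | name.# | name   | birthyear.# | birthyear |
| - | ------ | ------ | ----------- | --------- |
| 0 | 0      | alice  | 0           | 1995      |
| 1 | 1      | bob    | 1           | 1999      |
| 2 | 2      | tobias | 2           | 2005      |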

However, is this in line with how we calculate the index for a JSON file? There the index is 0 if there is only one value for the logical iteration. In the case of CSV there is by default only one value per logical iteration, so the index of all fields should be 0, except for the overall index #.

| # | name.# | name   | birthyear.# | birthyear |
| - | ------ | ------ | ----------- | --------- |
| 0 | 0      | alice  | 0           | 1995      |
| 1 | 0      | bob    | 0           | 1999      |
| 2 | 0      | tobias | 0           | 2005      |

And what to do with a nested csv column? In that case it makes more sense to maintain the index of the row, otherwise we cannot reconstruct the order.

{
  "people": [
    {
      "name": "alice",
      "hobbies": "id;type\n1;volleybal\n2;basketball\n3;horses"
    },
    {
      "name": "bob",
      "hobbies": "id;type\n1;football"
    }
  ]
}
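
| # | name  | name.# | hobbies | hobbies.# | hobbies.id | hobbies.id.# | hobbies.type | hobbies.type.# |
| - | ----- | ------ | ------- | --------- | ---------- | ------------ | ------------ | -------------- |
| 0 | alice | 0      | (...)   | 0         | 1          | 0            | volleybal    | 0              |
| 0 | alice | 0      | (...)   | 0         | 2          | 1            | basketball   | 1              |
| 0 | alice | 0      | (...)   | 0         | 3          | 2            | horses       | 2              |
| 1 | bob   | 0      | (...)   | 0         | 1          | 0            | football     | 0              |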

Should we define the algorithm to calculate the index for a field per reference formulation?

bjdmeest commented 8 months ago

Actually, I'd expect this to be the index of the iteration / reference result, not of the position in the original file. That is in line with how you're representing it right now. However, if you have a more complex JSONPath that, e.g., only takes the elements of an array that have a specific key, and that yields only 4 elements out of an array of 200, then those 4 elements will have indexes 0/1/2/3, not 24/56/99/189. Similarly, if you have a reference that returns only a single element (which, in JSONPath, is the case for all elements that don't involve arrays), then all those indexes will be 0.

Spec-wise, you can then just specify this in the LogicalViews spec (indexes are incremented for each iteration and for each reference result) without needing to specify it separately per reference formulation.
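
A minimal sketch of this rule, assuming a plain Python list comprehension as a stand-in for a filtering JSONPath expression. The helper `index_results` and the `flag` key are hypothetical names introduced only for illustration; the point is that the index is the position in the reference result, not the position in the source array.

```python
def index_results(results):
    """Pair each reference result with its position in the result list."""
    return list(enumerate(results))


# Hypothetical data: an array of 200 elements, 4 of which carry the key "flag".
source = [{"id": i} for i in range(200)]
for pos in (24, 56, 99, 189):
    source[pos]["flag"] = True

# Stand-in for a JSONPath expression that keeps only elements with that key.
selected = [item for item in source if "flag" in item]

indexed = index_results(selected)
print([(idx, item["id"]) for idx, item in indexed])
# → [(0, 24), (1, 56), (2, 99), (3, 189)]
```

A reference that returns a single value goes through the same helper and always gets index 0.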

pmaria commented 8 months ago

Agree with Ben. The index is based on reference results and is independent of a specific reference formulation.

> otherwise we cannot reconstruct the order.

@elsdvlee what did you mean by this?

elsdvlee commented 8 months ago

> otherwise we cannot reconstruct the order.
>
> @elsdvlee what did you mean by this?

I mean support for collections and containers: reconstructing a list in its original order.

elsdvlee commented 8 months ago
So, if I understand you correctly, in case of csv the correct indexes are:

| # | name.# | name   | birthyear.# | birthyear |
| - | ------ | ------ | ----------- | --------- |
| 0 | 0      | alice  | 0           | 1995      |
| 1 | 1      | bob    | 1           | 1999      |
| 2 | 2      | tobias | 2           | 2005      |

In case of json, you restart counting from 0 as a result of the json path expression used as reference.

bjdmeest commented 8 months ago

No, for CSV the correct indexes are, I think:

| # | name.# | name   | birthyear.# | birthyear |
| - | ------ | ------ | ----------- | --------- |
| 0 | 0      | alice  | 0           | 1995      |
| 1 | 0      | bob    | 0           | 1999      |
| 2 | 0      | tobias | 0           | 2005      |

Because each time it is the 'first' name/birthyear of that row.

elsdvlee commented 8 months ago

That is exactly my point. In case of csv inside json (mixed format) we skip one step: we make additional iterations and from those iterations we take the reference, so should we then have 0 as index here as well? In that case we cannot know the order anymore.

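I am missing this step:

| # | name  | name.# | hobbies | hobbies.# | hobby_row | hobby_row.# | hobby_row.id | hobby_row.id.# | hobby_row.type | hobby_row.type.# |
| - | ----- | ------ | ------- | --------- | --------- | ----------- | ------------ | -------------- | -------------- | ---------------- |
| 0 | alice | 0      | (...)   | 0         | (...)     | 0           | 1            | 0              | volleybal      | 0                |
| 0 | alice | 0      | (...)   | 0         | (...)     | 1           | 2            | 0              | basketball     | 0                |
| 0 | alice | 0      | (...)   | 0         | (...)     | 2           | 3            | 0              | horses         | 0                |
| 1 | bob   | 0      | (...)   | 0         | (...)     | 0           | 1            | 0              | football       | 0                |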

Instead of this:

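| # | name  | name.# | hobbies | hobbies.# | hobbies.id | hobbies.id.# | hobbies.type | hobbies.type.# |
| - | ----- | ------ | ------- | --------- | ---------- | ------------ | ------------ | -------------- |
| 0 | alice | 0      | (...)   | 0         | 1          | 0            | volleybal    | 0              |
| 0 | alice | 0      | (...)   | 0         | 2          | 0            | basketball   | 0              |
| 0 | alice | 0      | (...)   | 0         | 3          | 0            | horses       | 0              |
| 1 | bob   | 0      | (...)   | 0         | 1          | 0            | football     | 0              |

bjdmeest commented 8 months ago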

I would expect hobbies.# to increment then, because hobbies doesn't result in a single field but in multiple iterations.

pmaria commented 8 months ago

@elsdvlee It is hard to follow your example without the field definitions.

In general, this is how it should work:

A field's index is coupled to its parent. When you create a new child field for a "parent" the index for the child field starts at 0, and will increment for each item in the expression result value of that child field's expression.
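
A minimal sketch of this parent/child rule, under stated assumptions: the `people` data, the `nicknames` key, and the helper `field_values` are hypothetical, and plain Python lambdas stand in for reference expressions. It only illustrates that a child field's index restarts at 0 for every parent and increments per item of the expression result.

```python
# Hypothetical source data: one multi-valued field per person.
people = [
    {"name": "alice", "nicknames": ["al", "ali"]},
    {"name": "bob", "nicknames": ["bobby"]},
]


def field_values(parent, expression):
    """Evaluate a child field against one parent; the index restarts at 0 per parent."""
    return list(enumerate(expression(parent)))


rows = []
for parent_index, person in enumerate(people):  # "#" of the logical iteration
    for name_index, name in field_values(person, lambda p: [p["name"]]):
        for nick_index, nick in field_values(person, lambda p: p["nicknames"]):
            rows.append((parent_index, name, name_index, nick, nick_index))

for row in rows:
    print(row)
# → (0, 'alice', 0, 'al', 0)
# → (0, 'alice', 0, 'ali', 1)
# → (1, 'bob', 0, 'bobby', 0)
```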

elsdvlee commented 8 months ago

@pmaria here is the corresponding rml mapping:

@prefix rml: <http://w3id.org/rml/> .
@prefix : <http://example.org/> .

:jsonSource a rml:LogicalSource ;
  rml:source "./test_cases/json_csv_data.json";
  rml:referenceFormulation rml:JSONPath ;
  rml:iterator "$.people[*]" .

:jsonView a rml:LogicalView ;
  rml:onLogicalSource :jsonSource ;
  rml:field [
    rml:fieldName "name" ;
    rml:reference "$.name" ;
  ] ;
  rml:field [
    rml:fieldName "hobbies" ;
    rml:reference "$.hobbies" ;
    rml:referenceFormulation rml:CSV;
    rml:field [
      rml:fieldName "type" ;
      rml:reference "type" ;
    ] ;
    rml:field [
      rml:fieldName "id" ;
      rml:reference "id" ;
    ] ;
  ] .

and the source data:

{
  "people": [
    {
      "name": "alice",
      "hobbies": "id;type\n1;volleybal\n2;basketball\n3;horses"
    },
    {
      "name": "bob",
      "hobbies": "id;type\n1;football"
    }
  ]
}

pmaria commented 8 months ago

Thanks @elsdvlee, I now see your point.

When we switch reference formulation the iteration context also changes.

I think we need to explore this a bit further, but I can imagine that when switching reference formulation we could also require(?) specifying an iterator, or use the default iterator for the reference formulation.

So then you'd get:

@prefix rml: <http://w3id.org/rml/> .
@prefix : <http://example.org/> .

:jsonSource a rml:LogicalSource ;
  rml:source "./test_cases/json_csv_data.json";
  rml:referenceFormulation rml:JSONPath ;
  rml:iterator "$.people[*]" .

:jsonView a rml:LogicalView ;
  rml:onLogicalSource :jsonSource ;
  rml:field [
    rml:fieldName "name" ;
    rml:reference "$.name" ;
  ] ;
  rml:field [
    rml:fieldName "hobbies" ;
    rml:reference "$.hobbies" ;
    rml:referenceFormulation rml:CSV;

    # implicit iterator is CSV row

    rml:field [
      rml:fieldName "type" ;
      rml:reference "type" ;
    ] ;
    rml:field [
      rml:fieldName "id" ;
      rml:reference "id" ;
    ] ;
  ] .

{
  "people": [
    {
      "name": "alice",
      "hobbies": "id;type\n1;volleybal\n2;basketball\n3;horses"
    },
    {
      "name": "bob",
      "hobbies": "id;type\n1;football"
    }
  ]
}
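
| # | name  | name.# | hobbies      | hobbies.# | hobbies.id | hobbies.id.# | hobbies.type | hobbies.type.# |
| - | ----- | ------ | ------------ | --------- | ---------- | ------------ | ------------ | -------------- |
| 0 | alice | 0      | 1;volleybal  | 0         | 1          | 0            | volleybal    | 0              |
| 0 | alice | 0      | 2;basketball | 1         | 2          | 0            | basketball   | 0              |
| 0 | alice | 0      | 3;horses     | 2         | 3          | 0            | horses       | 0              |
| 1 | bob   | 0      | 1;football   | 0         | 1          | 0            | football     | 0              |

bjdmeest commented 7 months ago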

Agree with the latest comment of Pano: include iterator every time you change reference formulations (or use the default if there is one).

elsdvlee commented 7 months ago

I have done some tests, and I think that an obligatory iterator (implicit in the case of tabular data) whenever the reference formulation changes does indeed solve the issue. I think we need to mention in the spec that the expression map is resolved first, and only after that the reference formulation + iterator are applied. So in the above case, first resolve rml:reference "$.hobbies" and then split the result into rows.
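
A minimal sketch of that two-step order, assuming Python's standard `csv` module as a stand-in for the CSV reference formulation and plain dict access as a stand-in for the JSONPath reference. The helper `csv_rows` is a hypothetical name, and the `;` delimiter is taken from the sample data, not from any spec.

```python
import csv
import io

# The sample data from this thread, already iterated with "$.people[*]".
people = [
    {"name": "alice", "hobbies": "id;type\n1;volleybal\n2;basketball\n3;horses"},
    {"name": "bob", "hobbies": "id;type\n1;football"},
]


def csv_rows(text, delimiter=";"):
    """Iterate a CSV string row by row, as an implicit CSV row iterator would."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows = []
for person_index, person in enumerate(people):
    # Step 1: resolve the reference ("$.hobbies") to a single string value.
    hobbies_text = person["hobbies"]
    # Step 2: apply the new reference formulation and its iterator to that value.
    for row_index, row in enumerate(csv_rows(hobbies_text)):
        rows.append({
            "#": person_index,
            "name": person["name"], "name.#": 0,
            "hobbies.#": row_index,  # increments per CSV row
            "hobbies.id": row["id"], "hobbies.id.#": 0,
            "hobbies.type": row["type"], "hobbies.type.#": 0,
        })

print([(r["#"], r["hobbies.#"], r["hobbies.type"]) for r in rows])
# → [(0, 0, 'volleybal'), (0, 1, 'basketball'), (0, 2, 'horses'), (1, 0, 'football')]
```

Because the row iterator carries the index, the order of the hobbies can be reconstructed from hobbies.# even though every column index is 0.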

Here is an example in the opposite direction, JSON data in a CSV file.

mapping.ttl:

@prefix rml: <http://w3id.org/rml/> .
@prefix : <http://example.org/> .

:csvSource a rml:LogicalSource ;
  rml:source "./test_cases/POCLV0004b/csv_json_data.csv";
  rml:referenceFormulation rml:CSV .

:csvView a rml:LogicalView ;
  rml:onLogicalSource :csvSource ;
  rml:field [
    rml:fieldName "name" ;
    rml:reference "name" ;
  ] ;
  rml:field [
    rml:fieldName "parents" ;
    rml:reference "parents" ;
    rml:referenceFormulation rml:JSONPath ;
    rml:iterator "$";
    rml:field [
      rml:fieldName "mother" ;
      rml:reference "$.mother" ;
    ] ;
  ] ;
  rml:field [
    rml:fieldName "items" ;
    rml:reference "items" ;
    rml:referenceFormulation rml:JSONPath ;
    rml:iterator "$[*]" ;
    rml:field [
      rml:fieldName "type" ;
      rml:reference "$.type" ;
    ] ;
  ] .

Source data:

name,parents,items
alice,"{""mother"":""maria"",""father"":""joseph""}","[{""type"":""sword"",""weight"":1500},{""type"":""shield"",""weight"":2500}]"
bob,"{""mother"":""anna""}","[{""type"":""flower"",""weight"":15}]"
tobias,"{}","[]"

Result:

| # | name   | name.# | parents                              | parents.# | parents.mother | parents.mother.# | items                              | items.# | items.type | items.type.# |
| - | ------ | ------ | ------------------------------------ | --------- | -------------- | ---------------- | ---------------------------------- | ------- | ---------- | ------------ |
| 0 | alice  | 0      | {"mother":"maria","father":"joseph"} | 0         | maria          | 0                | {"type": "sword", "weight": 1500}  | 0       | sword      | 0            |
| 0 | alice  | 0      | {"mother":"maria","father":"joseph"} | 0         | maria          | 0                | {"type": "shield", "weight": 2500} | 1       | shield     | 0            |
| 1 | bob    | 0      | {"mother":"anna"}                    | 0         | anna           | 0                | {"type": "flower", "weight": 15}   | 0       | flower     | 0            |
| 2 | tobias | 0      | {}                                   | 0         |                |                  |                                    |         |            |              |
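
A rough sketch that reproduces the index columns of the table above, assuming Python's `csv` and `json` modules as stand-ins for the two reference formulations. Keeping the record for tobias when `items` yields no iterations mirrors the table; it is an assumption of this sketch, not a rule taken from the spec.

```python
import csv
import io
import json

SOURCE = '''name,parents,items
alice,"{""mother"":""maria"",""father"":""joseph""}","[{""type"":""sword"",""weight"":1500},{""type"":""shield"",""weight"":2500}]"
bob,"{""mother"":""anna""}","[{""type"":""flower"",""weight"":15}]"
tobias,"{}","[]"
'''

rows = []
for index, record in enumerate(csv.DictReader(io.StringIO(SOURCE))):
    parents = json.loads(record["parents"])  # iterator "$": a single iteration
    items = json.loads(record["items"])      # iterator "$[*]": one per array element
    # Assumption: keep the record with empty cells when "items" yields nothing.
    for item_index, item in (list(enumerate(items)) or [(None, None)]):
        rows.append({
            "#": index,
            "name": record["name"],
            "parents.#": 0,
            "parents.mother": parents.get("mother"),
            "items.#": item_index,
            "items.type": item["type"] if item is not None else None,
        })

for row in rows:
    print(row["#"], row["name"], row["parents.mother"], row["items.#"], row["items.type"])
```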

pmaria commented 7 months ago

Agreed in the CG meeting of 03/20/2024 that this approach is a good solution.