elsdvlee closed this issue 3 months ago.
Actually, I'd expect this to be the index of the iteration/reference result, not of the original file. So that's in line with how you're representing it right now. However, if you have a more complex JSONPath that e.g. only takes elements in an array that have a specific key, and that matches only 4 elements in an array of 200, then those 4 elements will have indexes 0/1/2/3, not 24/56/99/189. Similarly, if you have a reference that returns only a single element (which, in JSONPath, is the case for all elements that aren't arrays), then all those indexes will be 0.
Spec-wise, you can then just specify this in the LogicalViews spec (indexes are incremented for each iteration and for each reference result) without needing to specify it per reference formulation.
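A minimal Python sketch of that indexing rule (the `index_reference_results` helper is hypothetical, and it assumes the reference expression has already been evaluated to a list of results):

```python
# Indexes are assigned per reference result, not per position in the
# original document: enumerate the result list, starting from 0.

def index_reference_results(results):
    """Pair each reference result with its iteration index."""
    return list(enumerate(results))

# A filter that matched only 4 of 200 array elements still yields
# indexes 0..3, regardless of the elements' original positions
# (e.g. elements originally at positions 24/56/99/189).
matched = ["a", "b", "c", "d"]
print(index_reference_results(matched))   # [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

# A reference returning a single value always gets index 0.
print(index_reference_results(["only"]))  # [(0, 'only')]
```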
Agree with Ben. The index is based on reference results and is independent of a specific reference formulation.
> otherwise we cannot reconstruct the order.

@elsdvlee what did you mean by this?

> > otherwise we cannot reconstruct the order.
>
> @elsdvlee what did you mean by this?

I mean the support for collections and containers: reconstructing a list in the original order.
So, if I understand you correctly, in the case of CSV the correct indexes are:

| # | name.# | name | birthyear.# | birthyear |
|---|--------|------|-------------|-----------|
| 0 | 0 | alice | 0 | 1995 |
| 1 | 1 | bob | 1 | 1999 |
| 2 | 2 | tobias | 2 | 2005 |

In the case of JSON, you restart counting from 0 as a result of the JSONPath expression used as the reference.
No, for CSV the correct indexes are, I think:

| # | name.# | name | birthyear.# | birthyear |
|---|--------|------|-------------|-----------|
| 0 | 0 | alice | 0 | 1995 |
| 1 | 0 | bob | 0 | 1999 |
| 2 | 0 | tobias | 0 | 2005 |

Because each time it's the 'first' name/birthyear of that row.
That is exactly my point. In the case of CSV inside JSON (mixed format) we skip one step: we make additional iterations, and from those iterations we take the reference. Should we then have 0 as the index here as well? In that case we cannot know the order anymore.
I'm missing this step:

| # | name | name.# | hobbies | hobbies.# | hobby_row | hobby_row.# | hobby_row.id | hobby_row.id.# | hobby_row.type | hobby_row.type.# |
|---|------|--------|---------|-----------|-----------|-------------|--------------|----------------|----------------|------------------|
| 0 | alice | 0 | (...) | 0 | (…) | 0 | 1 | 0 | volleybal | 0 |
| 0 | alice | 0 | (...) | 0 | (…) | 1 | 2 | 0 | basketball | 0 |
| 0 | alice | 0 | (...) | 0 | (…) | 2 | 3 | 0 | horses | 0 |
| 1 | bob | 0 | (...) | 0 | (…) | 0 | 1 | 0 | football | 0 |
Instead of this:
| # | name | name.# | hobbies | hobbies.# | hobbies.id | hobbies.id.# | hobbies.type | hobbies.type.# |
|---|------|--------|---------|-----------|------------|--------------|--------------|----------------|
| 0 | alice | 0 | (...) | 0 | 1 | 0 | volleybal | 0 |
| 0 | alice | 0 | (...) | 0 | 2 | 0 | basketball | 0 |
| 0 | alice | 0 | (...) | 0 | 3 | 0 | horses | 0 |
| 1 | bob | 0 | (...) | 0 | 1 | 0 | football | 0 |
I would expect that hobbies.# would then increment, because hobbies doesn't result in a single field, but in multiple iterations.
@elsdvlee It is hard to follow your example without the field definitions.
In general, this is how it should work:
A field's index is coupled to its parent. When you create a new child field for a parent, the index for the child field starts at 0 and increments for each item in the result of that child field's expression.
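A rough Python sketch of that rule (the `child_expr` evaluator is a hypothetical stand-in for evaluating the child field's reference expression against a parent record):

```python
def child_indexes(parent_items, child_expr):
    """For each parent item, evaluate the child expression and index its
    results starting from 0: the child index is coupled to the parent."""
    rows = []
    for parent_idx, item in enumerate(parent_items):
        # child_expr is assumed to return the list of result values
        # for this parent record.
        for child_idx, value in enumerate(child_expr(item)):
            rows.append((parent_idx, child_idx, value))
    return rows

people = [{"name": "alice"}, {"name": "bob"}]
# "name" yields a single value per parent record, so its index is always 0.
print(child_indexes(people, lambda p: [p["name"]]))
# [(0, 0, 'alice'), (1, 0, 'bob')]
```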
@pmaria here is the corresponding rml mapping:
```ttl
@prefix rml: <http://w3id.org/rml/> .
@prefix : <http://example.org/> .

:jsonSource a rml:LogicalSource ;
  rml:source "./test_cases/json_csv_data.json" ;
  rml:referenceFormulation rml:JSONPath ;
  rml:iterator "$.people[*]" .

:jsonView a rml:LogicalView ;
  rml:onLogicalSource :jsonSource ;
  rml:field [
    rml:fieldName "name" ;
    rml:reference "$.name" ;
  ] ;
  rml:field [
    rml:fieldName "hobbies" ;
    rml:reference "$.hobbies" ;
    rml:referenceFormulation rml:CSV ;
    rml:field [
      rml:fieldName "type" ;
      rml:reference "type" ;
    ] ;
    rml:field [
      rml:fieldName "id" ;
      rml:reference "id" ;
    ] ;
  ] .
```
and the source data:
```json
{
  "people": [
    {
      "name": "alice",
      "hobbies": "id;type\n1;volleybal\n2;basketball\n3;horses"
    },
    {
      "name": "bob",
      "hobbies": "id;type\n1;football"
    }
  ]
}
```
Thanks @elsdvlee, I now see your point.
When we switch reference formulation, the iteration context also changes.
I think we need to explore this a bit further, but I can imagine that when switching reference formulation we could also require(?) specifying an iterator, or use the default for the reference formulation.
So then you'd get:
```ttl
@prefix rml: <http://w3id.org/rml/> .
@prefix : <http://example.org/> .

:jsonSource a rml:LogicalSource ;
  rml:source "./test_cases/json_csv_data.json" ;
  rml:referenceFormulation rml:JSONPath ;
  rml:iterator "$.people[*]" .

:jsonView a rml:LogicalView ;
  rml:onLogicalSource :jsonSource ;
  rml:field [
    rml:fieldName "name" ;
    rml:reference "$.name" ;
  ] ;
  rml:field [
    rml:fieldName "hobbies" ;
    rml:reference "$.hobbies" ;
    rml:referenceFormulation rml:CSV ;
    # implicit iterator is CSV row
    rml:field [
      rml:fieldName "type" ;
      rml:reference "type" ;
    ] ;
    rml:field [
      rml:fieldName "id" ;
      rml:reference "id" ;
    ] ;
  ] .
```
```json
{
  "people": [
    {
      "name": "alice",
      "hobbies": "id;type\n1;volleybal\n2;basketball\n3;horses"
    },
    {
      "name": "bob",
      "hobbies": "id;type\n1;football"
    }
  ]
}
```
| # | name | name.# | hobbies | hobbies.# | hobbies.id | hobbies.id.# | hobbies.type | hobbies.type.# |
|---|------|--------|---------|-----------|------------|--------------|--------------|----------------|
| 0 | alice | 0 | 1;volleybal | 0 | 1 | 0 | volleybal | 0 |
| 0 | alice | 0 | 2;basketball | 1 | 2 | 0 | basketball | 0 |
| 0 | alice | 0 | 3;horses | 2 | 3 | 0 | horses | 0 |
| 1 | bob | 0 | 1;football | 0 | 1 | 0 | football | 0 |
Agree with the latest comment of Pano: include an iterator every time you change reference formulation (or use the default if there is one).
I have done some tests, and I think that indeed making the iterator obligatory (implicit in the case of tabular data) whenever the reference formulation changes seems to solve the issue. I think we need to mention in the spec that first the expression map is resolved, and after that the reference formulation + iterator are applied. So in the above case, first resolve rml:reference "$.hobbies" and then split the result into rows.
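As a sanity check, here is a small Python sketch (standard library only) of that two-step resolution for the JSON-with-embedded-CSV example above; the dict/loop machinery is mine, not spec text:

```python
import csv
import io

data = {
    "people": [
        {"name": "alice", "hobbies": "id;type\n1;volleybal\n2;basketball\n3;horses"},
        {"name": "bob", "hobbies": "id;type\n1;football"},
    ]
}

rows = []
for i, person in enumerate(data["people"]):            # rml:iterator "$.people[*]"
    hobbies_value = person["hobbies"]                  # step 1: resolve rml:reference "$.hobbies"
    # step 2: apply the CSV reference formulation: split into rows
    # (the implicit iterator is the CSV row).
    for j, rec in enumerate(csv.DictReader(io.StringIO(hobbies_value), delimiter=";")):
        rows.append({"#": i, "name": person["name"], "hobbies.#": j,
                     "hobbies.id": rec["id"], "hobbies.type": rec["type"]})

for r in rows:
    print(r)
```

hobbies.# then increments per CSV row (0, 1, 2 for alice; 0 for bob), matching the table above.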
Here is an example in the opposite direction, JSON data in a CSV file:
mapping.ttl
```ttl
@prefix rml: <http://w3id.org/rml/> .
@prefix : <http://example.org/> .

:csvSource a rml:LogicalSource ;
  rml:source "./test_cases/POCLV0004b/csv_json_data.csv" ;
  rml:referenceFormulation rml:CSV .

:csvView a rml:LogicalView ;
  rml:onLogicalSource :csvSource ;
  rml:field [
    rml:fieldName "name" ;
    rml:reference "name" ;
  ] ;
  rml:field [
    rml:fieldName "parents" ;
    rml:reference "parents" ;
    rml:referenceFormulation rml:JSONPath ;
    rml:iterator "$" ;
    rml:field [
      rml:fieldName "mother" ;
      rml:reference "$.mother" ;
    ] ;
  ] ;
  rml:field [
    rml:fieldName "items" ;
    rml:reference "items" ;
    rml:referenceFormulation rml:JSONPath ;
    rml:iterator "$[*]" ;
    rml:field [
      rml:fieldName "type" ;
      rml:reference "$.type" ;
    ] ;
  ] .
```
Source data:
```csv
name,parents,items
alice,"{""mother"":""maria"",""father"":""joseph""}","[{""type"":""sword"",""weight"":1500},{""type"":""shield"",""weight"":2500}]"
bob,"{""mother"":""anna""}","[{""type"":""flower"",""weight"":15}]"
tobias,"{}","[]"
```
Result:
| # | name | name.# | parents | parents.# | parents.mother | parents.mother.# | items | items.# | items.type | items.type.# |
| - | ------ | ------ | ------------------------------------ | --------- | -------------- | ---------------- | ---------------------------------- | ------- | ---------- | ------------ |
| 0 | alice | 0 | {"mother":"maria","father":"joseph"} | 0 | maria | 0 | {"type": "sword", "weight": 1500} | 0 | sword | 0 |
| 0 | alice | 0 | {"mother":"maria","father":"joseph"} | 0 | maria | 0 | {"type": "shield", "weight": 2500} | 1 | shield | 0 |
| 1 | bob | 0 | {"mother":"anna"} | 0 | anna | 0 | {"type": "flower", "weight": 15} | 0 | flower | 0 |
| 2 | tobias | 0 | {} | 0 | | | | | | |
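The same two-step resolution works in this direction too. A hedged Python sketch (standard library only; the tuple layout is mine, chosen to mirror a few columns of the result table):

```python
import csv
import io
import json

# CSV source where "parents" holds a JSON object and "items" a JSON array.
csv_text = (
    "name,parents,items\n"
    'alice,"{""mother"":""maria"",""father"":""joseph""}",'
    '"[{""type"":""sword"",""weight"":1500},{""type"":""shield"",""weight"":2500}]"\n'
    'bob,"{""mother"":""anna""}","[{""type"":""flower"",""weight"":15}]"\n'
    'tobias,"{}","[]"\n'
)

result = []
for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    parents = json.loads(row["parents"])   # iterator "$": a single result
    mother = parents.get("mother")         # parents.mother has index 0 when present
    items = json.loads(row["items"])       # iterator "$[*]": one result per element
    if not items:
        result.append((i, row["name"], mother, None, None))
    for j, item in enumerate(items):       # items.# increments per array element
        result.append((i, row["name"], mother, j, item["type"]))

for r in result:
    print(r)
```

This reproduces the #, name, parents.mother, items.# and items.type columns: alice's two items get items.# 0 and 1, and tobias, whose items array is empty, yields a row with empty item fields.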
Agreed in the CG meeting of 03/20/2024 that this approach is a good solution.
However, is this in line with how we calculated the index for the JSON file? There the index is 0 if there is only one value for the logical iteration. In the case of CSV there is, by default, only one value per logical iteration, so then the index of all fields should be 0, except for the overall index #.
And what to do with a nested CSV column? In that case it makes more sense to maintain the index of the row; otherwise we cannot reconstruct the order.
Should we define the algorithm to calculate the index for a field per reference formulation?