kg-construct / rml-io

RML-IO: Input/Output declarations for RML
https://w3id.org/rml/io/spec
Creative Commons Attribution 4.0 International

Paginated sources #8

Closed DylanVanAssche closed 1 year ago

DylanVanAssche commented 2 years ago

Paginated JSON/XML/... is used extensively in the wild for Web APIs. However, we don't support this in the current Logical Source.

See:

JBPressac commented 2 years ago

Thank you !

dachafra commented 2 years ago

Is this relevant for RML? Do we want to include it in the spec? Databases also support pagination (offset + limit), but we do not include that in the spec because, IMO, it relates more to the capabilities of the underlying source than to the actual access to the data.

DylanVanAssche commented 2 years ago

@dachafra Pagination may be 'easier' in relational databases, but in Web APIs this is a huge mess as everyone does this differently. If you want to map Web APIs, this is a big requirement to have some adoption.

Most of the time (in what I have seen, every time), the response contains a link to the next page or a next page number. Logical Sources should have some way of declaring how a next page URL can be generated. If a Logical Source knows that, the implementation can handle everything else.
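As a rough illustration of that idea, the sketch below follows a "next" pointer from page to page until none is left. Everything in it is hypothetical: `iter_pages`, the `items` and `next` keys, and the in-memory `FAKE_API` stand in for a real HTTP client and for whatever layout a real API uses.

```python
from typing import Callable, Dict, Iterator, Optional

# Hypothetical in-memory "API": each URL maps to a page holding records
# and an optional pointer to the next page.
FAKE_API: Dict[str, dict] = {
    "https://example.org/items?page=1": {
        "items": [1, 2],
        "next": "https://example.org/items?page=2",
    },
    "https://example.org/items?page=2": {"items": [3], "next": None},
}


def iter_pages(
    first_url: str,
    fetch: Callable[[str], dict],
    next_url: Callable[[dict], Optional[str]],
) -> Iterator[dict]:
    """Yield every page, starting at first_url, until next_url returns None."""
    url: Optional[str] = first_url
    seen = set()  # guard against APIs whose 'next' link loops back
    while url is not None and url not in seen:
        seen.add(url)
        page = fetch(url)
        yield page
        url = next_url(page)


records = [
    item
    for page in iter_pages(
        "https://example.org/items?page=1",
        fetch=FAKE_API.__getitem__,
        next_url=lambda page: page.get("next"),
    )
    for item in page["items"]
]
```

Only the `next_url` callback changes from one API to another, which is roughly the piece a declarative description would have to capture.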

IMO, this is an access problem, so it would be a description like WoT/D2RQ/CSVW/... that specifies the next page URL in some way. So this would end up in the Source, not in the Logical Source.

andimou commented 2 years ago

We need to distinguish between API pagination and JSON: pagination might be more relevant for the source description, whereas whether the data is JSON or another format is more a concern of the logical source.

Then the question would be: are there well-known (ideally W3C-recommended) vocabularies that describe pagination, which we could point to if we want to suggest options for describing paginated data sources?

We should not forget that in the RML spec we don't want to be complete with respect to all potential data source descriptions. We can conclude on recommending a few and provide the means for people to include any data source they like (in that case paginated data sources) but we should not aim to be exhaustive (I think)

JBPressac commented 2 years ago

Hello, there is a pagination example in the JSON:API specification: https://jsonapi.org/examples/#pagination and a profile to avoid many of the pitfalls of "offset–limit" pagination. But this may be unknown to some (most?) API programmers.
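For readers unfamiliar with it, JSON:API puts pagination links in a top-level `links` object (`first`, `last`, `prev`, `next`), where each link is either a URL string or an object with an `href` member, and a missing or `null` link means there is no such page. A minimal sketch of reading the `next` link follows; the function name and the sample document are made up for illustration.

```python
from typing import Optional


def jsonapi_next_link(document: dict) -> Optional[str]:
    """Return the 'next' pagination link of a JSON:API document, if any."""
    link = (document.get("links") or {}).get("next")
    if link is None:
        return None  # last page, or pagination links not provided
    if isinstance(link, dict):
        return link.get("href")  # link object form
    return link  # plain string form


# Hypothetical sample response, not taken from a real API.
page = {
    "data": [{"type": "articles", "id": "1"}],
    "links": {
        "self": "https://example.com/articles?page[number]=1",
        "next": "https://example.com/articles?page[number]=2",
        "last": "https://example.com/articles?page[number]=13",
    },
}
next_page = jsonapi_next_link(page)
```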

DylanVanAssche commented 2 years ago

> Then the question would be: are there well-known (ideally W3C-recommended) vocabularies that describe pagination, which we could point to if we want to suggest options for describing paginated data sources?

I'm not aware of any other approaches for this, if anyone knows any, please comment ;)

> We should not forget that in the RML spec we don't want to be complete with respect to all potential data source descriptions. We can conclude on recommending a few and provide the means for people to include any data source they like (in that case paginated data sources) but we should not aim to be exhaustive (I think)

Indeed, we cannot cover everything. However, paginated sources are really common and could be seen as a batch stream with instructions on how to get the next batch. If we have a proper description of how to get the next batch, everything else is covered.

bjdmeest commented 2 years ago

I think this can be framed more generally, i.e., as 'multi-source' sources (not sure what to call that). Things such as "map source files via a glob" are related imo (e.g., "the data source files are ./files/addresses_*.csv").

So I'm envisioning some kind of 'multi-source strategy' description. However, I think this needs to be part of the Logical Source, not the Data Access.
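To make the glob case concrete, here is a small sketch of what such a strategy would have to do: expand one pattern into the list of concrete sources. It matches against a hard-coded list of names with `fnmatch` so that it runs without touching the file system; `expand_glob` and the file names are invented for the example.

```python
from fnmatch import fnmatch
from typing import Iterable, List


def expand_glob(pattern: str, available: Iterable[str]) -> List[str]:
    """Return the names in `available` matching the shell-style pattern, sorted."""
    return sorted(name for name in available if fnmatch(name, pattern))


# Hypothetical directory listing.
files = ["addresses_1.csv", "addresses_2.csv", "people.csv", "addresses_1.json"]
sources = expand_glob("addresses_*.csv", files)
```

Note that the sort is lexicographic, so `addresses_10.csv` would come before `addresses_2.csv`; whether ordering matters at all is something a strategy description would need to state.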

If I can see the 'result' of rml:access as a bytestream returned from a session as described in rml:access, which is then specified as a stream that can be interpreted by the logical source (i.e., setting encoding and compression), then this actually fits within the logical source, but things get a bit hairy:

<#CSVSourceAccess> a csvw:Table;
  csvw:url "addresses-1.csv"; # do you need to specify the 'first' file in advance?
  csvw:dialect [ a csvw:Dialect;
    csvw:delimiter ";";
    csvw:encoding "UTF-8";
    csvw:header "1"^^xsd:boolean;
  ];
.

<#TriplesMap> a rr:TriplesMap;
  rml:logicalSource [ a rml:LogicalSource;
    rml:source <#CSVSourceAccess>;
    rml:multiSourceStrategy [
      a multiSource:GlobStrategy ;
      multiSource:pattern "addresses-*.csv"
    ]
  ];
.

This makes a weird interdependency between Data Source and LogicalSource, as the datasource(s) can be dynamically discovered. However, for pagination, this feels a bit more straight-forward:

<#WoTWebAPISource> a td:PropertyAffordance;
  td:hasForm [
    # URL and content type
    hctl:hasTarget "http://api.irail.be/stations?format=json";
    hctl:forContentType "application/json";
    # Read only
    hctl:hasOperationType td:readproperty;
    # Set HTTP method and headers
    htv:methodName "GET";
    htv:headers ([
      htv:fieldName "User-Agent";
      htv:fieldValue "RMLMapper";
    ]);
  ];
.

<#WoTWebAPI> a td:Thing;
  td:hasPropertyAffordance <#WoTWebAPISource>;
.

<#TriplesMap> a rr:TriplesMap;
  rml:logicalSource [ a rml:LogicalSource;
    rml:source <#WoTWebAPISource>;
    rml:referenceFormulation ql:JSONPath;
    rml:iterator "$.station[*]";
    rml:multiSourceStrategy [
      a multiSource:JSONAPI # Doesn't need additional config bc JSONAPI links are defined within the spec
    ]
  ];
.

In this case, you don't need to duplicate the URL within the rml:multiSourceStrategy because the 'next' data source is discovered from the previous ones. You also don't need to touch the WoT description, so the same pagination strategy can be reused whether you fetch data via a simple HTTP GET or some complex authenticated WoT request. This also shows that a dependency between Data Access (specifying the application/json content type) and Logical Source (specifying ql:JSONPath) already exists. So then it's not a big problem to support multi-sources like this (i.e., as part of the logical source description)?

Note to myself: some pagination strategies use HTTP headers to set next links etc. (so then it's not part of the actual response body), but putting it in the logical source still makes sense imo.
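One common form of header-based pagination is the HTTP `Link` header with `rel="next"` (RFC 8288). The sketch below pulls the next URL out of such a header value. It is a simplified parser written for this example, not a full RFC 8288 implementation: it assumes parameter values contain no commas or semicolons, and the URLs are invented.

```python
import re
from typing import Optional

# Matches one link-value such as: <https://example.org/p2>; rel="next"
_LINK_VALUE = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')


def next_from_link_header(header: str) -> Optional[str]:
    """Return the target of the first link whose rel includes 'next'."""
    for target, params in _LINK_VALUE.findall(header):
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            # rel may list several space-separated relation types
            rels = value.strip().strip('"').lower().split()
            if name.strip().lower() == "rel" and "next" in rels:
                return target
    return None


# Hypothetical header value.
header = (
    '<https://api.example.org/items?page=2>; rel="next", '
    '<https://api.example.org/items?page=5>; rel="last"'
)
next_url = next_from_link_header(header)
```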

Concrete proposal: create an extension point in the RML source/target spec (something like rml:multiSourceStrategy), so that custom extensions can be made, similar to rml:source.
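On the implementation side, such an extension point could look roughly like a registry keyed by strategy IRI, so that an engine can look up a handler for whatever strategy a mapping declares. This is only a sketch of the idea: the IRIs, the `X-Next-Page-Url` header, and all function names are invented here and are not defined by any RML specification.

```python
from typing import Callable, Dict, Optional

# A strategy takes the previous response (parsed body plus headers) and
# returns the URL of the next source, or None when there is nothing left.
NextSource = Callable[[dict, Dict[str, str]], Optional[str]]

# Registry of strategies keyed by (hypothetical) strategy IRI.
STRATEGIES: Dict[str, NextSource] = {}


def register(iri: str) -> Callable[[NextSource], NextSource]:
    """Decorator that registers a strategy function under the given IRI."""
    def decorator(func: NextSource) -> NextSource:
        STRATEGIES[iri] = func
        return func
    return decorator


@register("http://example.org/multiSource#JSONAPI")
def jsonapi(body: dict, headers: Dict[str, str]) -> Optional[str]:
    # Next link taken from the response body.
    return (body.get("links") or {}).get("next")


@register("http://example.org/multiSource#NextHeader")
def next_header(body: dict, headers: Dict[str, str]) -> Optional[str]:
    # Next link taken from a made-up response header.
    return headers.get("X-Next-Page-Url")


strategy = STRATEGIES["http://example.org/multiSource#JSONAPI"]
following = strategy({"links": {"next": "https://example.org/p2"}}, {})
```

Because every strategy sees both the body and the headers, the same lookup works whether the next link lives in the payload or in the HTTP response metadata.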

DylanVanAssche commented 1 year ago

I gave this some thought and also had some discussions during meetings about pagination. The idea would be to leave this a bit open by allowing any kind of rml:access description. This way, you can use WoT, CSVW, DCAT, SD, etc. for rml:access but also use your custom ontology. The latter could then cover these special cases.

What do you think?

DylanVanAssche commented 1 year ago

This will be solved together with #18

DylanVanAssche commented 1 year ago

Implemented in f773d4b