openbudgets / pipeline-fragments

Reusable fragments of LinkedPipes ETL pipelines

FDP2RDF: Support providing `datapackage.json` as a URL and not verbatim #5

Closed akariv closed 7 years ago

akariv commented 8 years ago

The DataPackage specification (on which the Fiscal Data Package is based) specifies that data packages are to be referred to by URL. The reason is that a data package consists of its resources as well as the descriptor file (datapackage.json). When you point to the descriptor or the root of the package, you also provide information about the rest of the package's contents.

This means that references from datapackage.json to its resources usually use relative URLs rather than fully qualified ones.

The current implementation requires the JSON file to be uploaded as POST data, which does not convey the origin of the file. This goes against the principles of the DataPackage specification and makes it impossible to locate the resources when their paths are relative.
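To illustrate why the descriptor's origin matters, here is a minimal Python sketch of resolving resource paths against the descriptor URL. The URL, the package name, and the resource paths are made-up placeholders, and `resolve_resources` is a hypothetical helper, not part of the pipeline.

```python
from urllib.parse import urljoin

# Hypothetical descriptor location and contents, for illustration only.
descriptor_url = "http://some.where/budgets/datapackage.json"
descriptor = {
    "name": "example-budget",
    "resources": [
        {"path": "data/budget.csv"},                 # relative to the descriptor
        {"path": "http://other.host/absolute.csv"},  # already fully qualified
    ],
}


def resolve_resources(descriptor_url, descriptor):
    """Return an absolute URL for every resource path in the descriptor."""
    return [urljoin(descriptor_url, r["path"]) for r in descriptor["resources"]]


resolved = resolve_resources(descriptor_url, descriptor)
print(resolved)
```

Without `descriptor_url`, the first path cannot be turned into a fetchable URL, which is the problem described above.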

HimmelStein commented 8 years ago

@akariv Do you mean that in the curl command we should use the URL of datapackage.jsonld instead of the datapackage.jsonld file itself, and that the value of 'path' should be relative to the URL of datapackage.jsonld? If so, datapackage.jsonld and the CSV files should be located in the same place.

akariv commented 8 years ago

@HimmelStein yes, that is correct.

The CSV files and datapackage.jsonld are usually stored in the same location, as they are part of the same data package.

marek-dudas commented 8 years ago

The current state was agreed on before. Everything can be changed, but I think that LinkedPipes supports only file input in POST and not GET parameters. That is, you could POST the URL of the descriptor.json file in a simple plain-text or RDF file; that could be done immediately. I will have to check with Jakub whether parameters can be included directly in the GET/POST request.

jindrichmynarz commented 8 years ago

I think you can POST the datapackage.json's URL, convert it to an RDF triple using t-filesToStatements, and use SPARQL CONSTRUCT to build the RDF configuration for e-httpGetFiles out of that triple. It's a bit convoluted, but should be workable now.

This can be made simpler if the OpenSpending hook can POST data in RDF instead of a mere URL. For example, in JSON-LD:

{
  "@context": {
     "@vocab": "http://schema.org/"
  },
  "url": "http://some.where/datapackage.json"
}

JSON-LD can be directly ingested as RDF, so resorting to the trickery above is not needed.

marek-dudas commented 8 years ago

What @jindrichmynarz wrote is what I had in mind. I can implement it like that. @HimmelStein and @akariv: do you agree?

akariv commented 8 years ago

Yes, if it makes things easier then it's absolutely no problem to send RDF in the POST body. What should be the Content Type for such a request?

HimmelStein commented 8 years ago

@marek-dudas @akariv @jindrichmynarz I am trying to understand Jindrich's idea more clearly. The JSON-LD file has only two keys, "@context" and "url"; all other information is reached through the value of "url". CSV files in datapackage.json use relative paths (relative to "http://some.where/"). Any corrections?

marek-dudas commented 8 years ago

I think that is correct.

jindrichmynarz commented 8 years ago

The JSON-LD example in my comment represents only 1 triple:

_:b0 <http://schema.org/url> "http://some.where/datapackage.json" .

You can see how JSON-LD expands to RDF in the JSON-LD Playground (see the N-Quads tab).

As a side note, should we want to have the datapackage.json URL not as a literal and instead treat it as an RDF resource, we can use the following JSON-LD:

{
  "@context": {
    "@vocab": "http://schema.org/",
    "url": {"@type": "@id"}
  },
  "url": "http://some.where/datapackage.json"
}
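The difference between the two JSON-LD documents above can be made concrete with a small Python sketch. This is a hand-rolled illustration that only understands `@vocab` and `{"@type": "@id"}` term definitions on a flat document; it is not a JSON-LD processor, it does not escape literals, and `to_ntriples` is a hypothetical helper.

```python
def to_ntriples(doc):
    """Expand a flat JSON-LD document into N-Triples lines.

    Supports only @vocab and {"@type": "@id"} term definitions.
    """
    context = doc["@context"]
    vocab = context["@vocab"]
    lines = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context
        term = context.get(key, {})
        is_iri = isinstance(term, dict) and term.get("@type") == "@id"
        # An @id-typed value becomes an IRI, anything else a plain literal.
        obj = f"<{value}>" if is_iri else f'"{value}"'
        lines.append(f"_:b0 <{vocab}{key}> {obj} .")
    return lines


literal_doc = {
    "@context": {"@vocab": "http://schema.org/"},
    "url": "http://some.where/datapackage.json",
}
resource_doc = {
    "@context": {"@vocab": "http://schema.org/", "url": {"@type": "@id"}},
    "url": "http://some.where/datapackage.json",
}

print(to_ntriples(literal_doc)[0])
print(to_ntriples(resource_doc)[0])
```

The first line printed has a string literal as its object, the second an IRI in angle brackets.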

Regarding the resolution of the relative URLs in the datapackage.json, @marek-dudas can either parse the URL of the datapackage.json to obtain the base URL or the base URL can be explicitly provided in the JSON-LD input to the FDP2RDF pipeline using the @base attribute:

{
  "@context": {
    "@base": "http://some.where/",
    "@vocab": "http://schema.org/",
    "url": {"@type": "@id"}
  },
  "url": "datapackage.json"
}

@akariv: Regarding the content type header for JSON-LD payload, the standard is application/ld+json (see the spec). In the case of the FDP2RDF pipeline, the content type of the POST body will be ignored and instead manually hard-coded in the pipeline, so it is not strictly necessary to provide it.
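For reference, a client could prepare such a request roughly as follows. This is only a sketch: the endpoint URL is a made-up placeholder, `build_request` is a hypothetical helper, and the request is built but never sent. The actual pipeline may expect a file upload rather than a raw body, so the readme remains the authority on the exact call.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; the real pipeline execution URL is not given here.
PIPELINE_ENDPOINT = "http://example.org/fdp2rdf"


def build_request(descriptor_url, endpoint=PIPELINE_ENDPOINT):
    """Build (but do not send) a POST request carrying the JSON-LD payload."""
    payload = {
        "@context": {
            "@vocab": "http://schema.org/",
            "url": {"@type": "@id"},
        },
        "url": descriptor_url,
    }
    body = json.dumps(payload).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/ld+json"},
        method="POST",
    )


request = build_request("http://some.where/datapackage.json")
# urllib normalises header names, hence the lower-case "t" in the lookup.
print(request.get_method(), request.get_header("Content-type"))
```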

jindrichmynarz commented 8 years ago

Thinking over this again I realized that the @base attribute is of no help in establishing a base URI, because it is transparent when processed as RDF (@base is only a syntactical artefact of JSON-LD). @marek-dudas would probably need to implement something like urllib.parse.urljoin in SPARQL.
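As a rough sketch of what that would involve, the Python below compares `urllib.parse.urljoin` with a naive string-based approach of the kind that SPARQL 1.1's `REPLACE` and `CONCAT` functions could express. `naive_resolve` is a hypothetical helper and the URLs are placeholders; this is not the pipeline's actual query.

```python
import re
from urllib.parse import urljoin


def naive_resolve(descriptor_url, relative_path):
    """Strip the last path segment of the descriptor URL, then append the path."""
    base = re.sub(r"[^/]*$", "", descriptor_url)
    return base + relative_path


descriptor_url = "http://some.where/pkg/datapackage.json"

# For plain relative paths both approaches agree.
simple = "data/budget.csv"
print(naive_resolve(descriptor_url, simple))
print(urljoin(descriptor_url, simple))

# Paths with ".." segments are where the naive version falls short:
# it leaves the dot segments in place, while urljoin normalises them.
tricky = "../shared/codes.csv"
print(naive_resolve(descriptor_url, tricky))
print(urljoin(descriptor_url, tricky))
```

The naive version is probably sufficient when the resources sit next to or below the descriptor, but it also ignores absolute paths and fully qualified URLs, which `urljoin` handles.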

marek-dudas commented 8 years ago

The pipeline should now support a datapackage descriptor URL sent in a JSON-LD file, as discussed above. See the readme for more details.

jindrichmynarz commented 8 years ago

@marek-dudas: Can you explain why the name datapackage.jsonld is required?

marek-dudas commented 8 years ago

The pipeline has a simple file filter at the beginning, which switches between the "datapackage descriptor posted directly" and "just the URL of the descriptor posted" inputs based on the filename. This is less user-friendly but also less error-prone, in my opinion. I think it would take some time to enable arbitrary filenames, since the pipeline would first have to look into the file and determine from its content whether it is a datapackage descriptor or just its URL. And since LinkedPipes AFAIK does not support if/then/else nodes, it might get quite complicated.
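The trade-off can be sketched in Python, outside LinkedPipes. Both functions are hypothetical, and the sketch assumes that any file not named datapackage.jsonld is treated as a directly posted descriptor, which the thread does not confirm.

```python
import json

URL_INPUT_NAME = "datapackage.jsonld"  # the filename the pipeline filters on


def route_by_filename(filename):
    """Filename-based switch: no need to open the file."""
    return "descriptor-url" if filename == URL_INPUT_NAME else "descriptor"


def route_by_content(text):
    """Content-based switch: parse the file and inspect its keys (assumed heuristic)."""
    doc = json.loads(text)
    is_url_input = "@context" in doc and "url" in doc
    return "descriptor-url" if is_url_input else "descriptor"


url_input = json.dumps({
    "@context": {"@vocab": "http://schema.org/"},
    "url": "http://some.where/datapackage.json",
})
descriptor_input = json.dumps({"name": "example", "resources": []})

print(route_by_filename("datapackage.jsonld"), route_by_content(url_input))
print(route_by_filename("datapackage.json"), route_by_content(descriptor_input))
```

The content-based variant accepts arbitrary filenames but needs a parsing step followed by a branch, which is the part that is hard to express without if/then/else nodes.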

jindrichmynarz commented 8 years ago

I see. I thought the support of posting datapackage.json directly was dropped in favour of posting the download instructions.

marek-dudas commented 8 years ago

Just a reminder: the proposal was implemented and documented some time ago, so feel free to test it on the Fraunhofer server and then close the issue.