Import datasets from Github Repository into Virtuoso

schmaluk commented 7 years ago

In our tech call the idea came up to connect a GitHub repository with our Virtuoso triplestore. Here we can discuss whether that is needed at all, whether it can be implemented, and what such an import should look like.

The use case is: municipalities that can transform their data into our OBEU RDF format themselves can upload their datasets to our triplestore.

So whenever datasets are pushed to the repo, an import should be triggered. Points the import has to address:

- Deltas of the datasets (only changed or new datasets should be imported).
- What kind of repository should be used?
- Should the datasets have an owner? (Otherwise someone could specify a wrong graph name and the import would replace that graph.)

As far as I can tell, we currently have two methods for importing datasets into Virtuoso: a) via pipelines and the LinkedPipes installations on the server, b) via Fhg handling the import (TriG or ttl/ttl.graph format).

schmaluk commented 7 years ago

There should probably be at least some kind of check that graph names are unique, both across repositories and within a single repository. Otherwise the imported graphs will replace each other in Virtuoso.
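A minimal sketch of such a uniqueness check, assuming the graph names declared by each repository have already been collected into a dictionary. The function name `find_graph_collisions` and the input shape are made up for illustration and are not part of any existing OBEU tooling.

```python
from collections import defaultdict


def find_graph_collisions(graphs_by_repo):
    """Return {graph_iri: [repos...]} for every graph IRI claimed more than once.

    graphs_by_repo maps a repository name to the list of graph IRIs its
    datasets declare (hypothetical input shape). A repository that declares
    the same IRI twice shows up twice in the result list.
    """
    claims = defaultdict(list)
    for repo, graph_iris in graphs_by_repo.items():
        for iri in graph_iris:
            claims[iri].append(repo)
    return {iri: sorted(repos) for iri, repos in claims.items() if len(repos) > 1}


if __name__ == "__main__":
    declared = {
        "municipality-a": ["http://example.org/graph/budget-2016"],
        "municipality-b": ["http://example.org/graph/budget-2016"],
    }
    for iri, repos in find_graph_collisions(declared).items():
        print(f"{iri} is claimed by: {', '.join(repos)}")
```

An import could refuse to run while this returns a non-empty result, so that no graph is silently replaced.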

jindrichmynarz commented 7 years ago

Municipalities will use the FDP2RDF pipeline, which can either directly import to Virtuoso or produce a file dump that will be loaded to Virtuoso.

What is the use case of involving Git in this workflow?

skarampatakis commented 7 years ago

Will municipalities use only the FDP2RDF pipeline? Is the workflow to first upload just the CSV file to OS Packager?

jindrichmynarz commented 7 years ago

That's what I understood.

skarampatakis commented 7 years ago

I think it is just one use case, not the only one. And that was the answer we got on the tech call.

pwalsh commented 7 years ago

@skarampatakis the integration of the RDF pipeline requires CSV files to be added via the openspending packager, yes.

Also note that if municipalities have large or complex transformations from one or many tabular data sources, they can also use OpenSpending's native pipeline framework. It does not have the performance restrictions of LinkedPipes, but it has the drawback (in terms of piloting OBEU) of not loading the data into the triple store. It does load the data into OS's SQL and Elasticsearch backends, though, and provides a very rich API to the data.

pwalsh commented 7 years ago

- Pipeline monitoring UI: http://staging.openspending.org/pipelines/
- Examples of pipelines generated from GitHub repos (with some rather complex transformations):
  - https://github.com/os-data/eu-structural-funds
  - https://github.com/os-data/mexican-federal-budget
- Code: https://github.com/frictionlessdata/datapackage-pipelines
- An example of a simple pipeline configuration: https://github.com/os-data/eu-structural-funds/blob/master/data/DE.germany/DE3.berlin/ESF%202007-2013/source.description.yaml#L1

schmaluk commented 7 years ago

Hi, thanks for the feedback. Yes, this was meant as an additional feature that was discussed at the last tech call. I hope I have understood @larjohn correctly here. Some municipalities seem to be able to provide datasets directly in our OBEU RDF format themselves, without needing the FDP-to-RDF pipeline via the OS Packager, and possibly adding richer information to the RDF datasets. I just wanted to check with you whether this can and should be done at all, and how it should work.

jindrichmynarz commented 7 years ago

OK, some further thoughts on your questions:

deltas of the datasets (only datasets with a change or new ones should be imported)

I think it would be easier to drop and reimport each dataset that was affected by a commit.
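A rough sketch of the drop-and-reimport idea, assuming each Turtle dump `x.ttl` has a sidecar file `x.ttl.graph` holding its graph IRI (the ttl/ttl.graph convention mentioned earlier in the thread) and that the store accepts standard SPARQL 1.1 Update. The helper names, the `read_text` callback and `base_url` are hypothetical.

```python
def affected_dumps(changed_paths):
    """Map the files touched by a commit to the .ttl dumps that need reimporting."""
    dumps = set()
    for path in changed_paths:
        # A change to the sidecar file affects the dump it belongs to.
        if path.endswith(".ttl.graph"):
            path = path[: -len(".graph")]
        if path.endswith(".ttl"):
            dumps.add(path)
    return sorted(dumps)


def reimport_statements(changed_paths, read_text, base_url):
    """Build SPARQL 1.1 Update statements that drop and reload each affected graph.

    read_text(path) returns the content of a repository file as a string;
    base_url is a prefix under which the store can fetch the dumps.
    Both are assumptions made for this sketch.
    """
    statements = []
    for dump in affected_dumps(changed_paths):
        graph = read_text(dump + ".graph").strip()
        statements.append(f"DROP SILENT GRAPH <{graph}>")
        statements.append(f"LOAD <{base_url}{dump}> INTO GRAPH <{graph}>")
    return statements


if __name__ == "__main__":
    files = {"data/a.ttl.graph": "http://example.org/graph/a\n"}
    changed = ["data/a.ttl", "README.md"]
    for stmt in reimport_statements(changed, files.__getitem__, "https://example.org/raw/"):
        print(stmt + " ;")
```

The sketch does not handle dumps deleted by a commit; those would need a `DROP` without a following `LOAD`, using the graph IRI from the previous revision.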

Deltas can work if the data is stored in a line-based format that can be diffed easily, such as N-Quads. However, combining Git with scripts that apply diffs to an RDF store can get messy. If we really want data storage with versioning, then we would be better off using a versioned RDF store, such as R&W Store, or at least a specified protocol, for example one based on the eccrev vocabulary. Nevertheless, I think these approaches are not yet mature enough, and since this is not a research topic for OpenBudgets.eu, I would consider it outside the scope of the project.
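To illustrate why a line-based format helps, here is a small sketch that computes a delta between two versions of an N-Quads file as plain set differences over lines. The function name `nquads_delta` is invented for this example, and the approach assumes both versions are serialized consistently and contain no blank nodes, since blank node labels are not stable across serializations.

```python
def nquads_delta(old_text, new_text):
    """Return (added, removed) statements between two N-Quads documents.

    Each non-empty, non-comment line is treated as one statement and
    compared as a literal string, so the same quad written with different
    whitespace or escaping would count as a change.
    """
    def statements(text):
        return {
            line.strip()
            for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        }

    old, new = statements(old_text), statements(new_text)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__":
    old = '<http://example.org/s> <http://example.org/p> "1" <http://example.org/g> .\n'
    new = '<http://example.org/s> <http://example.org/p> "2" <http://example.org/g> .\n'
    added, removed = nquads_delta(old, new)
    print("added:", added)
    print("removed:", removed)
```

Turning `added` and `removed` into `INSERT DATA` and `DELETE DATA` requests is where the messiness mentioned above starts, which is one reason dropping and reimporting whole graphs is simpler.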

What kind of repository should be used?

Do you want to give the municipalities push access to https://github.com/openbudgets/datasets? I think each municipality can have its own repository and you can provide them with a Git post-commit hook that would do the synchronization work.

Should the datasets have an owner? (Otherwise someone could specify a wrong graph name and the import would replace that graph.)

If there are one-to-one correspondences between repositories of municipalities and their RDF stores, this problem would be mitigated.
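If a shared store were used instead, a weaker variant of the same idea would be to give each repository its own graph namespace and reject graph names outside it. This is a different technique from the one-store-per-repository setup described above, sketched here only for comparison; the function `graph_allowed` and the namespace mapping are hypothetical.

```python
def graph_allowed(repo, graph_iri, namespaces):
    """Check that graph_iri lies inside the namespace assigned to repo.

    namespaces maps a repository name to the IRI prefix it owns
    (an assumption of this sketch). Prefixes should end with a delimiter
    such as "/" so that one namespace cannot be a string prefix of another.
    """
    prefix = namespaces.get(repo)
    return prefix is not None and graph_iri.startswith(prefix)


if __name__ == "__main__":
    namespaces = {"municipality-a": "http://example.org/graph/municipality-a/"}
    print(graph_allowed("municipality-a", "http://example.org/graph/municipality-a/2016", namespaces))
    print(graph_allowed("municipality-a", "http://example.org/graph/municipality-b/2016", namespaces))
```

An import would run this check before touching the store, so a mistyped graph name fails early instead of replacing another owner's graph.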

pwalsh commented 7 years ago

@jindrichmynarz @skarampatakis @badmotor

Should this be kept open for action, or not?

If it should be actioned, let's assign a single person with responsibility.

pwalsh commented 7 years ago

Closing, as no clear responsibility.

openbudgets / platform

Import datasets from Github Repository into Virtuoso #24