schmaluk commented 8 years ago

We need to upload our existing OBEU datasets: to our server in an automated way (ideally whenever the datasets are updated in the GitHub repository). They should become available as plain files here: and also in the staging triplestore in a named graph.

jindrichmynarz commented 8 years ago

Instead of syncing the dataset dumps, we should import the pipelines from and re-create the datasets on the server.

Doing this also tests that each dataset is reproducible given the provided pipeline. If the imported pipelines don't work, please raise an issue on the datasets repository.

schmaluk commented 8 years ago

Thanks @jindrichmynarz. I will try to do that. In that case this should probably not be automated, if it needs a human to supervise it.

jindrichmynarz commented 8 years ago

Maybe I misunderstand you, but the pipelines themselves are what automate the ETL processes. Do you also want to automate the pipeline import?

schmaluk commented 8 years ago

Yes, I initially thought that was a good idea for reducing work: importing whenever something is pushed to GitHub. But now I think it is probably better to do it manually for now, for validation.

jindrichmynarz commented 8 years ago

Theoretically, you could retrieve the pipelines in *.jsonld files using GitHub's API and import them to LP-ETL via its API.
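As a rough illustration of the retrieval half of that idea, here is a minimal Python sketch that lists a repository directory through GitHub's "repository contents" endpoint and keeps only the *.jsonld files. The owner, repository and path arguments are placeholders, and the function names are made up for this example. The import into LP-ETL is deliberately left out, since its API is not described in this thread.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def contents_url(owner: str, repo: str, path: str = "") -> str:
    """Build the URL of GitHub's repository contents endpoint for a directory."""
    return f"{GITHUB_API}/repos/{owner}/{repo}/contents/{path}".rstrip("/")


def jsonld_download_urls(listing: list) -> list:
    """Pick the raw download URLs of all *.jsonld files from a contents listing."""
    return [
        entry["download_url"]
        for entry in listing
        if entry.get("type") == "file" and entry.get("name", "").endswith(".jsonld")
    ]


def fetch_listing(owner: str, repo: str, path: str = "") -> list:
    """Fetch a directory listing from GitHub (this performs a network request)."""
    request = urllib.request.Request(
        contents_url(owner, repo, path),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `jsonld_download_urls(fetch_listing(owner, repo, path))` would give the URLs to download; each downloaded pipeline would then still have to be handed to LP-ETL in a separate step.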

jakubklimek commented 8 years ago

I think we should first define the intended workflow, as I think this is the other way around. First a pipeline producing a dataset is developed locally, then it should be deployed to the LP-ETL instance on the OBEU server (manually, to adjust/check output directories and triplestore settings), where it produces the data and stores it. Then this pipeline, and possibly its input and output (if not too large), are stored in GitHub. GitHub therefore serves as an archive, not as an input.

jindrichmynarz commented 8 years ago

I think GitHub should serve us primarily as a means of sharing pipelines. ETL developers should implement pipelines locally and push them to GitHub, from which they should be imported to the staging server and replicated.

schmaluk commented 8 years ago

My first thought was that just the produced datasets could be grabbed from GitHub and imported (maybe even without LP). This could be fully automated. But then neither the server nor the user checks whether the pipelines work correctly...

  1. If you use LP on the OBEU server in the first place and GitHub as an archive, that is probably more responsive for the dataset/ETL developer, since they can immediately check the datasets in the triplestores too. But errors are more likely to be detected by other people afterwards, once the datasets are shared publicly on GitHub. Also, all dataset/ETL developers need SSH access for this workflow.
  2. With the other approach, the pipelines can be checked by others beforehand. After being approved by other people through review on GitHub, they can be imported to the server manually (by me). If this is done by me, however, it might be even less responsive, since I'm not an expert on the datasets and need to raise issues when I detect problems.

I am neutral. We can do whatever works best for you.

jindrichmynarz commented 8 years ago

I think we should adopt a workflow similar to your option 2: two people should be involved. The pipelines should be made reusable without much knowledge about their internals. Necessary configuration or prerequisites should be documented.

jakubklimek commented 8 years ago

Option 2 makes sense if:

  1. There will be a substantial number of new datasets, for which new manual pipelines will be created. Is this planned? T2.1 and T2.2 have already ended, so this would only be to support analytics and data mining in the next 10 months on top of datasets that will not come from OpenSpending. Otherwise it is not worth establishing a proper workflow.
  2. There will be a proper validation workflow established, with a clear definition of who is responsible for the review of the pipeline itself, and of the steps involved, similar to the Deliverable workflow. Since even some of the project Deliverables did not follow the workflow, I don't see it as realistic to have such a workflow for the manually created pipelines (I assume there will be more datasets than deliverables). Typically, as it is now, a developer creates the pipeline, uploads it and the data, and that's it. Maybe someone notices problems in the resulting data and notifies the developer to fix them. But in reality, will pipeline developers (i.e. UEP, OKFGR, UBONN, IAIS) import pipelines made by others to validate them? Or is it really just for archiving? This makes sense for reusable pipeline fragments, but I am not sure about the resulting pipelines.
  3. There is a process for injecting credentials into the pipelines, as the credentials will not be stored in the GitHub version of the pipelines. This means some restrictions on the shape of the pipelines. This would not be necessary if the developer were responsible for adjusting the pipeline before loading it manually to the OBEU instance.

To sum up, if we think there will be at least tens of new manually created datasets for data analytics and data mining during the next 10 months and we want to add the validation workflow for them, option 2 is worth investigating. Otherwise I would stick with option 1.

schmaluk commented 8 years ago

So do we agree that each ETL/dataset pipeline developer will upload their pipelines to the OBEU server themselves, or am I wrong? However, I can assist or take over if it just means uploading the pipelines from our Git repo while they are busy with other tasks. I can provide (with Docker) some additional LinkedPipes instances for the transformations. This way the FDP-to-RDF pipeline doesn't get blocked by the transformations, and the transformations don't block each other. In any case the pipelines need to be adapted in order to: a) store the RDF files on our server:

jakubklimek commented 8 years ago

I would agree. In addition, if this is not established elsewhere, I would like to establish metadata rules for datasets uploaded to OBEU servers (in coordination with @larjohn as discussed in, which would affect not only the manually created datasets, but also the automatically created ones and therefore the FDP-to-RDF pipeline (cc @marek-dudas). Specifically (open for discussion):

  1. Metadata should be provided in a separate graph named <dataset-graph-uri/metadata>
  2. In manually created pipelines, these will be created using the DCAT-AP Dataset and DCAT-AP Distribution components, in FDP-to-RDF these would be generated based on the input FDP.

Dataset metadata should contain (from

In addition, there is Distribution metadata, that should contain:

The resulting metadata looks like this:

@prefix dcat: <> .
@prefix dcterms: <> .
@prefix foaf: <> .
@prefix rdf: <> .
@prefix rdfs: <> .
@prefix schema: <> .
@prefix vcard: <> .
@prefix xml: <> .
@prefix xsd: <> .

<> a dcat:Dataset ;
    dcterms:description "Testing dataset description"@en ;
    dcterms:modified "2016-07-22"^^xsd:date ;
    dcterms:publisher <> ;
    dcterms:temporal <> ;
    dcterms:title "Testing dataset"@en ;
    dcat:contactPoint <> ;
    dcat:distribution <> .

<> a vcard:Individual,
        vcard:Kind ;
    vcard:fn "Jakub Klímek" ;
    vcard:hasEmail "" .

<> a dcat:Distribution ;
    dcterms:format <> ;
    dcterms:license <> ;
    dcterms:modified "2016-07-22"^^xsd:date ;
    dcat:accessURL <> ;
    dcat:downloadURL <> ;
    dcat:mediaType <> .

<> a dcterms:PeriodOfTime ;
    schema:endDate "2016-07-31"^^xsd:date ;
    schema:startDate "2016-07-01"^^xsd:date .

<> a dcterms:LicenseDocument ;
    dcterms:type <> .

<> a dcterms:MediaTypeOrExtent .

<> a <> .

<> a dcterms:MediaTypeOrExtent .

<> a foaf:Agent ;
    dcterms:type <> ;
    foaf:name "Vysoká škola ekonomická v Praze"@cs,
        "University of Economics, Prague"@en .

and a pipeline fragment which generates this data is here: OBEU Dataset metadata demo.txt

I recommend that each pipeline developer creates his/her own pipeline fragment for the metadata so that only the properties that change among datasets can be changed.

jindrichmynarz commented 8 years ago

I think metadata is another large issue that can be discussed separately.

Some points on the metadata issue:

What is actually our current use case for having distribution metadata? I understand the use case for dataset metadata described by @larjohn, but I wonder what we need distribution metadata for.

jakubklimek commented 8 years ago

Link from a qb:DataSet to its metadata graph should be explicit (e.g., via rdfs:seeAlso), instead of based on convention, such as appending /metadata to the dataset's IRI. In this way, consumers may simply do things like ?dataset rdfs:seeAlso/dcterms:title ?title . without needing to mash strings together in IRIs.

Sure, but still the metadata graph IRI should follow some convention. Actually, the rdfs:seeAlso should technically lead to the dcat:Dataset IRI (which should be the same as the dataset graph IRI), not the metadata graph IRI.

Some distribution metadata, such as the download URL, depends on the configuration of loaders. This requires duplication in configuration, which is potentially error-prone. Configurable components might help, but may be too complex.

This also depends on the configuration of the publishing server. Sure, it is error-prone, on the other hand we don't expect many manually created pipelines and most of the datasets will be generated automatically. This is not a big issue.

What is actually our current use case for having distribution metadata? I understand the use case for dataset metadata described by @larjohn, but I wonder what we need distribution metadata for.

One reason is being compliant with DCAT-AP, and along with it goes the possibility of generating a DCAT-AP compatible catalog of OBEU datasets. This is definitely worth the few clicks.

On the other hand, the use of the components does not have to be mandatory. If it seems more convenient for you, you can generate the metadata in any other way, such as SPARQL Construct, as long as it follows the specification.

jindrichmynarz commented 8 years ago

Sure, but still the metadata graph IRI should follow some convention.

Agreed. Let's propose a convention in D1.5.

Actually, the rdfs:seeAlso should technically lead to the dcat:Dataset IRI (which should be the same as the dataset graph IRI), not the metadata graph IRI.

I think we agreed that the instance of qb:DataSet also instantiates dcat:Dataset and also its IRI is used as the named graph IRI. Thus there would be no benefit in having <dataset> rdfs:seeAlso <dataset>. However, having an explicit link to the metadata graph would be useful.
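To make the two ideas concrete, here is a small Python sketch that derives a metadata graph IRI by convention and also emits the explicit rdfs:seeAlso link as an N-Triples line. The "/metadata" suffix is only the convention proposed earlier in this thread, not a settled decision, and the function names and the example.org IRI are invented for illustration.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def metadata_graph_iri(dataset_iri: str) -> str:
    """Derive the metadata graph IRI by convention (assumed '/metadata' suffix)."""
    return dataset_iri.rstrip("/") + "/metadata"


def see_also_triple(dataset_iri: str) -> str:
    """Return an N-Triples line that links the dataset to its metadata graph explicitly."""
    return f"<{dataset_iri}> <{RDFS_SEE_ALSO}> <{metadata_graph_iri(dataset_iri)}> ."


if __name__ == "__main__":
    # Hypothetical dataset IRI, used only to show the shape of the output.
    print(see_also_triple("http://example.org/dataset/demo"))
```

With the explicit triple stored next to the dataset, consumers can follow the link in a query and never need to know how the graph IRI was built.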

I agree with the motivation to use DCAT distribution metadata.

skarampatakis commented 8 years ago

Hi, we are in the process of uploading the Greek Municipalities datasets to Fuseki. Currently all these datasets use the old metadata component carried over from UV. An example is shown below:

@prefix dc: <> .
@prefix foaf: <> .
@prefix terms: <> .
@prefix sesame: <> .
@prefix fn: <> .

<> terms:creator <> , <> ;
    terms:issued "2016-01-22"^^xsd:date ;
    terms:language <> ;
    terms:license <> ;
    terms:modified "2016-05-18"^^xsd:date ;
    terms:publisher <> ;
    terms:title "Προϋπολογισμός Εξόδων του Δήμου Θεσσαλονίκης (Ελλάδα) για το έτος 2011"@el ;
    a <> , qb:DataSet ;
    <> <> ;
    foaf:name "" ;
    obeu-attribute:currency obeu-currency:EUR ;
    obeu-dimension:fiscalYear <> ;
    obeu-dimension:operationCharacter obeu-operation:expenditure ;
    obeu-dimension:organization <Δήμος_Θεσσαλονίκης> ;
    dc:publisher <> ;
    terms:contributor _:node1aj2g43bfx4084 ;
    qb:structure <> ;
    rdfs:label "Municipality of Thessaloniki (Greece) expenditure Budget for the fiscal year 2011"@en ;

Q1: The problem is that these are contained within the same named graph; in addition, distribution metadata is missing, as this specification was only recently introduced. Should we separate them and include the distribution metadata?

This would require manually editing all the pipelines for the datasets to be uploaded, of which there are over 80 in our case.

Moreover, uploading to Fuseki and public dumps via LP would also require manual editing of all pipelines.

As I understand it, those datasets are to be uploaded to test data mining and other tasks under development.

Q2: If we need to be ready by the prototype deadline, we can simply bulk upload all datasets "as is".

Editing of the pipelines could be performed at a later stage, when D1.5 is in a stable form. I believe we should first decide "officially" what metadata to include and what not, in which format it should be available (we use RDF/XML and Turtle; you proposed TriG above), resolve issues with the data model (if any), and then edit the pipelines for the last time.

jindrichmynarz commented 8 years ago

Q1: The problem is that these are contained within the same named graph, also distribution metadata is missing as this specification was only recently introduced. Should we separate them and include the distribution metadata?

Yes. See here.

This would require manually editing all the pipelines for the datasets to be uploaded, of which there are over 80 in our case.

As I argued above, the effort invested in metadata might outweigh the value of metadata in the project. Unlike the requirements for dataset metadata for visualization, I currently don't see a way in which distribution metadata will be used. @jakubklimek argued above that adding distribution metadata is a matter of a "few clicks"; however, the situation is different for a large number of datasets. Hence, I think we should first establish how distribution metadata is to be used in the project.

Moreover, uploading to Fuseki and public dumps via LP would also require manual editing of all pipelines.

Since LP-ETL pipelines are represented in RDF, you can write a SPARQL Update to change the configuration of multiple pipelines at once. However, since this involves deserializing the JSON-LD and serializing it back, it may turn out to require more effort than manual editing.
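For illustration only, here is a Python sketch that builds such a SPARQL Update as a string. It assumes the pipelines have already been loaded into an RDF store and that the setting to change is a plain literal on a single predicate. The predicate IRI is passed in as a parameter because the actual LP-ETL configuration vocabulary is not given in this thread; both function names are invented for this example.

```python
def sparql_string(value: str) -> str:
    """Quote a Python string as a SPARQL string literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def replace_literal_update(predicate_iri: str, old_value: str, new_value: str) -> str:
    """Build a DELETE/INSERT update that swaps one literal value for another.

    The update touches every subject that has `old_value` for the predicate,
    so it changes all loaded pipelines in one go.
    """
    return "\n".join([
        f"DELETE {{ ?s <{predicate_iri}> ?old }}",
        f"INSERT {{ ?s <{predicate_iri}> {sparql_string(new_value)} }}",
        f"WHERE  {{ ?s <{predicate_iri}> ?old . "
        f"FILTER (STR(?old) = {sparql_string(old_value)}) }}",
    ])
```

The generated text would be sent to the store's update endpoint, after which the modified pipelines still have to be serialized back to JSON-LD, which is the extra effort mentioned above.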

Q2: If we need to be ready by the prototype deadline, we can simply bulk upload all datasets "as is".

If we still stick with the requirements for dataset metadata, changes to the pipelines producing the data will be necessary anyway, so I don't think we'd save much effort by uploading dataset dumps first.

Editing of the pipelines could be performed at a later stage, when D1.5 is in a stable form.

The data model is very stable. But I agree with the motivation that it would be best to wait for the D1.5 to be finalized.

skarampatakis commented 8 years ago

The data model is very stable. But I agree with the motivation that it would be best to wait for the D1.5 to be finalized.

I was referring to the general guidelines that I believe will be present in D 1.5, such as dataset and metadata IRI conventions, mandatory and optional metadata properties, serialization format(?), and the like.

If we still stick with the requirements for dataset metadata, changes to the pipelines producing the data will be necessary anyway, so I don't think we'd save much effort by uploading dataset dumps first.

Sure, but if other tasks are blocked by the dataset upload task, should it be postponed? In the case of our datasets, most of the basic suggested properties are already present. I don't believe this is of concern to any other task. We would just save the effort of editing and re-running the pipelines twice, a task we have already done a couple of times.

jindrichmynarz commented 8 years ago

I was referring to the general guidelines that I believe will be present in D 1.5, such as dataset and metadata IRI conventions, mandatory and optional metadata properties, serialization format(?), and the like.

I see. This is mostly already scattered in various GitHub comments.

All of this (and more) will be added to D1.5.

Sure, but if other tasks are blocked by the dataset upload task, should it be postponed? In the case of our datasets, most of the basic suggested properties are already present. I don't believe this is of concern to any other task. We would just save the effort of editing and re-running the pipelines twice, a task we have already done a couple of times.

Then I would simply upload the pipelines as they are and adjust them later.

skarampatakis commented 8 years ago

I see. This is mostly already scattered in various GitHub comments.

All of this (and more) will be added to D1.5.

That's why I insist on a concrete, final decision being taken before re-editing the pipelines. And the proper place for it is the final version of D 1.5.

Then I would simply upload the pipelines as they are and adjust them later.

If we are not going to re-run the pipelines for now, why upload them? While this is a pretty simple task for a shell script, I believe we can just bulk upload the datasets to Fuseki and put the serialized dataset files, in whatever format, in the /dump folder. We can upload the final version of the pipelines after D 1.5.
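A bulk upload of this kind could look roughly like the following, sketched here in Python instead of a shell script. It relies on the SPARQL Graph Store HTTP Protocol, which Fuseki exposes under a dataset's "data" service, and uses PUT so that each named graph is replaced by the uploaded file. The server URL, dataset name and graph IRIs are assumptions, and `graph_store_request` is a name made up for this sketch.

```python
import urllib.parse
import urllib.request


def graph_store_request(
    fuseki_dataset_url: str, graph_iri: str, turtle_bytes: bytes
) -> urllib.request.Request:
    """Prepare a Graph Store Protocol PUT that replaces one named graph with Turtle data."""
    query = urllib.parse.urlencode({"graph": graph_iri})
    return urllib.request.Request(
        url=f"{fuseki_dataset_url.rstrip('/')}/data?{query}",
        data=turtle_bytes,
        method="PUT",
        headers={"Content-Type": "text/turtle; charset=utf-8"},
    )


def upload(fuseki_dataset_url: str, graph_iri: str, path: str) -> int:
    """Send one Turtle file to the server and return the HTTP status code."""
    with open(path, "rb") as dump:
        request = graph_store_request(fuseki_dataset_url, graph_iri, dump.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

Looping `upload` over a mapping from graph IRIs to dump files would cover the 80-odd datasets; copying the same files into the /dump folder is a plain file copy and is not shown.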

jindrichmynarz commented 8 years ago

Actually, I think uploading all pipelines would be overkill at this moment. What we need to find out first is whether we can re-run the pipelines on a different server. So I would start with 1-2 pipelines and see if they work. Then we can adjust them, see if they match our requirements, and only then migrate all the remaining pipelines.

skarampatakis commented 8 years ago

Now we are on the same page. Thank you!

pwalsh commented 7 years ago

@badmotor @jakubklimek @jindrichmynarz I'm closing this. It must be out of date, and any specific issues can be raised with distinct issues.