openbudgets / pipeline-fragments

Reusable fragments of LinkedPipes ETL pipelines

Output uses generic codelists #18

Closed skarampatakis closed 7 years ago

skarampatakis commented 7 years ago

Output from the FDPtoRDF pipeline uses generic, synthetic-looking codelists for dimensions like economicClassification or administrativeClassification.

Is the intention that we enrich and improve the datasets at a later stage?

To be clearer: in the Greek Municipalities we use common hierarchical codelists as values for economicClassification and administrativeClassification. These are not originally included in the data, but rather derived from them. How can we include this process in the procedure?

My thought is that we could define use-case-specific pipeline fragments to enrich the data after the basic transformation.

Otherwise the transformation is just a representation of FDP in RDF.

jakubklimek commented 7 years ago

As far as I know we said that there will be

  1. the generic, FDP to RDF transformation producing the FDP representation in RDF and if more than that is desirable,
  2. a custom pipeline creating the OBEU RDF representation without FDP in the middle.

Creating additional pipelines on top of 1. is of course possible, but it is not clear how exactly they would fit into our desired workflow, and the question is whether it would not be easier to go with 2. in that case.

skarampatakis commented 7 years ago

In the case of 2, you lose the advantage of using a wizard for dataset creation. Originally it was planned that there would be a wizard to configure key aspects of a transformation process and upload data. We switched to using OS Packager to upload datasets to OS first and then run the hook to trigger the pipeline, in order to take advantage of the already built wizard.

I can't imagine any PA employee manually configuring pipelines.

Will there be another wizard for custom, OS independent data uploads? Please forgive me if I have misunderstood something here.

jakubklimek commented 7 years ago

That is correct: 1) is for PAs, where the FDP to RDF process is fully automated, which excludes adding more information than there is in FDP, and 2) is for more sophisticated transformations made by experts.

I am not aware of any other wizard for fine tuning the result of FDP to RDF being planned or discussed.

jindrichmynarz commented 7 years ago

I don't think the original intention for the FDP to RDF transformation was merely a syntactic mapping to RDF. While it is not meant to produce high fidelity datasets that fully conform to the OpenBudgets.eu data model, it can reuse terms from the data model.

jakubklimek commented 7 years ago

Still, the transformation is automatic, which means it can only use information already present in FDP. If it is possible to determine more specific OBEU data model terms from FDP, then it should. Can this be further specified?

jindrichmynarz commented 7 years ago

It's true that FDP doesn't provide much semantics for the mapping to latch onto. There are several dimension types or recognized code lists. This allows us to make broad mappings, such as minting subproperties of obeu-dimension:classification. Besides explicit specifications, we could make mappings based on matching names, but that would probably be brittle. For instance, instead of declaring economicClassification as a subproperty of obeu-dimension:classification, it could be cast directly as obeu-dimension:economicClassification (based on the matching local name). This method may produce false positives, where the economicClassification from the FDP source has different semantics.

Since such a general approach seems brittle, I think it would need to be discussed on a case-by-case basis.
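
The two mapping strategies discussed here (minting a subproperty versus casting by matching local name) can be compared in a minimal Python sketch. This is only an illustration, not the FDPtoRDF pipeline's actual logic: the `ex-dimension:` prefix, the `KNOWN_OBEU_DIMENSIONS` set and the `map_dimension` function are hypothetical names introduced for this example.

```python
# Assumed, incomplete list of local names that exist under obeu-dimension:.
KNOWN_OBEU_DIMENSIONS = {"economicClassification", "administrativeClassification"}


def map_dimension(local_name, match_names=False):
    """Return (property to use, extra triples to emit) for an FDP dimension.

    Triples are plain (subject, predicate, object) tuples of prefixed names.
    """
    if match_names and local_name in KNOWN_OBEU_DIMENSIONS:
        # Name-based cast: reuse the OBEU term directly. This is the brittle
        # variant, because the source column may carry different semantics.
        return f"obeu-dimension:{local_name}", []
    # Conservative variant: mint a dataset-specific property (hypothetical
    # ex-dimension: prefix) and declare it a subproperty of the broad term.
    minted = f"ex-dimension:{local_name}"
    return minted, [(minted, "rdfs:subPropertyOf", "obeu-dimension:classification")]


conservative = map_dimension("economicClassification")
name_based = map_dimension("economicClassification", match_names=True)
```

The conservative variant never claims more than the FDP source states, which is why it is the safer default for a fully automatic transformation.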

pwalsh commented 7 years ago

@jakubklimek

Still, the transformation is automatic, which means it can only use information already present in FDP. If it is possible to determine more specific OBEU data model terms from FDP, then it should. Can this be further specified?

IIRC Tiansi and Marek were to work on this (mapping OS/FDP types more explicitly to OBEU types) - it was discussed in the Thessaloniki Plenary.

skarampatakis commented 7 years ago

Let's talk about a specific, real scenario to better understand the workflow.

We have budget data from the Municipality of Thessaloniki. These data were gathered by custom scripts from the official service of the Municipality. Custom scripts were required because the data came in separate files for each year, operation character and public service (administration) of the municipality. The data were concatenated into two files for each year, one for expenditure and one for revenue.

The files are CSV formatted, with each row containing a specific activity (or functionality): the label of the administration, the code of the functionality, a description/label of it, and 4 to 5 data columns containing amounts for each budget phase. Originally we described these data using the OBEU data model, creating an observation for each budget phase, differing from the other observations derived from the same row/functionality only in the budget phase dimension. All the remaining dimensions are the same for each observation of the same row.
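
The row-to-observations modelling described above can be sketched roughly as follows. The column names, the sample row and the phase names are assumptions made up for illustration; the real Thessaloniki files may be laid out differently.

```python
import csv
import io

# Hypothetical sample: one row per functionality, one amount column per budget phase.
SAMPLE_CSV = """administration,code,description,approved,revised,executed
Technical Services,6117.001,Consultancy fees,1000.00,1200.00,900.50
"""

# Assumed names of the amount columns, one per budget phase.
PHASE_COLUMNS = ["approved", "revised", "executed"]


def row_to_observations(row, phase_columns=PHASE_COLUMNS):
    """Create one observation per budget phase from a single CSV row."""
    # Every non-amount column is shared by all observations of the row.
    shared = {key: value for key, value in row.items() if key not in phase_columns}
    observations = []
    for phase in phase_columns:
        amount = row.get(phase)
        if amount is None or amount.strip() == "":
            continue  # this phase is not reported for the row
        observations.append({**shared, "budgetPhase": phase, "amount": float(amount)})
    return observations


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
observations = [obs for row in rows for obs in row_to_observations(row)]
```

Each resulting observation differs from its siblings only in `budgetPhase` and `amount`, which is the property described in the paragraph above.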

We added two dimensions, administrative and economic, after the basic transformation, using common codelists that are specific to and standard for the Greek Municipalities. The mapping is based on regex. More specifically, the data contain the label of the administration, so we compare it with the labels in the codelist to create the mapping. The economic classification code is hidden in the functionality code (the first 4 digits), so we compare this with the notations of the particular codelist (KAE).
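
A rough sketch of this regex-based enrichment follows, under stated assumptions: the codelist excerpts, the `example.org` IRIs and the sample code format `6117.001` are invented for illustration and are not taken from the real KAE or administrative codelists.

```python
import re

# Hypothetical codelist excerpts: notation or label mapped to a concept IRI.
KAE_BY_NOTATION = {
    "6117": "http://example.org/codelist/kae/6117",
    "7131": "http://example.org/codelist/kae/7131",
}
ADMIN_BY_LABEL = {
    "technical services": "http://example.org/codelist/admin/30",
}

# The economic classification code is the first 4 digits of the functionality code.
LEADING_FOUR_DIGITS = re.compile(r"^\s*(\d{4})")


def economic_concept(functionality_code):
    """Look up the KAE concept whose notation matches the 4-digit prefix."""
    match = LEADING_FOUR_DIGITS.match(functionality_code)
    if match is None:
        return None
    return KAE_BY_NOTATION.get(match.group(1))


def administrative_concept(label):
    """Look up the administration by label, ignoring case and extra whitespace."""
    normalised = re.sub(r"\s+", " ", label).strip().casefold()
    return ADMIN_BY_LABEL.get(normalised)
```

For example, `economic_concept("6117.001")` returns the IRI registered for notation `6117`, while a code that does not start with four digits returns `None` and would need manual review.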

Using the OS packager you have the following options

image

Here you define the basic dimensions. If you declare that a specific column is a label, you also have to provide a column for the code.

image

Here you define the amounts. I believe that the main difference is that we perceive operation character and budget phase as dimensions, while in FDP these are attributes. That is the reason for the difference in modeling between the custom and the automated way. I don't know how easy it would be to override this and be compliant with the OBEU data model, but I think it should be done.

The other issue is that I cannot continue this procedure because there isn't any date column (!!) and I cannot define a common date for the whole dataset (!!!!!!!).

pwalsh commented 7 years ago

Hey all

We have two distinct issues here:

  1. Mapping OS/FDP types to OBEU types. We discussed this in Thessaloniki, @HimmelStein was to lead on this, based on the initial pipeline that @marek-dudas wrote (specifically, Tiansi was to take the generic pipeline and add the specific mappings for the actual types used in OS. I think https://github.com/openbudgets/obeu-types was a start on this that Tiansi made). @HimmelStein @badmotor what is/was the status of this?
  2. The OS Viewer does not currently support what we call constants in the FDP spec (date in this case, @skarampatakis, is a constant, and needs to be declared outside of the data source. Constants are supported by the rest of the system, including pipelines). This has not blocked any usage so far, but of course we want to add it and it is scheduled: https://github.com/openspending/openspending/issues/1130 . I'll talk to @akariv and get it prioritized for addition ASAP.

HimmelStein commented 7 years ago

@pwalsh @skarampatakis I remember that I encountered the same problem when I was using the interface several months ago, and raised an issue for it (the interface requires that a dataset must have a date column). The solution is quite easy: Adam just needs to remove this check in his code (maybe only one line needs to be disabled).

@pwalsh In Thessaloniki, we suggested an automatic tool to transform OS/FDP types into OBEU types. I made a mapping based on Adam's code https://github.com/openbudgets/obeu-types, and Marek (@marek-dudas) developed a tool based on the FDP documentation at OS, updated with the mapping table. Marek's pipeline is installed on the Fraunhofer server and has been tested in test-and-update cycles (first by me, later by Maik, @mlukasch). It would be nice that the took can be tested more widely.

HimmelStein commented 7 years ago

the took can be tested more widely --> the tool can be tested more widely.

fathoni commented 7 years ago

Aside from the differences in handling the attributes vs. dimensions in FDP vs. OBEU data model as raised by Sotiris, I am also concerned about how the datasets' heterogeneity should be handled during the transformation.

All the datasets that we've been trying to work on here (e.g., ESIF, Aragon, Bonn) have different structures. ESIF lists incomplete code lists (code + label) inside the datasets themselves; Aragon lists the code lists externally (and nicely) in separate files; and Bonn, on the other hand, encodes appended code lists implicitly, and these code lists need to be preprocessed (e.g., with regex, or by creating the CSV files manually to extract all of the code lists). Not to mention the OpenCoesione datasets with their vast number of columns.

If this needs to be addressed, one more thing may need to be added to the OS Packager (i.e., allowing separate files to be added which contain the full list of codes and descriptions). Otherwise, the data maintainers would have to restructure their datasets so that they include the codes and the code descriptions as well.

pwalsh commented 7 years ago

@fathoni OS Packager does not and will not handle cases of external code lists - such scenarios are extremely rare in real, published fiscal data - almost all data is published with codes and descriptions inlined in the data files.

skarampatakis commented 7 years ago

@pwalsh

The OS Viewer does not currently support what we call constants in the FDP spec (date in this case, @skarampatakis is a constant, and needs to be declared outside of the data source. constants are supported by the rest of the system, including pipelines).

I think you mean OS Packager here, not Viewer. Thank you. I believe it is something really needed. In the case of the Greek Municipalities, almost none of the datasets contain a column dedicated to the date dimension itself, because it is indeed a constant.

@fathoni OS Packager does not and will not handle cases of external code lists - such scenarios are extremely rare in real, published fiscal data - almost all data is published with codes and descriptions inlined in the data files.

Speaking of this, none of the datasets we gathered about the Greek Municipalities contain all the codelists inline in nice, separate columns. They are rather "hidden" within the data. In some cases there are just the codes, whilst in others just the labels. The latter is the worst, as OS Packager currently requires a separate column for the codes in order to continue. So I believe that it is not really rare, as it seems to be an issue in almost all datasets we have encountered so far during OpenBudgets.eu.

There are solutions where we can all be happy. I believe that there is no need to load codelists into the packager. One could provide just the name of the codelist (by name I mean the IRI of the codelist's Concept Scheme) and the rest would be handled by the FDPtoRDF pipeline. If it is a code column, it would be mapped to the corresponding skos:notation, and if it is a label column, it would be mapped to the corresponding skos:prefLabel with some regex. OS could benefit as well, because it could get the missing information about the codelists (codes or labels + hierarchies).
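
As a minimal sketch of this proposal, assuming the pipeline can load a codelist given the IRI of its concept scheme: the in-memory `CODELISTS` dictionary, the `example.org` IRIs and the `resolve_concept` function are hypothetical stand-ins for whatever lookup the pipeline would actually perform.

```python
import re

# Hypothetical codelist excerpt, keyed by the IRI of its SKOS concept scheme.
CODELISTS = {
    "http://example.org/codelist/kae": [
        {"iri": "http://example.org/codelist/kae/6117",
         "notation": "6117", "prefLabel": "Consultancy fees"},
        {"iri": "http://example.org/codelist/kae/7131",
         "notation": "7131", "prefLabel": "Road construction"},
    ],
}


def _normalise(text):
    """Collapse whitespace and ignore case before comparing values."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def resolve_concept(scheme_iri, value, column_kind):
    """Match a cell value against skos:notation (codes) or skos:prefLabel (labels)."""
    if column_kind == "code":
        key = "notation"
    elif column_kind == "label":
        key = "prefLabel"
    else:
        raise ValueError(f"unknown column kind: {column_kind!r}")
    wanted = _normalise(value)
    for concept in CODELISTS.get(scheme_iri, []):
        if _normalise(concept[key]) == wanted:
            return concept  # carries both code and label, filling in the missing one
    return None


by_code = resolve_concept("http://example.org/codelist/kae", " 6117 ", "code")
by_label = resolve_concept("http://example.org/codelist/kae", "road  CONSTRUCTION", "label")
```

Because the matched concept carries both the notation and the label, a dataset that ships only codes can recover the labels, and vice versa.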

Another scenario could be an intermediate layer/interface between the OS Packager and the FDPtoRDF pipeline, where a user could fine-tune the whole procedure

...or refinement could happen after the basic conversion.

fathoni commented 7 years ago

So I did a survey on around 75 budget/spending datasets from different levels of administration across different regions, regarding how the code lists were published. Indeed, @pwalsh, datasets that publish separate code lists are not that common, but as mentioned by @skarampatakis, they are not that rare either. I found that 9-10% of the code lists are published in a different file, so it would be nice if we managed to cover this.

The first solution proposed by @skarampatakis would be the most practical one. In this case, I think we need to add a way for the dataset publishers to attach their code list files, which can be transformed separately in the background if deemed necessary. The second proposed solution would be even nicer, though it may need more effort.

skarampatakis commented 7 years ago

@fathoni to which datasets do you refer?

fathoni commented 7 years ago

@skarampatakis I explored the datasets collected from the Open Spending datasets registry on Github and added several other datasets from different public administrations.

marek-dudas commented 7 years ago

So to sum up this issue: FDP does not support external codelists, so the question is what to do if you have an external codelist and create a dataset with OS-packager. I would say that the only solution is that you have to add the codelist manually after the FDP-to-RDF transformation. I think that such an experienced user wanting to link to an existing external codelist would not use OS-packager, but rather create a dataset specific transformation pipeline manually.

skarampatakis commented 7 years ago

You have to remember that PA officers are neither SW experts nor that experienced as users. I don't believe that at any point a user who is not a system administrator should have access to the pipelines themselves, nor should LP be accessed by users at all.

The use case is pretty simple. The original CSV contains just the code, but the code refers to an external codelist, or could be derived from it. The user could either upload the codelist or use one from a pool of available codelists.

A possible solution to all the problems mentioned in this issue would be to develop a new Packager, either based on the original OS Packager to save time or built from scratch (my preference), that would enable such configurations.

Then the data would be converted using the OBEU data model and uploaded to our endpoint and, if the user so requests, there would be the option to run a hook to convert the data to FDP and upload them to OS through an OS-specific pipeline. Now we are doing the opposite, while I believe the way I describe would be more valuable for all. E.g., missing codelist columns could be added using external codelists, or values could be mapped to resources instead of text strings.

jindrichmynarz commented 7 years ago

I agree, though the bottleneck here is the OS Packager, so it might be discussed elsewhere.

fathoni commented 7 years ago

I agree with the points mentioned by @skarampatakis above. The availability of an intermediate layer would allow any user to upload the separate code lists and provide extra information. In the end, this would ease the work of the data transformer responsible for the manual transformation in LP, since the transformer would already have the code lists collected in bulk. Indeed, datasets with external code lists are not as common as those with in-dataset code lists; however, our experience with Bonn / Aragon / ESIF / Italian OpenCoesione has always involved external code lists. Perhaps it would be worth considering?

skarampatakis commented 7 years ago

Adding that Greek datasets also use external codelists, or could easily be mapped to them, that makes almost 100% of the datasets we are studying in the project dependent on this. I don't see how this is not common.

I agree that this is not an issue of the FDPtoRDF pipeline in general, but it acted as the proof of concept.

pwalsh commented 7 years ago

I think, as @jindrichmynarz says, if the discussion is swinging to what features we expose to humans in OS Packager, then, we should close this thread and take that conversation up elsewhere.

@skarampatakis I think you misunderstand the use case that OS Packager solves:

If, as you say, the preferable solution is to build a new packager from scratch, you are certainly most welcome to do so, but I think you are vastly underestimating the work that goes into the user experience of an app like OS Packager, and how much time it takes to build something like that.

skarampatakis commented 7 years ago

@pwalsh IIRC, last January, after the initial development of some of the pipelines we developed for testing purposes and to give some input for WP2 and WP3, my impression was that there would be a wizard to configure very limited aspects of the pipelines. I think that @HimmelStein was to be responsible for this.

Now we have the option to use OS Packager and, if the data are correct and successfully uploaded, run a hook. By the way, there seems to be a bug here.

image

My proposed solution is not different from what OS Packager solves, with the difference that it would be more flexible if the user or the use case demands it. I am not underestimating the time it took to develop OS Packager or the whole OS platform.

It is offered as part of the OBEU suite for those users, and not to 100% provide the functionality of the handful of manually written pipelines from the past 2 years

As you said, it cannot serve 100% of our needs. But we cannot pretend that these cases are a minority, when they seem to make up a great part of our study. Simply put, we cannot stress the government and other public bodies to publish data according to OS specifications. They have their own and we are here to provide technical solutions on how to improve data publishing, analysis and transparency.

If a fork of OS Packager would be enough and would minimize development effort, I have no problem at all supporting this solution. But I believe an OS-independent solution would give us more flexibility. From my experience, it is easier and more practical to develop something from scratch than to try to transform something into something else that it was not designed for. If we could reuse some parts (UI or similar), that would be perfect.

At the end of the day, this is a decision we should make as a whole. But I think that now it is crystal clear to all that we need an alternative solution for data uploading between using OS Packager as is, and creating custom pipelines.

pwalsh commented 7 years ago

@skarampatakis

Simply put, we cannot stress the government and other public bodies to publish data according to OS specifications. They have their own and we are here to provide technical solutions on how to improve data publishing, analysis and transparency.

Ok, perhaps you have more experience working with governments on fiscal data than I do.

In any event, if you are confident that a complete rewrite of an application for user-facing data modelling is the way to go, in order to "improve data publishing, analysis and transparency", at month ~23 of this project, go right ahead :).

skarampatakis commented 7 years ago

Well, I don't think that at any point I argued about this. I just presented some facts.

I was concerned mostly by this comment

@fathoni OS Packager does not and will not handle cases of external code lists - such scenarios are extremely rare in real, published fiscal data - almost all data is published with codes and descriptions inlined in the data files.

Which seems to be a no-go for OS Packager in the case of at least all the Greek datasets we have encountered.

My personal opinion is that there has to be a solution, even if it is finished in the last week of the project, or even after the end of the project.