openbudgets / platform

Tracking issues related to work around the OpenBudgets.eu platform (WP4).

Unifying Properties in the Datasets for Improving Performance of SPARQL Queries in Rudolf #22

Closed: schmaluk closed this issue 7 years ago

schmaluk commented 7 years ago

Hello @larjohn, @jindrichmynarz , @jakubklimek and @marek-dudas , I thought it would be better to have this on GitHub rather than in email correspondence for better collaboration, so I have now put the email correspondence between @larjohn and me here. (I hope I have understood @larjohn correctly and am trying to restate it.)

Remark: @marek-dudas This problem occurs when you search in the http://eis-openbudgets.iais.fraunhofer.de/explorer for the datasets which have been imported by the FDP-to-RDF pipeline. They aren't shown in the Explorer.

There seem to be some differences in the naming of properties between different datasets which might not affect their correctness but can have a big effect on constructing performant SPARQL queries:

Please see the following SPARQL query in Rudolf:

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xro: <http://purl.org/xro/ns#>

SELECT ?attribute (MAX (?_label) AS ?label) (SAMPLE (?_title) AS ?title) ?attachment (SAMPLE (?_propertyType) AS ?propertyType) ?shortName (MAX (?_datasetName) AS ?datasetName) ?dataset (SAMPLE (?_datasetLabel) AS ?datasetLabel) ?currency ?year WHERE {
    ?dsd qb:component ?component .
    ?dataset a qb:DataSet ;
        qb:structure ?dsd ; <http://purl.org/dc/terms/title> ?_title .
    ?component ?componentProperty ?attribute .
    ?componentProperty rdfs:subPropertyOf qb:componentProperty .
    OPTIONAL {
        ?dataset <http://data.openbudgets.eu/ontology/dsd/attribute/currency> ?currency .
    }
    OPTIONAL {
        ?dataset rdfs:label ?_datasetLabel .
    }
    OPTIONAL {
        ?attribute rdfs:label ?_label .
    }
    OPTIONAL {
        ?component qb:componentAttachment ?attachment .
    }
    OPTIONAL {
        ?dataset <http://data.openbudgets.eu/ontology/dsd/dimension/fiscalYear> ?year .
    }
    OPTIONAL {
        ?attribute a ?_propertyType .
        FILTER (?_propertyType in (qb:MeasureProperty, qb:DimensionProperty, qb:CodedProperty))
    }
    BIND (REPLACE (str (?attribute), '^.*(#|/)', "") AS ?shortName)
    BIND (CONCAT (REPLACE (str (?dataset), '^.*(#|/)', ""), '__', SUBSTR (MD5 (STR (?dataset)), 1, 5)) AS ?_datasetName)
}
GROUP BY ?attribute ?shortName ?attachment ?dataset ?currency ?year

All these OPTIONALs lower the performance. If all the datasets were guaranteed to have these attributes, the OPTIONALs could be removed and a lot of performance could be gained.

Is there a way to run SPARQL queries in a liberal way that takes all the alternative possibilities into account while still being performant?

Can we somehow unify the naming of the properties & their values in the datasets? The validation & clearance pipeline is still missing; maybe it could be used to improve the quality of the datasets in that regard before moving datasets from the staging DB to the production DB. Or would another solution be better and easier here?

Thank you for your help

jakubklimek commented 7 years ago

Specifically, http://purl.org/dc/terms/title is not the same as http://purl.org/dc/elements/1.1/title, although they have the same meaning and in Linked Data terms should be interchangeable.

The dc: elements are deprecated in favor of dcterms: for use in Linked Data. I don't see dc: used in the query; is it used in some specific datasets? If so, it should be changed to dcterms:.

Is there a way to run SPARQL queries in a liberal way that takes all the alternative possibilities into account while still being performant?

Sometimes, when there are not many OPTIONALs, it is possible to rewrite them into UNIONs, which are often faster to execute.
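A minimal sketch of one such rewrite (illustrative only, not taken from Rudolf's actual queries): an OPTIONAL over a dataset label is equivalent to a UNION of the case where the label exists and the case where it does not.

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?dataset ?label WHERE {
    # branch 1: datasets that do have a label
    { ?dataset a qb:DataSet ; rdfs:label ?label . }
    UNION
    # branch 2: datasets without any label, kept so they are not dropped from the results
    { ?dataset a qb:DataSet .
      FILTER NOT EXISTS { ?dataset rdfs:label ?anyLabel } }
}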

The same also goes for the fiscalYear/fiscalPeriod properties and for their values, some of which contain the "gregorian" part while those in other datasets don't. All these differences pose a great risk both to the performance and to the correctness of the queries built by Rudolf.

On the other hand, this kind of heterogeneity is a quite natural aspect of Linked Data. In addition, the specific string of the IRI should not matter, as it is not meant to be parsed anyway. If, for example, a label for the year or period is needed, the IRI should be dereferenced and the label taken from the data. For performance reasons, this can be done in advance by an enhancement pipeline which would cache these responses, or parts of them, in a separate named graph.
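A sketch of what such an enhancement step might look like once the dereferenced label data has been loaded into the store; the cache graph IRI below is hypothetical:

PREFIX qb:   <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# copy the labels of all values used in observations into a dedicated cache graph
INSERT {
  GRAPH <http://data.openbudgets.eu/graph/label-cache> {
    ?value rdfs:label ?label .
  }
}
WHERE {
  ?obs a qb:Observation ;
       ?dimension ?value .
  ?value rdfs:label|skos:prefLabel ?label .
}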

In general: 1) I don't see the point in querying over all the heterogeneous datasets at once using one query - what sort of question does this query answer? A more typical workflow would be for the user to select a few datasets with which he/she wants to work, I suppose. Do we have any user stories to support one or the other? 2) It is natural to have some level of inconsistency. Instead of trying to solve it all in one query, this can be resolved progressively, e.g. by normalizing pipelines which compute the missing properties where possible (e.g. one specific label property such as dcterms:title out of all the used options - dcterms:title, schema:name, foaf:name, skos:prefLabel, ... - or fiscalPeriod from fiscalYear), so that so many OPTIONALs do not have to be used.
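A minimal sketch of one such normalizing step, as a SPARQL Update that materializes dcterms:title from the alternative label properties mentioned above (the exact property list and approach are illustrative, not an agreed pipeline):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema:  <http://schema.org/>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>

# add a canonical dcterms:title wherever only an alternative label property is present
INSERT { ?s dcterms:title ?label }
WHERE {
  VALUES ?labelProperty { schema:name foaf:name skos:prefLabel }
  ?s ?labelProperty ?label .
  FILTER NOT EXISTS { ?s dcterms:title ?existingTitle }
}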

larjohn commented 7 years ago

Some dimensions have special meaning - I am referring to the time and the currency dimensions, as they play a special role. Even though they are currently not used that way, due to performance issues, a very good use case would be to let us aggregate observations of a municipality across years. Now imagine Czechia joins the Eurozone, and this aggregation has to take into account two different currencies - exchange rates vary across years, so you also need a standard way to approach the time dimension.

It may seem that this is an extreme case, and/or that the Czech datasets will always be homogeneous, but a more abstract API like Rudolf would need to guess/know this information, in case we need to take into account the exchange rates and the inflation, the way @jindrichmynarz suggested in his blog post.

Regarding the normalization: it is required, otherwise it will be very difficult for the end users to discover the endless possibilities and their combinations themselves - that is why we built Rudolf in the first place.

jindrichmynarz commented 7 years ago

I think this is not a use case for a real-time query, because RDF stores in general haven't progressed enough to be capable of answering such complex queries. Instead, I see this as a use case for asynchronous ETL, in which the query can be split into several simple steps.

The variability of component properties in OpenBudgets.eu datasets in general doesn't indicate bad quality, but instead it simply reflects that the datasets are quite diverse.

larjohn commented 7 years ago

I totally agree on the complexity of the queries. Nevertheless, such a query (only the currency part) with two drilldown dimensions took me ~30 seconds on an 8 GB VM with a dual-core CPU, which I find acceptable. The problem was mainly scalability - a third currency increased the query execution time disproportionately.

Currently, "global" queries utilize UNIONS, as @jakubklimek suggested. A point for optimization is to eliminate dimensions that will not be anyway in the result set, as they are filtered out. For instance, if the user filters the "organization" dimension with the value Bonn, then the UNIONs needed are a lot fewer than if no filter was set. This will be implemented in a next commit of Rudolf. Other than that, about a third of the UNIONs don't come from the actual heterogeneity of the data structures, but from missing labels.

Currently, when creating "global" dimensions, all the actual dimensions are cached and reused during a query to the "global" cube. Each of these dimensions also has its inner attributes cached (prefLabel, notation etc.). If a single dataset is missing labels for a dimension codelist, this results in an additional UNION in the global cube's query, to avoid excluding the values that don't have labels. If all codelists were guaranteed to have prefLabels and notations, or at least a key and a label attribute (reusing URIs where this is not possible), the queries would be significantly simpler and more real-time use cases could become possible.
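A rough sketch of the kind of per-codelist attribute lookup described here (illustrative only, not Rudolf's actual cached query):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# fetch the key and label attributes for every concept of a given code list
SELECT ?concept ?prefLabel ?notation WHERE {
  ?concept skos:inScheme ?codeList ;
           skos:prefLabel ?prefLabel ;
           skos:notation ?notation .
}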

schmaluk commented 7 years ago

Thanks for the input, @jakubklimek @jindrichmynarz. 1) In the Explorer (http://eis-openbudgets.iais.fraunhofer.de/explorer/) all available datasets from the triplestore should be shown; this would require such a SPARQL query over all datasets, I think (?) 2) So you would suggest simplifying the SPARQL queries by using pipelines to make sure the datasets are more unified, so that @larjohn 's SPARQL queries can rely on specific attributes. (Maybe even @larjohn and @jindrichmynarz 's example of the exchange rates and the inflation can be taken into account by preprocessing the datasets via pipelines (?).)

jakubklimek commented 7 years ago

1) The list of all datasets is simple; what I meant was the complex query on top of all the data. 2) It is a possibility. Of course, there is a price: in order for users to see their data through Rudolf, they would additionally have to wait until their dataset is enriched by these pipelines. So the question is whether this would be preferable to the current state.
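For reference, the dataset listing alone can be done with a simple query along these lines (a minimal sketch; the title property follows the dcterms: convention discussed above):

PREFIX qb:      <http://purl.org/linked-data/cube#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?dataset ?title WHERE {
  ?dataset a qb:DataSet .
  OPTIONAL { ?dataset dcterms:title ?title }
}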

jindrichmynarz commented 7 years ago

Code lists with missing values may be provided with synthetic, automatically generated values as we discussed in September with @larjohn. This can be done in the clearance pipeline that synchronizes data to the production RDF store.
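A sketch of what such synthetic generation could look like as a SPARQL Update in the clearance pipeline, deriving a label from the concept IRI's local name where none exists (illustrative only; the actual pipeline step may differ):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# give code list concepts without a prefLabel a synthetic one based on the IRI's local name
INSERT { ?concept skos:prefLabel ?syntheticLabel }
WHERE {
  ?concept skos:inScheme ?codeList .
  FILTER NOT EXISTS { ?concept skos:prefLabel ?existingLabel }
  BIND (REPLACE(STR(?concept), '^.*(#|/)', "") AS ?syntheticLabel)
}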

1) In the Explorer (http://eis-openbudgets.iais.fraunhofer.de/explorer/) all available datasets from the triplestore should be shown; this would require such a SPARQL query over all datasets, I think (?)

Yes, you'd query over the default union graph of the RDF stores.

2) So you would suggest simplifying the SPARQL queries by using pipelines to make sure the datasets are more unified, so that @larjohn 's SPARQL queries can rely on specific attributes. Maybe even @larjohn 's example of the exchange rates and the inflation can be taken into account by preprocessing the datasets via pipelines (?).

Some of the issues can be resolved by pre-processing, such as the missing labels in code lists I mentioned above. However, querying over multiple heterogeneous datasets is an inherently difficult problem to which neither RDF nor non-RDF technology has a good answer. I think we should first clarify what the use cases for (near) real-time queries are, so that we have a narrower focus.

larjohn commented 7 years ago

I generally consider 1 minute to be near real-time. Unfortunately, saying that a use case is of interest means little compared to whether it can actually be run in real time. Then scalability comes into play: what happens if 100 new datasets are added? Will it still be possible to run the same queries? A solution to this is the creation of graphs, as you suggest.

A nice use case would be the following: create a map of the EU and, given the existing datasets in the triple store, render on each available municipality a small line chart showing the progress of the amounts spent on Health during the last 5 years. Please don't dwell on this specific case's significance - I am just exploring possibilities, not defending this very specific scenario. So, this use case is currently ALMOST doable, and the "almost" part is what bothers me most:

  1. Code lists should be linked
  2. Queries with linked code list terms should be created, tested, optimized
  3. Budgetary Unit vs Organization: currently, in the OpenSpending Viewer such a dimension is used to get geolocation information. This is not yet ready in Rudolf, but even so, how do we know these two could be grouped to provide geolocation if they don't have anything in common? (They might - I am not so sure.)
  4. Now let Greece exit the euro and you can't even draw a single line chart for any Greek municipality if you can't get currency exchange rates live.
  5. OpenSpending Viewer cannot currently dynamically select the most appropriate map (the whole EU map in this case) to display data, but this is also doable.

EDIT: also consider coloring each municipality according to amount, which is how the OS Viewer works right now: how do you use the same color code with different currencies?

jindrichmynarz commented 7 years ago
  1. Code lists should be linked

While linking code lists can be done manually by domain experts, it takes too much effort unless you only link code list concepts addressing a particular use case, such as the health expenditure tracking you mentioned. Unfortunately, there is usually not enough data in code lists to enable reliable automated linking.

  3. Budgetary Unit vs Organization: currently, in the OpenSpending Viewer such a dimension is used to get geolocation information. This is not yet ready in Rudolf, but even so, how do we know these two could be grouped to provide geolocation if they don't have anything in common? (They might - I am not so sure.)

obeu-dimension:budgetaryUnit was deprecated some time ago in favour of obeu-dimension:organization, so datasets should not use obeu-dimension:budgetaryUnit. If there are such datasets, either their ETL pipeline can be updated or the obeu-dimension:budgetaryUnit can be automatically rewritten to obeu-dimension:organization when synchronizing data to the production RDF store. Since data about the organization that publishes its budget or spending is typically implicit, it would need to be added manually by the ETL developers, if you want to be able to do geocoding.
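A minimal sketch of such an automatic rewrite at synchronization time, assuming the obeu-dimension: prefix expands to http://data.openbudgets.eu/ontology/dsd/dimension/ (as the fiscalYear IRI in the query above suggests); the DSD components referencing the property would need the same treatment:

PREFIX obeu-dimension: <http://data.openbudgets.eu/ontology/dsd/dimension/>

# rewrite the deprecated dimension property on observations
DELETE { ?obs obeu-dimension:budgetaryUnit ?unit }
INSERT { ?obs obeu-dimension:organization ?unit }
WHERE  { ?obs obeu-dimension:budgetaryUnit ?unit }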

  4. Now let Greece exit the euro and you can't even draw a single line chart for any Greek municipality if you can't get currency exchange rates live.

In that case, normalization using GDP deflators could help, right? If we want to compare a non-EU country, we'd need to obtain the GDP deflators from somewhere other than Eurostat (e.g., OECD), but in general it should be doable.

larjohn commented 7 years ago

Right, so far:

  1. We will try to discuss a code list linking use case either in Prague or earlier - Fabrizio contacted me about this
  2. Good to know budgetaryUnit is deprecated - I've seen it in the Czech datasets, if I am not mistaken. If we use DBpedia entities for this one, we can get ISO codes for free and voila, the OS Viewer can digest those with a little more effort.
  3. Labels AND notation (keys) should be everywhere. They mean additional effort/space, but it is worth it
  4. Regarding metadata:
    • Has the discussion here reached a useful conclusion? Can I find some kind of specification somewhere? Please excuse me if you have presented it and I can't remember.
    • Can we just accept that the data.gov.uk years will contain gregorian or not? Then the creation of a "yearly" exchange rates dataset would be possible using the data from linked.opendata.cz (I am already doing this, but I have been constructing years without the gregorian part).
  5. I might need a helping hand to use the GDP deflators for EU. For the time being, currency conversion could be enough.

With 2, 3, and 4, a performance gain should be expected in Rudolf.

jindrichmynarz commented 7 years ago
  2. Good to know budgetaryUnit is deprecated - I've seen it in the Czech datasets, if I am not mistaken. If we use DBpedia entities for this one, we can get ISO codes for free and voila, the OS Viewer can digest those with a little more effort.

obeu-dimension:budgetaryUnit is used mostly in Aragón datasets. Here are the DSDs that include the property:

http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2006
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2007
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2008
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2009
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2010
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2011
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2012
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2013
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2014
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2015
http://data.openbudgets.eu/ontology/dsd/aragon-budget-exp-2016
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2006
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2007
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2008
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2009
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2010
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2011
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2012
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2013
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2014
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2015
http://data.openbudgets.eu/ontology/dsd/aragon-budget-inc-2016
http://data.openbudgets.eu/ontology/dsd/bonn-budget
http://data.openbudgets.eu/ontology/dsd/eu-budget-2014
http://example.openbudgets.eu/ontology/dsd/ESIF_2014-2020

Not every one of these datasets was selected in #16 to be used further, so some don't need updating.

  3. Labels AND notation (keys) should be everywhere. They mean additional effort/space, but it is worth it

Agreed.

Regarding metadata:

  • Has the discussion here reached a useful conclusion? Can I find some kind of specification somewhere? Please excuse me if you have presented it and I can't remember.

Yes, recommended metadata are described in section 9 of D1.5 (p. 33).

Can we just accept that the data.gov.uk years will contain gregorian or not? Then the creation of a "yearly" exchange rates dataset would be possible using the data from linked.opendata.cz (I am already doing this, but I have been constructing years without the gregorian part).

This is not explicitly specified in D1.5, but we use years from the Gregorian calendar in the examples there. We can establish a convention of using Gregorian years. You can also use the yearly exchange rates from LinkedStatistics' Eurostat, but they use xsd:date values to represent years.
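If the Gregorian-year convention is adopted, the xsd:date values could be mapped to year IRIs with a BIND along these lines (a rough sketch; the reference.data.gov.uk URI pattern is an assumption based on the "gregorian" naming discussed above):

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?obs ?yearIri WHERE {
  ?obs ?timeDimension ?refPeriod .
  FILTER (datatype(?refPeriod) = xsd:date)
  # assumed URI pattern: http://reference.data.gov.uk/id/gregorian-year/YYYY
  BIND (IRI(CONCAT("http://reference.data.gov.uk/id/gregorian-year/",
                   SUBSTR(STR(?refPeriod), 1, 4))) AS ?yearIri)
}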

I might need a helping hand to use the GDP deflators for EU. For the time being, currency conversion could be enough.

Sure, I'm happy to provide you with support regarding the monetary value normalization pipeline fragment.

pwalsh commented 7 years ago

@jindrichmynarz @larjohn @badmotor

Should this be kept open, for action, or not?

If it should be actioned, let's assign a single person with responsibility.

jindrichmynarz commented 7 years ago

@larjohn, is this still a problem in Rudolf?

pwalsh commented 7 years ago

@larjohn @skarampatakis please advise.

larjohn commented 7 years ago

I am closing this as it is mostly a matter of fixing older datasets, which does not seem to be too realistic currently. Otherwise, some of the issues have indeed been fixed.