openbudgets / platform

Tracking issues related to the work around the OpenBudgets.eu platform (WP4).
GNU General Public License v3.0

Setting Up Virtuoso #19

Closed: schmaluk closed this issue 7 years ago

schmaluk commented 8 years ago

Hello @larjohn and @jindrichmynarz, I just wanted to ask if we have a default graph for Virtuoso, since I'm supposed to define one using a pre-built Virtuoso Docker image. Thanks

jindrichmynarz commented 8 years ago

There's no need to define an explicit default graph IRI. If you're using this Docker image, then the default graph defaults to http://localhost:8890/DAV.
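
If you do need a different default graph IRI, and assuming the image in question is the commonly used tenforce/virtuoso one, it can be set via an environment variable when the container is started; a minimal sketch (password, volume path, and graph IRI below are placeholders):

docker run --name virtuoso \
  -p 8890:8890 -p 1111:1111 \
  -e DBA_PASSWORD=dba \
  -e DEFAULT_GRAPH=http://example.org/default-graph \
  -v /my/virtuoso/db:/data \
  -d tenforce/virtuoso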

schmaluk commented 8 years ago

Thanks @jindrichmynarz for the help. [screenshot: virtuoso_frage]

Would you recommend using the dump files here: http://eis-openbudgets.iais.fraunhofer.de/dumps/ for the import into the Virtuoso DB, together with the name of the graph as the "Named Graph IRI": http://data.openbudgets.eu/resource/dataset/???? with ???? = the name of the dataset file?

Examples:

  Aragon: Named Graph IRI: http://data.openbudgets.eu/resource/dataset/aragon-income-2016
  Greek: Named Graph IRI: http://data.openbudgets.eu/resource/dataset/budget-athens-revenue-2009

Or should I ask the pipeline developers to re-execute the pipelines for importing the datasets into Virtuoso?

jindrichmynarz commented 8 years ago

You can use the Virtuoso bulk loader and generate *.graph files containing the IRIs of the named graphs to load the dumps into. In the case of OpenBudgets.eu datasets, the IRIs of the named graphs are the same as the IRIs of the qb:DataSet instances in the graphs. This makes it possible to generate the *.graph files automatically, for example with the following shell script:

#!/bin/bash

die () {
  echo >&2 "$@"
  exit 1
}

# Select the IRI of a qb:DataSet instance
QUERY="SELECT ?dataset WHERE { ?dataset a <http://purl.org/linked-data/cube#DataSet> . } LIMIT 1"

# Test if arq is installed.
command -v arq >/dev/null 2>&1 || die "Missing Jena ARQ!"

# Change into the dumps directory provided as the first argument of the script
cd "$1" || die "Cannot change into directory $1!"
# Generate a *.graph file for each *.ttl file.
find . -type f -name "*.ttl" -print0 | while read -r -d $'\0' file
do
  arq --results CSV --data "$file" "$QUERY" | tail -1 > "$file.graph"
done

If this is possible, it may be preferable to asking the pipeline developers to re-run their pipelines. Re-running the pipelines will eventually be necessary, but I think for the moment we can save that effort with automation.
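
For completeness: once the *.graph files sit next to the dumps, the bulk load itself boils down to a few isql commands. A sketch with placeholder port, credentials, and path (the dumps directory has to be listed in DirsAllowed in virtuoso.ini, and per-file *.graph files take precedence over the graph IRI passed to ld_dir):

# Register the files (use ld_dir_all instead of ld_dir to descend into subdirectories).
isql 1111 dba dba exec="ld_dir('/dumps', '*.ttl', 'http://localhost:8890/DAV');"
# Run the loader and persist the result.
isql 1111 dba dba exec="rdf_loader_run();"
isql 1111 dba dba exec="checkpoint;"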

schmaluk commented 8 years ago

@jindrichmynarz For a lot of *.ttl files I don't get the name of the graph back, just the word "dataset". I can do the bulk load, but I'm not sure how good the result will then be. Thanks for advising me so far. This happens, for example, for:

./estructura_funcional_aragon_2012.ttl.graph
./data.ttl.graph
./estructura_financiacion_g_aragon_2009.ttl.graph
./estructura_economica_g_aragon_2010.ttl.graph
./estructura_economica_i_aragon_2009.ttl.graph
./estructura_financiacion_g_aragon_2010.ttl.graph
./estructura_organica_aragon_2009.ttl.graph
./estructura_economica_i_aragon_2015.ttl.graph
./estructura_economica_i_aragon_2013.ttl.graph
./estructura_economica_g_aragon_2014.ttl.graph
./estructura_financiacion_g_aragon_2012.ttl.graph
./estructura_financiacion_i_aragon_2008.ttl.graph
./estructura_economica_i_aragon_2012.ttl.graph
./estructura_financiacion_g_aragon_2008.ttl.graph
./estructura_economica_g_aragon_2012.ttl.graph
./estructura_financiacion_g_aragon_2013.ttl.graph
./estructura_economica_i_aragon_2014.ttl.graph
./estructura_financiacion_i_aragon_2010.ttl.graph
./estructura_economica_g_aragon_2013.ttl.graph
./estructura_funcional_aragon_2007.ttl.graph
./estructura_economica_g_aragon_2006.ttl.graph
./estructura_funcional_aragon_2009.ttl.graph
./estructura_funcional_aragon_2016.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2014.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2015.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2013.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2011.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2014.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2010.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2011.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2016.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2016.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2015.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2011-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2014-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2012-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2013-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2011-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2012-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2014-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2013-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2015-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2015-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-revenue-codelist-2015.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-revenue-codelist-2012.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-expenditure-codelist-2015.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-revenue-codelist-2011.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-expenditure-codelist-2011.ttl.graph


jindrichmynarz commented 8 years ago

I see. You get the word "dataset" because the query has no results, so only the variable name (i.e. "dataset") is shown. You get no results for these files because they are not data cubes but code lists, so they don't contain any instance of qb:DataSet. I wasn't aware that you're loading code lists too, hence the problem. However, I think the problem has a simple solution. In the case of code lists we agreed that the IRI of their named graph is the same as the IRI of the instance of skos:ConceptScheme. Knowing this, we can amend the extraction script like this:

#!/bin/bash

die () {
  echo >&2 "$@"
  exit 1
}

# Select the IRI of a qb:DataSet instance, falling back to a skos:ConceptScheme instance
QUERY=$(cat <<-END
  PREFIX qb:   <http://purl.org/linked-data/cube#>
  PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

  SELECT ?graph
  WHERE {
    VALUES (?class             ?order) {
           (qb:DataSet         1)
           (skos:ConceptScheme 2)
    }
    ?graph a ?class .
  }
  ORDER BY ?order
  LIMIT 1
END
)

# Test if arq is installed.
command -v arq >/dev/null 2>&1 || die "Missing Jena ARQ!"

# Change into the dumps directory provided as the first argument of the script
cd "$1" || die "Cannot change into directory $1!"
# Generate a *.graph file for each *.ttl file.
find . -type f -name "*.ttl" -print0 | while read -r -d $'\0' file
do
  RESULTS=$(arq --results CSV --data "$file" "$QUERY")
  [ "$(echo "$RESULTS" | wc -l)" -eq 2 ] ||
    die "Named graph not found in $file!"
  echo "$RESULTS" | tail -1 > "$file.graph"
done

The SPARQL query outputs either a qb:DataSet's IRI or a skos:ConceptScheme's IRI, with qb:DataSet preferred (a data cube can come with auxiliary code lists). If there is no instance of qb:DataSet or skos:ConceptScheme, the script exits with an error.

larjohn commented 8 years ago

If you do upload the datasets, can you please also upload:

  1. all files from the codelists repo
  2. the files from this repo: https://github.com/openbudgets/auxiliary-data

jakubklimek commented 8 years ago

This clearly calls for storing the output RDF datasets in TriG. That would make them much easier to load into Virtuoso.
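
For illustration, in TriG the named graph is part of the serialization itself, so no separate *.graph file would be needed; e.g. (using a dataset IRI from this thread as an example):

@prefix qb: <http://purl.org/linked-data/cube#> .

<http://data.openbudgets.eu/resource/dataset/aragon-income-2016> {
  <http://data.openbudgets.eu/resource/dataset/aragon-income-2016> a qb:DataSet .
  # ... all remaining triples of the dataset go inside this graph block ...
}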

schmaluk commented 8 years ago

Thanks a lot. @jakubklimek Would that be a task for the pipelines or a separate process involving the triple store? In the meantime I will try @jindrichmynarz's solution in order to get a fast result.

jakubklimek commented 8 years ago

This would be a task for the pipelines. It should be included in the pipeline update that adds the metadata from D1.5.

schmaluk commented 8 years ago

@larjohn Should I import both folders in https://github.com/openbudgets/Code-lists (OpenRefine & UnifiedViews)?

Sorry for asking so frequently @jindrichmynarz. There are DSD files for which no graph name can be found. For example:

Fri Oct 7 12:45:49 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2011-dsd.ttl
Fri Oct 7 12:45:51 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2004-dsd.ttl
Fri Oct 7 12:45:52 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2012-dsd.ttl
Fri Oct 7 12:45:54 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2003-dsd.ttl
Fri Oct 7 12:45:55 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2004-dsd.ttl
Fri Oct 7 12:45:57 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2002-dsd.ttl
Fri Oct 7 12:45:58 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2005-dsd.ttl
Fri Oct 7 12:46:00 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2016-dsd.ttl
Fri Oct 7 12:46:02 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2006-dsd.ttl
Fri Oct 7 12:46:06 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2009-dsd.ttl
Fri Oct 7 12:46:09 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2008-dsd.ttl
Fri Oct 7 12:46:12 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2016-dsd.ttl

That's probably OK(?)

larjohn commented 8 years ago

Only those at

https://github.com/openbudgets/Code-lists/tree/master/UnifiedViews/skosified

jindrichmynarz commented 8 years ago

I think the data structure definitions (DSDs) should be stored in the same named graph as the dataset they describe. However, there's no way to know that just from the DSD files. You'd need to make a lookup table by scanning the datasets and extracting the qb:structure links, which connect datasets to their DSDs.
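
Roughly, such a lookup table could be generated with the same arq-based approach as above (the output file name is just an example):

# Collect dataset-to-DSD links from all dumps into one CSV lookup table.
QUERY="PREFIX qb: <http://purl.org/linked-data/cube#> SELECT ?dataset ?dsd WHERE { ?dataset a qb:DataSet ; qb:structure ?dsd . }"
find . -type f -name "*.ttl" -print0 | while read -r -d $'\0' file
do
  arq --results CSV --data "$file" "$QUERY" | tail -n +2
done | sort -u > dataset-dsd-lookup.csv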

I think this highlights what @jakubklimek said above: this would be easier if we used TriG to serialize the data, since it explicitly records the dataset's named graph. However, if we don't want to force pipeline maintainers to update their pipelines at this moment, a provisional solution might be to load the DSDs into the default graph. The Virtuoso bulk loader loads data into the default graph if it doesn't find a *.graph file. So we can change the script to avoid creating a *.graph file if no named graph is found:

#!/bin/bash

die () {
  echo >&2 "$@"
  exit 1
}

# Select the IRI of a qb:DataSet instance, falling back to a skos:ConceptScheme instance
QUERY=$(cat <<-END
  PREFIX qb:   <http://purl.org/linked-data/cube#>
  PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

  SELECT ?graph
  WHERE {
    VALUES (?class             ?order) {
           (qb:DataSet         1)
           (skos:ConceptScheme 2)
    }
    ?graph a ?class .
  }
  ORDER BY ?order
  LIMIT 1
END
)

# Test if arq is installed.
command -v arq >/dev/null 2>&1 || die "Missing Jena ARQ!"

# Change into the dumps directory provided as the first argument of the script
cd "$1" || die "Cannot change into directory $1!"
# Generate a *.graph file for each *.ttl file.
find . -type f -name "*.ttl" -print0 | while read -r -d $'\0' file
do
  RESULTS=$(arq --results CSV --data "$file" "$QUERY")
  if [ "$(echo "$RESULTS" | wc -l)" -eq 2 ]; then
    echo "$RESULTS" | tail -1 > "$file.graph"
  fi
done

schmaluk commented 8 years ago

@larjohn The datasets (all *.ttl files in the dumps folder + skosified folder + auxiliary-data folder) have hopefully been uploaded to virtuoso_staging. In 30 minutes they should also be transferred to virtuoso_production. The SPARQL endpoint of virtuoso_production is exposed for reading at: http://eis-openbudgets.iais.fraunhofer.de/virtuoso/sparql Hope this is working.
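
One quick way to sanity-check what was loaded is to list the named graphs via a plain SPARQL protocol GET request, for example (any SPARQL client works, curl is just an illustration):

curl -G 'http://eis-openbudgets.iais.fraunhofer.de/virtuoso/sparql' \
  -H 'Accept: text/csv' \
  --data-urlencode 'query=SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 100'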

larjohn commented 8 years ago

The OBEU model triples are also needed. I uploaded them to the staging triple store as well. But I can only see three datasets. Why is that?

schmaluk commented 8 years ago

@larjohn Sorry, a command for the actual execution of the import was missing on my side. It should now be imported.

larjohn commented 8 years ago

The following query is returning too many tuples. Why is that?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xro: <http://purl.org/xro/ns#>

SELECT ?dataset ?attribute ?type ?dsd ?component
WHERE {
  ?type rdfs:subClassOf qb:ComponentProperty .
  ?dsd qb:component ?component .
  ?attribute a ?type .
  ?dataset a qb:DataSet ;
    qb:structure ?dsd .
  ?component ?componentProperty ?attribute .
}

Adding DISTINCT also times out the triple store.

jindrichmynarz commented 8 years ago

What do you mean by "too many" results? Do you mean that there are duplicate solutions for this query?

larjohn commented 8 years ago

Yes, there are duplicates, and their number does not appear to be consistent: some solutions have more duplicates than others.

schmaluk commented 8 years ago

@larjohn I don't know why the duplicates are showing up, but at least running the SPARQL SELECT with DISTINCT has worked for me against the endpoint: http://eis-openbudgets.iais.fraunhofer.de/virtuoso/sparql

larjohn commented 8 years ago

It now runs faster; I will check and report back.

schmaluk commented 8 years ago

@larjohn Every hour a cron job runs that copies everything from staging to production by replacing the DB files. During that process a timeout can happen.

jindrichmynarz commented 8 years ago

The "duplicates" have several causes.

First, a component property may be used in multiple qb:ComponentSpecifications across multiple datasets. By the way, I don't know what use including component specifications in the results has, because they are typically blank nodes.
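
For illustration, this first cause can be checked with a query analogous to the one below, counting in how many component specifications each component property appears:

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?attribute (COUNT(DISTINCT ?component) AS ?specifications)
WHERE {
  [] qb:component ?component .
  ?component ?componentProperty ?attribute .
  ?attribute a ?type .
  ?type rdfs:subClassOf qb:ComponentProperty .
}
GROUP BY ?attribute
HAVING (COUNT(DISTINCT ?component) > 1)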

Second, there are component properties that are instances of more than one subclass of qb:ComponentProperty. See:

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?attribute (GROUP_CONCAT(DISTINCT ?type; separator = ", ") AS ?types)
WHERE {
  [] qb:component [ ?componentProperty ?attribute ] .
  ?attribute a ?type .
  ?type rdfs:subClassOf qb:ComponentProperty .
}
GROUP BY ?attribute
HAVING (COUNT(DISTINCT ?type) > 1)

For each additional class instantiated, you will get one more result per component property.

pwalsh commented 7 years ago

@badmotor I'm closing this. It must be out of date.