Closed: schmaluk closed this issue 7 years ago.

Hello @larjohn and @jindrichmynarz, I just wanted to ask if we have a default graph for Virtuoso, since I'm supposed to define one using a pre-built Virtuoso Docker image. Thanks.
There's no need to define an explicit default graph IRI. If you're using this Docker image, then the default graph defaults to http://localhost:8890/DAV.
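For a quick check against a locally running container, that default graph can be queried directly over HTTP. This is only a sketch, assuming the container exposes Virtuoso on the standard port 8890:

# Count the triples in the default graph of the Dockerized Virtuoso
# (assumes the SPARQL endpoint is reachable at localhost:8890).
curl -G 'http://localhost:8890/sparql' \
     --data-urlencode 'query=SELECT (COUNT(*) AS ?triples) FROM <http://localhost:8890/DAV> WHERE { ?s ?p ?o }' \
     -H 'Accept: text/csv'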
Thanks @jindrichmynarz for the help.
Would you recommend using the dump files here: http://eis-openbudgets.iais.fraunhofer.de/dumps/ for the import into the Virtuoso DB, together with the name of the graph as the "Named Graph IRI": http://data.openbudgets.eu/resource/dataset/???? with ???? = the name of the dataset file?
Examples:
Aragon: Named Graph IRI: http://data.openbudgets.eu/resource/dataset/aragon-income-2016
Greek: Named Graph IRI: http://data.openbudgets.eu/resource/dataset/budget-athens-revenue-2009
Or should I ask the pipeline developers to re-execute the pipelines to import the datasets into Virtuoso?
You can use the Virtuoso bulk loader and generate *.graph files containing the IRIs of the named graphs to load the dumps into. In the case of OpenBudgets.eu datasets, the IRIs of the named graphs are the same as the IRIs of the qb:DataSet instances in the graphs. This makes it possible to generate the *.graph files automatically, for example using the following shell script:
#!/bin/bash

# Abort with an error message.
die () {
  echo >&2 "$@"
  exit 1
}

# Select the IRI of the qb:DataSet instance.
QUERY="SELECT ?dataset WHERE { ?dataset a <http://purl.org/linked-data/cube#DataSet> . } LIMIT 1"

# Test if arq is installed.
command -v arq >/dev/null 2>&1 || die "Missing Jena ARQ!"

# Change into the dumps directory provided as the first argument of the script.
cd "$1" || die "Cannot change into $1!"

# Generate a *.graph file for each *.ttl file.
find . -type f -name "*.ttl" -print0 | while IFS= read -r -d '' file
do
  arq --results CSV --data "$file" "$QUERY" | tail -1 > "$file.graph"
done
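Once each dump has its *.graph file, the bulk load itself is registered and run from the isql command line. The following is only a sketch: the dumps directory, port, and dba credentials are assumptions you would need to adapt, and the directory must be listed in DirsAllowed in virtuoso.ini.

# Register all Turtle dumps in /dumps for loading; files with a *.graph
# neighbour go to that graph, the rest go to the fallback graph given here.
isql 1111 dba dba exec="ld_dir('/dumps', '*.ttl', 'http://localhost:8890/DAV'); rdf_loader_run(); checkpoint;"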
If this is possible, it may be preferable to asking the pipeline developers to re-run their pipelines. Re-running the pipelines would eventually be necessary, but I think at the moment we can save that effort with automation.
@jindrichmynarz For a lot of the *.ttl files I don't get the name of the graph back, but just the content "dataset". I can do the bulk load, but I'm not sure how good the result will then be. Thanks for advising me so far. For example for:
./estructura_funcional_aragon_2012.ttl.graph
./data.ttl.graph
./estructura_financiacion_g_aragon_2009.ttl.graph
./estructura_economica_g_aragon_2010.ttl.graph
./estructura_economica_i_aragon_2009.ttl.graph
./estructura_financiacion_g_aragon_2010.ttl.graph
./estructura_organica_aragon_2009.ttl.graph
./estructura_economica_i_aragon_2015.ttl.graph
./estructura_economica_i_aragon_2013.ttl.graph
./estructura_economica_g_aragon_2014.ttl.graph
./estructura_financiacion_g_aragon_2012.ttl.graph
./estructura_financiacion_i_aragon_2008.ttl.graph
./estructura_economica_i_aragon_2012.ttl.graph
./estructura_financiacion_g_aragon_2008.ttl.graph
./estructura_economica_g_aragon_2012.ttl.graph
./estructura_financiacion_g_aragon_2013.ttl.graph
./estructura_economica_i_aragon_2014.ttl.graph
./estructura_financiacion_i_aragon_2010.ttl.graph
./estructura_economica_g_aragon_2013.ttl.graph
./estructura_funcional_aragon_2007.ttl.graph
./estructura_economica_g_aragon_2006.ttl.graph
./estructura_funcional_aragon_2009.ttl.graph
./estructura_funcional_aragon_2016.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2014.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2015.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2013.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2011.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2014.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2010.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2011.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2016.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-revenue-codelist-2016.ttl.graph
./greek-municipalities/municipality-of-veroia/codelist/veroia-budget-expenditure-codelist-2015.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2011-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2014-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2012-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2013-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2011-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2012-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2014-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2013-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-be2015-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/dsd/thess-br2015-dsd.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-revenue-codelist-2015.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-revenue-codelist-2012.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-expenditure-codelist-2015.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-revenue-codelist-2011.ttl.graph
./greek-municipalities/municipality-of-thessaloniki/codelist/budget-thessaloniki-expenditure-codelist-2011.ttl.graph
I see. You get the word "dataset" because the query has no results, so only the variable name (i.e. "dataset") is shown. You get no results for these files because they are not data cubes but code lists, so they don't contain any instance of qb:DataSet. I wasn't aware that you're loading code lists too, hence the problem. However, I think the problem has a simple solution. In the case of code lists we agreed that the IRI of their named graph is the same as the IRI of the instance of skos:ConceptScheme. Knowing this, we can amend the extraction script like this:
#!/bin/bash

# Abort with an error message.
die () {
  echo >&2 "$@"
  exit 1
}

# Select the IRI of the qb:DataSet or skos:ConceptScheme instance,
# preferring qb:DataSet if both are present.
QUERY=$(cat <<-END
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?graph
WHERE {
  VALUES (?class ?order) {
    (qb:DataSet 1)
    (skos:ConceptScheme 2)
  }
  ?graph a ?class .
}
ORDER BY ?order
LIMIT 1
END
)

# Test if arq is installed.
command -v arq >/dev/null 2>&1 || die "Missing Jena ARQ!"

# Change into the dumps directory provided as the first argument of the script.
cd "$1" || die "Cannot change into $1!"

# Generate a *.graph file for each *.ttl file.
find . -type f -name "*.ttl" -print0 | while IFS= read -r -d '' file
do
  RESULTS=$(arq --results CSV --data "$file" "$QUERY")
  # The CSV output has a header line plus one result line when a graph is found.
  [ "$(echo "$RESULTS" | wc -l)" -eq 2 ] || die "Named graph not found in $file!"
  echo "$RESULTS" | tail -1 > "$file.graph"
done
The SPARQL query will output either a qb:DataSet's IRI or a skos:ConceptScheme's IRI, with qb:DataSet preferred (you can have a data cube with auxiliary code lists). If there is no instance of qb:DataSet or skos:ConceptScheme, the script will exit.
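For illustration, the output the script expects for a single dump is a two-line CSV, the header followed by one IRI, which is what the wc -l check relies on. The file name below is made up; the IRI is the Aragon example mentioned earlier in this thread:

$ arq --results CSV --data aragon-income-2016.ttl "$QUERY"
graph
http://data.openbudgets.eu/resource/dataset/aragon-income-2016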
If you do upload the datasets, can you please also upload:
This clearly poses a requirement to store the output RDF datasets in TriG. They are then much easier to load into Virtuoso.
Thanks a lot. @jakubklimek Would that be a task for the pipelines or a separate process involving the triple store? In the meantime I will try @jindrichmynarz's solution in order to get a fast result.
This would be a task for the pipelines. It should be included in the pipelines update for inclusion of metadata from D1.5.
@larjohn Should I import both folders in https://github.com/openbudgets/Code-lists, OpenRefine & UnifiedViews?
Sorry for asking so frequently, @jindrichmynarz. There are DSD files for which no graph name can be found. For example:
Fri Oct 7 12:45:49 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2011-dsd.ttl
Fri Oct 7 12:45:51 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2004-dsd.ttl
Fri Oct 7 12:45:52 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2012-dsd.ttl
Fri Oct 7 12:45:54 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2003-dsd.ttl
Fri Oct 7 12:45:55 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2004-dsd.ttl
Fri Oct 7 12:45:57 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2002-dsd.ttl
Fri Oct 7 12:45:58 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2005-dsd.ttl
Fri Oct 7 12:46:00 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2016-dsd.ttl
Fri Oct 7 12:46:02 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2006-dsd.ttl
Fri Oct 7 12:46:06 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2009-dsd.ttl
Fri Oct 7 12:46:09 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-br2008-dsd.ttl
Fri Oct 7 12:46:12 CEST 2016 No graphname found for ttl-file: greek-municipalities/municipality-of-kalamaria/dsd/kalamaria-be2016-dsd.ttl
That's probably OK (?)
I think the data structure definitions (DSDs) should be stored in the same named graph as the dataset described by them. However, there's no way to know that just from the DSD files. You'd need to build a lookup table by scanning the datasets and extracting the qb:structure links, which connect datasets to their DSDs.
I think this highlights what @jakubklimek said above: this would be easier if we used TriG to serialize the data, since it explicitly describes the dataset's named graph. However, if we don't want to force pipeline maintainers to update their pipelines at this moment, I think a provisional solution might be to load the DSDs into the default graph. The Virtuoso bulk loader loads data into the default graph if it doesn't find a *.graph file, so we can change the script to avoid creating a *.graph file if no named graph is found:
#!/bin/bash

# Abort with an error message.
die () {
  echo >&2 "$@"
  exit 1
}

# Select the IRI of the qb:DataSet or skos:ConceptScheme instance,
# preferring qb:DataSet if both are present.
QUERY=$(cat <<-END
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?graph
WHERE {
  VALUES (?class ?order) {
    (qb:DataSet 1)
    (skos:ConceptScheme 2)
  }
  ?graph a ?class .
}
ORDER BY ?order
LIMIT 1
END
)

# Test if arq is installed.
command -v arq >/dev/null 2>&1 || die "Missing Jena ARQ!"

# Change into the dumps directory provided as the first argument of the script.
cd "$1" || die "Cannot change into $1!"

# Generate a *.graph file for each *.ttl file that contains a named graph IRI;
# files without one get no *.graph file and are loaded into the default graph.
find . -type f -name "*.ttl" -print0 | while IFS= read -r -d '' file
do
  RESULTS=$(arq --results CSV --data "$file" "$QUERY")
  if [ "$(echo "$RESULTS" | wc -l)" -eq 2 ]; then
    echo "$RESULTS" | tail -1 > "$file.graph"
  fi
done
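As a quick sanity check after running the script (a sketch; run it from the dumps directory), you can compare how many dumps received a *.graph file and how many will therefore fall back to the default graph:

# Count the Turtle dumps and the *.graph files produced next to them.
echo "TTL files:   $(find . -type f -name '*.ttl' | wc -l)"
echo "Graph files: $(find . -type f -name '*.ttl.graph' | wc -l)"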
@larjohn The datasets (all *.ttl files in the dumps folder + the skosified folder + the auxiliary-data folder) have hopefully been uploaded to virtuoso_staging. In 30 minutes they should also be transferred to virtuoso_production. The SPARQL endpoint of virtuoso_production is exposed for reading at: http://eis-openbudgets.iais.fraunhofer.de/virtuoso/sparql. Hope this is working.
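For example, the loaded datasets can be listed through that endpoint with a plain HTTP request; this is just a sketch, and CSV output is only one convenient choice:

# List the qb:DataSet instances visible through the public endpoint.
curl -G 'http://eis-openbudgets.iais.fraunhofer.de/virtuoso/sparql' \
     --data-urlencode 'query=SELECT DISTINCT ?dataset WHERE { GRAPH ?g { ?dataset a <http://purl.org/linked-data/cube#DataSet> } }' \
     -H 'Accept: text/csv'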
The OBEU model triples are also needed. I uploaded them to the staging triple store as well, but I can only see three datasets. Why is that?
@larjohn Sorry, a command was missing on my side for the actual execution of the import. It should now be imported.
The following query is returning too many tuples. Why is that?
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xro: <http://purl.org/xro/ns#>
SELECT ?dataset ?attribute ?type ?dsd ?component WHERE {
?type rdfs:subClassOf qb:ComponentProperty .
?dsd qb:component ?component .
?attribute a ?type .
?dataset a qb:DataSet ;
qb:structure ?dsd .
?component ?componentProperty ?attribute .
}
Adding DISTINCT also times out the triple store.
What do you mean by "too many" results? Do you mean that there are duplicate solutions for this query?
Yes, there are duplicates, and their number does not appear to be consistent; I mean there are more or fewer duplicates for each solution.
@larjohn I don't know why the duplicates are showing up, but at least running the SPARQL SELECT with DISTINCT has worked for me against the endpoint: http://eis-openbudgets.iais.fraunhofer.de/virtuoso/sparql
It now runs faster; I will check and report back.
@larjohn Every hour a cron job runs which copies everything from staging to production by replacing the DB files. During that process a timeout can happen.
The "duplicates" have several causes.
First, a component property may be used in multiple qb:ComponentSpecification instances in multiple datasets. By the way, I don't know what use including the component specifications in the results has, because they are typically blank nodes.
Second, there are component properties that are instances of more than one subclass of qb:ComponentProperty. See:
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?attribute (GROUP_CONCAT(DISTINCT ?type; separator = ", ") AS ?types)
WHERE {
[] qb:component [ ?componentProperty ?attribute ] .
?attribute a ?type .
?type rdfs:subClassOf qb:ComponentProperty .
}
GROUP BY ?attribute
HAVING (COUNT(DISTINCT ?type) > 1)
For each additional class a component property instantiates, you will get one more result for it (for example, a dimension typed as both qb:DimensionProperty and qb:CodedProperty matches ?type twice).
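If it helps, the diagnostic query above can also be run against the public endpoint from the shell; a sketch, where duplicate-types.rq is a hypothetical file holding the query:

# Send the saved query to the production SPARQL endpoint and get CSV back.
curl -G 'http://eis-openbudgets.iais.fraunhofer.de/virtuoso/sparql' \
     --data-urlencode "query=$(cat duplicate-types.rq)" \
     -H 'Accept: text/csv'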
@badmotor I'm closing this. It must be out of date.