Defintion of "dataset" - Githubissues

jneubert / skos-history

Ontology, processing practices and supporting code for change tracking of SKOS vocabularies

38 stars 9 forks source link

Defintion of "dataset" #5

Closed jneubert closed 10 years ago

jneubert commented 11 years ago

I suppose that we use the term "dataset" with different meanings. To avoid talking on cross purposes, we should review what relevant definition are used in our field, and decide which one to prefer.

jneubert commented 11 years ago

VoiD http://www.w3.org/TR/2011/NOTE-void-20110303/#dataset

The fundamental concept of VoID is the dataset. A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider. Unlike RDF graphs, which are purely mathematical constructs [RDF-CONCEPTS], the term dataset has a social dimension: we think of a dataset as a meaningful collection of triples, that deal with a certain topic, originate from a certain source or process, are hosted on a certain server, or are aggregated by a certain custodian. Also, typically a dataset is accessible on the Web, for example through resolvable HTTP URIs or through a SPARQL endpoint, and it contains sufficiently many triples that there is benefit in providing a concise summary.

Since most datasets describe a well-defined set of entities, datasets can also be seen as a set of descriptions of certain entities, which often share a common URI prefix (such as http://dbpedia.org/resource/).

In VoID, a dataset is modelled as an instance of the void:Dataset class. Such a void:Dataset instance is a single RDF resource that represents the entire dataset, and thus allows us to easily make statements about the entire dataset and all its triples.

The relationship between a void:Dataset instance and the concrete triples contained in the dataset is established through access information, such as the address of a SPARQL endpoint where the triples can be accessed.

The following example declares the resource :DBpedia as a void:Dataset:

:DBpedia a void:Dataset .

The resource is intended as a proxy for the well-known DBpedia dataset [DBPEDIA]. A good next step would be to make this unambiguously clear by adding general metadata and access metadata to the resource.

jneubert commented 11 years ago

RDF 1.1 Concepts and Abstract Syntax http://www.w3.org/TR/2013/WD-rdf11-concepts-20130723/#section-dataset

An RDF dataset is a collection of RDF graphs, and comprises:

Exactly one default graph, being an RDF graph. The default graph does not have a name and MAY be empty.
Zero or more named graphs. Each named graph is a pair consisting of an IRI or a blank node (the graph name), and an RDF graph. Graph names are unique within an RDF dataset.

jneubert commented 11 years ago

Has recently been discussed by Leigh Dodds in http://blog.ldodds.com/2013/02/09/what-is-a-dataset/, citing additional sources, and concluding: "While there’s a common core to these definitions, different communities do have slightly different outlooks that are likely to affect how they expect to publish, describe and share data on the web."

jneubert commented 11 years ago

Relevant in the broader context: http://blog.ldodds.com/2013/02/04/dataset-and-api-discovery-in-linked-data/

JohanDS commented 11 years ago

(RDF) Data Set is defined or described in RDF and VoID. DCAT gois beyond RDF (an XML or excel may also carry a data set). But there is some (deliberate I think) vagueness. A SPARQL end-point can be seen as a data set. A downloaded RDF can be seen as a data set.

A pragmatic point of view may be what an RDF publisher maintains to be a consistent set of RDF triples or quads (graph).

Any SPARQL end-point is one default graph with all triples and a set of graphs each having a subset of those triples.
VoID has some particular description function of what data. O the SKOS mail list I recently proposed a VoID LinkSet (a special data-set) describing a mapping between different taxonomy data sets (e.g. EUROVOC and AGROVOC).
DCAT a data set is anything that for which a record is kept in a register, it can be downloaded as such in different media/format distributions (one of which could be RDF in turtle, n3, xml, ...)

Main point for our purpose I think:

We agree how a thesaurus and a thesaurus version (or variant) is published.
We specify how we publish the relationships between thesaurus versions (or variants)
We specify how we document the difference between thesaurus versions complying with the use cases we are detailing.

jneubert commented 11 years ago

We have a broad common understanding, I think. Especially I share your view that we should not try to re-invent things which VoID (or DCAT, of which I don't know much) have already done well. For STW, we already work with a void.ttl file. Like you, we found linksets great for the description of mappings.

There is only one point re. named graphs (Any SPARQL end-point is one default graph with all triples and a set of graphs each having a subset of those triples.), where we may have a differing understanding: I don't think that the default graph is (or should normally be) the superset of all triples in the named graph. (This behaviour can be implemented, in Jena Fuseki for example by a special switch in the configuration - "tdb:unionDefaultGraph true ;" -, but this is set "false" by default). The default behaviour is that arbitrary different sets of triples can be loaded to the various named graphs und the default graphs. Only triples in the default graph are matched:

An endpoint configured according to the Versions and Deltas as Named Graphs, on

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>

SELECT  ?versionInfo ?issuedDate
WHERE
  { ?scheme a skos:ConceptScheme .
    ?scheme owl:versionInfo ?versionInfo .
    ?scheme dcterms:issued ?issuedDate
  }

returns only information about the current version.

-----------------------------------------------------------------------
| versionInfo | issuedDate                                            |
=======================================================================
| "8.10"      | "2012-03-21"^^<http://www.w3.org/2001/XMLSchema#date> |
-----------------------------------------------------------------------

It requires explicit addressing the named graphs in order to get information about all versions:

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>

# list all available scheme versions

SELECT  ?versionInfo ?issuedDate
WHERE
  { ?scheme a skos:ConceptScheme .
    ?scheme dcterms:hasVersion ?version
    GRAPH ?version
      { ?scheme owl:versionInfo ?versionInfo .
        ?scheme dcterms:issued ?issuedDate
      }
  }
ORDER BY ?version

-----------------------------------------------------------------------
| versionInfo | issuedDate                                            |
=======================================================================
| "8.04"      | "2009-02-16"                                          |
| "8.06"      | "2010-04-22"^^<http://www.w3.org/2001/XMLSchema#date> |
| "8.08"      | "2011-06-30"^^<http://www.w3.org/2001/XMLSchema#date> |
| "8.10"      | "2012-03-21"^^<http://www.w3.org/2001/XMLSchema#date> |
-----------------------------------------------------------------------

So the default configuration works for anybody who just want to get the information about the current version. In my opinion, nobody who is aware of the multiple graph structure (and able to ask such queries) will miss the fact that she has to take some care to get the required results.

JohanDS commented 11 years ago

Sesame and Virtuoso work differently. All triples loaded are in the unnamed default graph. I did not find a specification on this so I think it is RDF store implementation specific.

The discussion on default graph though has no impact on the data-set vs graph argument to specify versioning.

Suppose each released version is a data-set, and we want to load different data-sets into a Jena or SPARQL system, then we are entering the discussion related to implementation. As "default" graph implementations are different in different stores, it will need to be implemented differently. But considering that in TDB, the default graph approach (detailed above) is a named graph (like other graphs), a general description may be attempted.

So how do we manage in Sesame, TDB, Virtuoso, .... the versions (assuming there is only one "current" version - in case there would be variants I imagine those are anyhow managed by named graph)?

Loading a new release:

load all triples of the new release in a version specific named graph

Making a specific version Vx graph the current release

remove all triples from the 'current' graph
copy all triples from the specific version Vx graph to the current release This operation typically can be done by defining the "current" graph as the union of 1 (version specific) named graph.

My conclusion:

The data-set approach is not inhibiting any graph related implementation
A graph related approach will require specific to RDF store and SPARQL implementations
VoID and DCAT typically specify data set as published packages of triples. Graphs fall in another (more conceptual or logic) category

Decision needed to close the issue:

Work on data-set versions or work on graph versions to specify thesaurus version releases.
Considering the above I propose we work on data set. Noting that loading data sets can be easily translated into graph management.

jneubert commented 11 years ago

Well, I completely agree that we should avoid dependencies on a particular technical implication. However, I think that Fuseki (and its underlying TDB database) are implementing the RDF 1.1 dataset/default/named graph approach (as cited above) and SPARQL 1.1 set of standards correctly. The syntax for loading and querying named graphs is quite well standardized. Implementations may differ, though.

Personally, I've no deeper experience with Sesame and Virtuoso. However, a blog post of Bob DuCharme postulates that named graphs are supported by Sesame and Virtuoso. ~~He especially demonstrates that for both triples from different named graphs are not included by default.~~ (edited, see below) (Sesame claims 99.9 % compliance with SPARQL 1.1 Query Language and full compliance with 1.1 protocal and http graph store protocol, as of rel 2.7. For Virtuoso I could not find such information.) Perhaps, if I find the time in the next days, I should try to set up a Sesame store and end point. I've tried to build the example data loading procedure and the queries in an implementation-agnostic way. If they work for Sesame, that could improve trust into the approach.

More generally, as "data-set approach is not inhibiting any graph related implementation", that's true the other way arround, too. Of couse version datasets should be published as bulk downloads (that's the most basic way everybody can easily agree upon). Bulk downloads are maschine-readable, but they are not immeditatly machine-actionable. It would require a lot of custom work by some interested party to come to the point were queries like the example queries can be asked. So, I think if we are able to propose a standardized, not application-dependent way to build and publish such a machine-actionable RDF dataset, it would be valueable.

JohanDS commented 11 years ago

I think either you miss-read Bobs blog (cf. your posting “He especially demonstrates that for both triples from different named graphs are not included by default.”) either I am miss-understanding you. However I think the quote I just made from your posting is wrong.

From Bob’s blog:

1) He loads three named graphs

2) He makes the following select from the default graph and gets results (as stated by Bob) from the three named graphs (quoting from the blog) “ The following test query retrieved the titles from all three graphs, because it has no qualifications about which graphs to retrieve from:

PREFIX dc:http://purl.org/dc/elements/1.1/ select ?title WHERE {?s dc:title ?title} “

My Conclusion: the triples in all named graphs are as well in the SPARQL default graph. (This is also my practice with Sesame and with Virtuoso.)

The whole meaning changes obviously if any combination of FROM, FROM NAMED and GRAPH is used in the SPARQL.

Comments on that blog by Andy Seaborne (March 31, 2009 7:18 AM) and the reply by Lee Feigenbaum (on March 30, 2009 9:57 AM) confirm our discussion and your Jena/TDB experience that different stores may have different implementations about “Default” graph behavior.

LF: ” The SPARQL specification has no concept of "all the graph names". In the absence of any explicitly defined dataset, the query is run against a dataset that is chosen by your implementation (your SPARQL engine).

For some implementations, this means that the query is run against all the graphs that the engine knows about. For other implementations, this means that the query is run against an empty dataset. For still others, an engine may be hardwired to query specific graphs in the absence of an explicitly given dataset. “

As we agree on doing the specification based on dataset, I would like to postpone further discussion about graphs.

jneubert commented 11 years ago

I indeed mis-read the Bob's blog post - sorry for the mess. A mail by Jeen Broekstra confirms the behaviour you described: In Sesame, the default graph always consists of the entire repository.

There are long-lasting discussions on the Sesame mailing list, however, to make this behaviour configurable (http://www.openrdf.org/issues/browse/SES-428, http://www.openrdf.org/issues/browse/SES-849). I've to investigate further how the workarround given in the latter issue works in practice.

jneubert commented 11 years ago

After setting up and playing with a Sesame 2.7.7 endpoint, I found no way to Make default dataset configurable. Even if I somehow could address a Sesame "null context" via URL, this would not hold as a "strong default" to the current version. In an unrestricted query, I'd still get a mix of all versions. Re. Virtuoso, too, I couldn't find any hint on the web that the practical experience described above somehow can be overriden by configuration.

So, adverse to my initial assumption, in the general case the default graph of a RDF dataset in a SPARQL endpoint containing multiple version graphs can't be taken as a reliable source for just the current version. In fact, unrestricted queries will give false answers (e.g., multiple skos:prefLabels for a concept in a given language). Unfortunately, this it also true for the endpoint setup suggested above, with a "current" named graph, and no data loaded explicitely to the default graph.

This state of affairs means a "default" endpoint for a SKOS dataset (in general) can only contain the current version. To take advantage of a setup for querying multiple versions, a separate endpoint had to be defined as suggested above, which had to be asked with caution.

However, the advantage of having a multiple versions dataset online and machine-actionable in my eyes is worth the effort of setting up such a "versions as named graphs" configuration. As the differences in the SPARQL 1.1 implementations are larger than I naively supposed, I'll further investigate (for Sesame to start with). If this succeeds, I'd suggest trying to extend the specification to include a description of endpoints and named graphs in some general way. (I'm aware that this is quite new territory - the RDF Working Group is working on two notes on named graphs and datasets to extend "minimalist design", as Sandro Hawke puts it, in the RDF 1.1 Last Call spec).

As figuring out will take time, I'd agree on postponing discussion about graphs here. In order to avoid misunderstandings triggered by terminology, I'd suggest to use the term "version dataset" when refering to a dataset containing a single version of a SKOS, and "versions RDF dataset" when referring to a RDF dataset containing multiple version graphs.

Could this be a way to proceed?

JohanDS commented 11 years ago

This seems a pragmatic and prudent approach - I favor the "version dataset" and the "versions RDF dataset".

jneubert commented 10 years ago

Recently, RDF 1.1: On Semantics of RDF Datasets (Draft for a W3C Working Group Note) was published, which is based on a one-and-a-half years discussion in the RDF 1.1 WG. It details 8 different options for the meaning of RDF datasets, tries a formalization for each of it and a characterization of their properties.

The RDF Working Group did not define a formal semantics for a multiple graph data model because none of the semantics presented before could obtained consensus. Choosing one or another of the propositions before would have gone against some deployed implementations.

So for the time beeing we should keep with the rough consensus outlined above. I'll close the issue for now - please feel free to re-open.

JohanDS commented 10 years ago

The RDF Semantics 1.1 is beyond what I intended for dataset as it tries to define internal semantics. My intent was to use it as the unit of what is published and made accessible (directly or via download) as detailed by DCAT and VoID ontologies.

I agree there is no need to keep this open.