w3c / rdf-star

RDF-star specification
https://w3c.github.io/rdf-star/

Support Wikidata/Wikibase data model #36

Closed akuckartz closed 3 years ago

akuckartz commented 3 years ago

As a member of the Wikidata community I would like to see triple stores support the Wikidata/Wikibase data model as much as possible, so that provenance information etc. can be represented in a way which is pleasing to the mind and to software systems such as Wikibase.

See:

pfps commented 3 years ago

I note that the first use case in https://w3c.github.io/rdf-star/tests/semantics/manifest.html appears to make RDF* incompatible with Wikidata.

pchampin commented 3 years ago

https://www.mediawiki.org/wiki/Wikibase/DataModel#Overview_of_the_data_model says

A Statement consists of two parts: a claim that something is the case (e.g., the claim "Berlin has a population of 3,499,879") and a list of references for that claim (e.g., a publication by the statistical office for Berlin-Brandenburg).

The way I see it, Wikidata can be modelled by using RDF* triples for representing statements, as in:

<< :Berlin :population 3499879 >> :references ( :ref1 :ref2 :ref3 ).

or for representing claims, as in:

<< :Berlin :population 3499879 >> :claimOf [ a :Statement; :references ( :ref1 :ref2 :ref3 ) ].

I know that claims in Wikidata can be more complex than a single triple, so I think the 2nd option may be better (as the subject of :claimOf could be something else than a single triple).

But I don't see any incompatibility here...

pfps commented 3 years ago

The incompatibility is that RDF* requires that embedded triples be unique, if the first test case is required. The Wikidata data model allows multiple snaks on the same entity with the same predicate and value.
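
A minimal sketch of the clash, in Python rather than Turtle and with made-up names (`Alice`, `Bob`, the qualifier keys): if annotations are attached directly to the (subject, predicate, object) triple, two Wikidata-style statements that share the same triple end up on a single subject, and their qualifiers can no longer be told apart. This assumes a naive mapping and is not how Wikidata or any RDF* engine stores data.

```python
from collections import defaultdict

# Two hypothetical Wikidata-style statements with the same subject,
# predicate and value, distinguished only by their qualifiers.
statements = [
    {"triple": ("Alice", "spouse", "Bob"), "start_time": 1929, "end_time": 1939},
    {"triple": ("Alice", "spouse", "Bob"), "start_time": 1940, "end_time": 1954},
]

# Naive mapping: the embedded triple itself is the subject of every annotation.
annotations = defaultdict(set)
for statement in statements:
    for qualifier in ("start_time", "end_time"):
        annotations[statement["triple"]].add((qualifier, statement[qualifier]))

# Both statements collapse onto one subject carrying four loose qualifiers.
distinct_subjects = len(annotations)
merged = annotations[("Alice", "spouse", "Bob")]
print(distinct_subjects, sorted(merged))
```

After the merge there is no record of which start time belongs with which end time, which is the loss described above.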

pchampin commented 3 years ago

Nitpicking: since the Wikidata model has an RDF representation, and since RDF* is a superset of RDF, claiming that RDF* is incompatible with Wikidata seems inaccurate.

That being said, it is possible, of course, that the approaches proposed above do not adequately capture the Wikidata model, but I still fail to see how... Following my 2nd proposal, multiple statements about the same subject-predicate-value can still be modelled like this:

<< :FridaKhalo :spouse :DiegoRivera >> :claimOf [
    a :Statement ;
    :start_time 1929;
    :end_time 1939;
], [
    a :Statement ;
    :start_time 1940;
    :end_time 1954;
].
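
The same idea can be sketched in Python, purely as an illustration: the embedded triple appears once, and each statement is a separate node with its own identity (the labels `_:st1` and `_:st2` stand in for the blank nodes and are assumptions, as is the `periods_for` helper).

```python
# The claim is a single value; it is the key that statements hang off.
claim = ("FridaKhalo", "spouse", "DiegoRivera")

# Each statement node has its own identity, so qualifiers stay grouped.
statements = {
    "_:st1": {"start_time": 1929, "end_time": 1939},
    "_:st2": {"start_time": 1940, "end_time": 1954},
}

# :claimOf links the one embedded triple to any number of statement nodes.
claim_of = {claim: ["_:st1", "_:st2"]}


def periods_for(triple):
    """Return the (start, end) pairs of all statements making this claim."""
    nodes = claim_of.get(triple, [])
    return sorted((statements[n]["start_time"], statements[n]["end_time"]) for n in nodes)


periods = periods_for(claim)
print(periods)
```

Because the qualifiers live on the statement nodes rather than on the triple, the two periods remain distinguishable even though the triple occurs only once.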

Whether this is "pleasing to the mind and (...) Wikibase" is debatable, of course...

pfps commented 3 years ago

If you are going to argue from RDF* being a superset of RDF then there is no point in using RDF* at all - just use RDF.

I take the request to be about having embedded triples in RDF* be Wikidata statements, and this doesn't work if there is only one embedded triple with a given subject, predicate, and object, as Wikidata allows multiple statements with the same subject, predicate, and object.

pchampin commented 3 years ago

I take the request a little more broadly than "having embedded triples in RDF* be Wikidata statements" (which indeed would not work). I read it more as "can we do better with RDF* than with RDF?".

akuckartz commented 3 years ago

I read it more as "can we do better with RDF* than with RDF?".

Yes, that was and is the intended meaning.

rat10 commented 3 years ago

I read it more as "can we do better with RDF* than with RDF?".

Yes, that was and is the intended meaning.

There is one thing that is special about Wikidata and that is, as @pfps notes, "The Wikidata data model allows multiple snaks on the same entity with the same predicate and value." If this is not your concern then I guess we could replace "Wikidata" with just about any project out there that uses semantic web technology, couldn't we? Well, it is a problem that also occurs outside of Wikidata, but 'Wikidata' is a good keyword for it. Don't let @pchampin talk you out of it ;-) The set-based semantics of RDF make it quite hard to "naturally" represent multiple tokens of the same type in one and the same graph. It's possible but one has to jump through one hoop or another:

Renaming is always a very annoying technique as it works against the advantages of using a shared vocabulary, so I'll dismiss it right away without much ado (until later below).

Graphs are a powerful mechanism but as used in RDF are rather a one-trick pony. Should they be used for this rather mundane use case they would not be available for more mainstream uses - unless there was an approach to nest graphs, but there isn't - so this approach can be ruled out as well.

Modelling single annotations as regular RDF triples but multiple annotations in a nested structure has two negative effects. First, if a second annotation is added later, the first one has to be rewritten to allow both to coexist. Second, querying has to take both modelling variations into account, as one is probably not aware beforehand of the exact structure of the data. One might even miss multiple annotations if one only queries for the simple case. One could of course also go the verbose route and model for multiple annotations all the time, even if there is only one. In any case, sooner or later this technique is verbose, annoying and error-prone.

With RDF standard reification each reification has its own name, and multiple reifications of the same triple have multiple names. Now let's assume RDF* is defined as syntactic sugar on top of RDF standard reification (syntactic sugar in the sense that its syntax is RDF* but the semantics are taken from RDF reification). To put the naming feature of reification into effect, the embedded triple would have to be amended by an identifier, like '<< :a :b :c :# >>'. The identifier would of course be optional and only be defined when multiple annotations do indeed occur. Queries could specify it to narrow down result sets but wouldn't need to do so to see all results. RDF*/SPARQL* engines would have to parse each embedded triple for a potential identifier attribute and render results accordingly.
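
A rough Python sketch of how such an optional identifier might behave at query time. Everything here is hypothetical: the `EmbeddedTriple` type, the `ident` field, the `#1`/`#2` labels and the `matches`/`query` helpers are invented for illustration and correspond to no existing RDF* or SPARQL* feature.

```python
from typing import NamedTuple, Optional


class EmbeddedTriple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    ident: Optional[str] = None  # hypothetical optional identifier


def matches(pattern: EmbeddedTriple, candidate: EmbeddedTriple) -> bool:
    """A pattern without an identifier matches every occurrence of the triple."""
    same_spo = pattern[:3] == candidate[:3]
    return same_spo and (pattern.ident is None or pattern.ident == candidate.ident)


# Two annotated occurrences of the same triple, told apart by identifier.
annotations = [
    (EmbeddedTriple(":a", ":b", ":c", "#1"), ":start_time", 1929),
    (EmbeddedTriple(":a", ":b", ":c", "#2"), ":start_time", 1940),
]


def query(pattern: EmbeddedTriple):
    return [(prop, value) for triple, prop, value in annotations if matches(pattern, triple)]


all_results = query(EmbeddedTriple(":a", ":b", ":c"))     # identifier omitted
narrowed = query(EmbeddedTriple(":a", ":b", ":c", "#2"))  # identifier given
print(all_results, narrowed)
```

Omitting the identifier returns every annotation of the triple, while supplying it narrows the result to one occurrence, which is the query behaviour described above.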

This use case is per se orthogonal to the distinction between provenance-driven reification and Property Graph-style "graded" modelling of primary relations with secondary attributes. So it's not "naturally" a fit for reification. If however for one reason or another we decide that RDF* is syntactic sugar for RDF reification then this could be a nice side effect.

TallTed commented 3 years ago

I am seeing something here, which has not been so blatant to me before, and which troubles me greatly about the RDF* proposal. That is --

RDF* requires that embedded triples be unique

Why is this required? And how can it be enforced?

Just as the same triple may be uttered in many RDF graphs by many utterers, I can see endless ways that the same embedded triple may be uttered in many RDF* graphs by many utterers, all without any communication between utterers.

This uniqueness requirement simply boggles my mind.

<< :Joanie :loves :Chachi >>
   a          rdf:triple ;
   :utteredBy :RalphMalph , 
              :PotsieWebber , 
              :RichieCunningham .

-- is OK, but --

<< :Joanie :loves :Chachi >>
   a          rdf:triple ;
   :utteredBy :RalphMalph .
<< :Joanie :loves :Chachi >>
   a          rdf:triple ;
   :utteredBy :PotsieWebber .
<< :Joanie :loves :Chachi >>
   a          rdf:triple ;
   :utteredBy :RichieCunningham .

is not‽

If this is really a requirement/dictate of RDF*, then the project seems doomed from the outset.

(This also breaks from the base proposition that RDF* is all about annotating triples. Sure, it'd be nice to add all annotations in one place at one time, but in the real world, annotations are going to become apparent at different times, and someone is going to need to add annotations of a triple in January, and April, and November -- and it makes no sense to require that they delete the full package of whatever was present in January in order to re-insert that with the addition of April's notes ... and repeat that in November.)

pchampin commented 3 years ago

This uniqueness requirement simply boggles my mind.

To rephrase it (in a hopefully less mind-boggling way): this piece of Turtle <<:a :b :c>> denotes the same thing everywhere it occurs. But it can of course occur in multiple Turtle files, or even multiple times in the same file.

At least, this is how RDF* was defined in the original papers.

pfps commented 3 years ago

@TallTed I think you may have this the wrong way around. It's not that an embedded triple can only occur in one place (in a document, in the web, in the universe). It's that the embedded triple is the same everywhere. It's just like the situation for literals. "4"^^xsd:int can occur in lots of places, but each place that it occurs it means the integer 4 (assuming that xsd:int is a recognized datatype everywhere).
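
To make the analogy concrete, here is a small Python sketch that treats a graph as a set of tuples and an embedded triple as a plain value, much like a literal. That modelling is an assumption made for illustration, not a description of any RDF* implementation, and the `embedded` helper is invented.

```python
def embedded(s, p, o):
    """An embedded triple as a plain value: equal wherever it is written."""
    return (s, p, o)


# All three annotations written in one place.
compact = {
    (embedded("Joanie", "loves", "Chachi"), "utteredBy", utterer)
    for utterer in ("RalphMalph", "PotsieWebber", "RichieCunningham")
}

# The same annotations written separately, e.g. at different times.
january = {(embedded("Joanie", "loves", "Chachi"), "utteredBy", "RalphMalph")}
april = {(embedded("Joanie", "loves", "Chachi"), "utteredBy", "PotsieWebber")}
november = {(embedded("Joanie", "loves", "Chachi"), "utteredBy", "RichieCunningham")}

# Merging is a plain set union; nothing has to be deleted or rewritten.
merged = january | april | november
same_graph = merged == compact
print(same_graph, len(merged))
```

Under this reading, writing the embedded triple three times and writing it once with three objects yield the same set of annotations, and later additions simply extend the set.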

TallTed commented 3 years ago

@pfps @pchampin

Ah! Thank you.

I think that saying that the << ... >> lexical "must be unique" is backwards, and is going to cause confusion for many more people than just me.

I think it would be better to say that, similar to URI/URL/URN/IRIs, each << ... >> (and now {| ... |}) lexical must map to one meaning ("n-to-one", where "n" might be one, and might be many ... see below).

I'm wondering whether multiple lexicals may map to the same meaning ("many-to-one", a/k/a "coreference", again akin to URI/URL/URN/IRIs). If so, I think this should also be clearly stated.

kidehen commented 2 years ago

It's just like the situation for literals. "4"^^xsd:int can occur in lots of places

True, but you can't use literals (typed or untyped) as relation subjects in RDF -- as it is currently specified.

pfps commented 1 year ago

@akuckartz The RDF-star working group is taking the use cases from the community group and expanding them to provide guidance for the development of RDF 1.2. (You probably know all of this already, but I'm including it for the record.)

Are you interested in interacting with the working group (probably mostly me) in expanding this use case?

akuckartz commented 1 year ago

@pfps Simple answer: yes.

pfps commented 1 year ago

@akuckartz Great. The next step is to create an issue in the RDF-star WG UCR repository at https://github.com/w3c/rdf-ucr/issues that will serve as the base point of a discussion on just what will go into the use case. You can do this yourself or email me at pfpschneider@gmail.com and we can discuss how to get the process started.

One thing that would be very useful to know is just how much of the Wikidata model you want to support.

pfps commented 1 year ago

@akuckartz Do you want to create the issue or should I and then you can comment on it?

akuckartz commented 1 year ago

@pfps I cannot promise to create a new substantial issue soon (one which would not simply be a copy of the current one). So it would be great if you could do that.