Add material in the spec about the triple-occurrence distinction

pchampin commented 3 years ago

Aims to address issue #64 by

fixing misleading examples (prov:wasDerivedFrom)
adding an informative section on the type-token distinction, and providing guidance on how to model occurrences with RDF*
adding an apologetic appendix on the controversial seminal example (with @hartig 's permission :wink:)

pfps commented 3 years ago

I am against using type-token for this distinction. Consider URIs. Are ex:a and ex:a two different tokens for the same type? No! <:a :b :c> and <:a :b :c> are not either nor are "1"^^xsd:int and "1"^^xsd:int. The RDF* documents should use wording that makes sense if used for IRIs or RDF literals.

pchampin commented 3 years ago

Consider URIs. Are ex:a and ex:a two different tokens for the same type?

Yes, if I read correctly https://plato.stanford.edu/entries/types-tokens/#WhaDis .

Rose is a rose is a rose is a rose.
In one sense of ‘word’ we may count three different words; in another sense we may count ten different words. C. S. Peirce (1931-58, sec. 4.537) called words in the first sense “types” and words in the second sense “tokens”.

How many IRIs do you count in this?

rdfs:Class rdf:type rdfs:Class;
           rdfs:subClassOf rdfs:Class.

I count 3 IRI types, and 5 IRI tokens.

I don't consider the terms of RDF (IRIs, literals...) and RDF* (+triples) to be different, in that respect, from the terms of the English language.

Of course, we are not talking here about what the terms denote (as in the example in the link above: in "an 8,000 year old bean", does the word "bean" denote a bean type or a bean token?). We are talking about term types and term tokens.
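The count above can be reproduced with a small sketch. It works on the serialized text only, so it counts what is written in the snippet, not term positions in the resulting graph. The regular expression is a simplified stand-in for Turtle's prefixed-name grammar and is an assumption made for illustration.

```python
import re
from collections import Counter

# The Turtle snippet from the comment, as plain text.
snippet = """
rdfs:Class rdf:type rdfs:Class;
           rdfs:subClassOf rdfs:Class.
"""

# Simplified, hypothetical pattern for prefixed names (not the full Turtle grammar).
PNAME = re.compile(r"[A-Za-z][\w-]*:[A-Za-z][\w-]*")

written = PNAME.findall(snippet)  # every prefixed name, in order of appearance
counts = Counter(written)         # how often each distinct name is written

print(len(counts))   # number of distinct prefixed names
print(len(written))  # number of times a prefixed name is written
```

Whether the second number is called a count of "tokens" or of "occurrences" is exactly the terminology question under discussion; the sketch takes no position on it.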

pfps commented 3 years ago

I do not agree.
https://plato.stanford.edu/entries/types-tokens/#WhaItNot

Although the matter is discussed more fully in §8 below, it should be mentioned here at the outset that the type-token distinction is not the same distinction as that between a type and (what logicians call) its occurrences. Unfortunately, tokens are often explained as the “occurrences” of types, but not all occurrences of types are tokens. To see why, consider this time how many words there are in the Gertrude Stein line itself, the line type, not a token copy of it. Again, the correct answer is either three or ten, but this time it cannot be ten word tokens. The line is an abstract type with no unique spatio-temporal location and therefore cannot consist of particulars, of tokens. But as there are only three word types of which it might consist, what then are we counting ten of? The most apt answer is that (following logicians' usage) it is composed of ten occurrences of word types. See §8 below, Occurrences, for more details.

Further, type is used in RDF and ontologies as the relationship between an entity and classes that it belongs to (e.g., rdf:type) so it is better to avoid other possible meanings of type.

peter

PS: I count 3 IRIs (actually 3 CURIEs). If I have to distinguish further, I count 5 occurrences of IRIs (CURIEs). I count zero IRI (or CURIE) types and zero IRI (or CURIE) tokens.


pchampin commented 3 years ago

type is used in RDF and ontologies as the relationship between an entity and classes that it belongs to (e.g., rdf:type) so it is better to avoid other possible meanings of type.

What about peirce:type, then? :wink:

More seriously: I was not advocating to use terms like "IRI type" and "IRI token" in the report -- as I didn't use them in my PR. I was only using Peirce's terminology in this discussion to show how the Type-Token distinction is, in my opinion, appropriate in this situation.

And yes, I also read the part about "what it is not" and occurrences. What I take away from it is: if you are counting IRIs in a graph or a triple, you are dealing with IRI occurrences. But if you are counting them on your screen or a printed page, then you are dealing with IRI tokens.

pfps commented 3 years ago

I would call all occurrences of IRIs IRI occurrences, whether they are in a refresh of a screen (at 60 occurrences a second or more), on a screen, in a vocal utterance, in an email, in an email message, in a document, in a triple, or in a graph. Tokens are used in too many other ways in computer science that could be used when talking about RDF and RDF*, so the "kind" of token needs to be distinguished so using the more generic occurrence ends up being less confusing.

As a case in point, programming languages have tokens (e.g., while) and these tokens have occurrences in code.

pfps commented 3 years ago

The seminal example discussion is misguided. Even if there is only stated-by information, and even if embedded triples are sensitive to syntax, a source states some occurrence of information. The relationship between a source and a syntactic unique embedded triple is something like "stated an occurrence of" or "stated something that is expressed as". Just state that this example is wrong and either stop there or give a good example with the needed indirection.

pchampin commented 3 years ago

The relationship between a source and a syntactic embedded triple is something like "stated an occurrence of" or "stated something that is expressed as"

Fully agreed. And I considered that dct:source's definition was loose enough that one could interpret it like that.

or give a good example with the needed indirection

There is a link to the example from section 2.1

pchampin commented 3 years ago

This was discussed during today's call: https://w3c.github.io/rdf-star/Minutes/2020-12-18.html#item04

rat10 commented 3 years ago

I would call all occurrences of IRIs IRI occurrences, whether they are in a refresh of a screen (at 60 occurrences a second or more), on a screen, in a vocal utterance, in an email, in an email message, in a document, in a triple, or in a graph. Tokens are used in too many other ways in computer science that could be used when talking about RDF and RDF*, so the "kind" of token needs to be distinguished so using the more generic occurrence ends up being less confusing.

As a case in point, programming languages have tokens (e.g., while) and these tokens have occurrences in code.

I just read up on the topic in SEP (https://plato.stanford.edu/entries/types-tokens/). It seems indeed more appropriate to speak of 'types' vs 'occurrences' in our context as we are not concerned with renderings of triples on screens and printouts, but replacing 'token' by 'occurrence' everywhere seems extreme: e.g. we might like to differentiate between different serializations of the same occurrence (say in N-Triples vs Turtle) by calling them 'tokens'. Then there is also the mismatch to the RDF 1.1 Semantics specification which talks about 'types' and 'tokens' but never uses the term 'occurrence', so we should explain our choice of terms.

TallTed commented 3 years ago

@pchampin -- Note that there are currently conflicts that must be resolved before this can be merged, and the resolution of those conflicts may change the ongoing discussion as they will change the reading of the PR's results. (I and most reading this cannot see what the conflicts are, only that they exist ... )

pfps commented 3 years ago

There appear to be three changes in PR #75: a new section on triple occurrences, a new section on the seminal example, and the moving of the original RDF* paper from an informative to a normative reference. I find all three of these changes problematic.

Does the RDF* specification depend on anything in the original RDF* paper? I heard not. So the original RDF* paper should not be normative.

Is the seminal example problematic? Yes, very! So it should be disavowed. The section on it does not and instead should read something like "The seminal example was wrong and should be completely disregarded."

What is an embedded triple? Is it something like a literal or something like an IRI? The new section is firmly on the side of literal. This has consequences. My view is that an embedded triple does not need to be something like a literal.

pchampin commented 3 years ago

@pfps

So the original RDF* paper should not be normative.

Respec put it there, because of a mistake I made: I thought that appendices were automatically marked as "informative", but this is not the case. I marked the corresponding appendix as "informative", and the reference is back in the correct place. Thanks for spotting this.

For the other issues, I think it better to discuss them during the call.

rat10 commented 3 years ago

I think section 2.1 is a step in the right direction and the main load of appendix A.2 should be incorporated in section 2.1. Like:

explain the problem and a possible workaround in 2.1 with the seminal example from Appendix A.2.
reduce the appendix to a mere mention of the historic development and a clarification that RDF* is now not the same as it was then, and until recently.

It’s more important that people get it right now and in the future and that they are warned about a subtle but important change than that they understand what the historic background is.

Appendix A.2 doesn’t explain what motivated the change. The explanations I could come up with are all not very flattering (like "at all cost avoid the semantic muddle that named graphs represent" or "we are so enamored with the Superman problem" or "representing unasserted assertions suddenly seemed so tremendously important"). Of course if someone could put it a little more positively it would be good to add such an explanation about the motivation for the change as well.

pfps commented 3 years ago

@rat10 Which change to RDF* affects the seminal example?

rat10 commented 3 years ago

@rat10 Which change to RDF* affects the seminal example?

There was a discrepancy between the seminal example referring to an occurrence and RDF* not saying how it addresses that occurrence. For a long time I considered this an oversight, sloppy engineering and/or something that people need to be educated about. The change was when a few weeks ago it became clear that RDF* will be specified in a way that makes the seminal example wrong, a regrettable mistake, whatever. Technically this is not a change in RDF* but just in the examples. Practically it is a change, and very much so IMO, which is why I find it important that section 2.1 addresses the problem rather comprehensively.

pfps commented 3 years ago

Here is something that is closer to what is needed for A.2. I would prefer something stronger and shorter but if there is a desperate need for a longer section on the seminal example, this wording at least lays out the situation more clearly.

A.2 The seminal example

The motivating example in the original RDF* paper [ RDF-STAR-FOUNDATION ] was on a provenance use-case, and is repeated below.

Example 19

:bob foaf:name "Bob" .
<<:bob foaf:age 23>> dct:creator <http://example.com/crawlers#c1> ;
                     dct:source <http://example.net/listing.html> .

This example is incorrect because there is a need to have multiple creators and related sources for the same embedded triple. There is only one entity for an embedded triple in RDF*, so the source corresponding to a creator cannot be distinguished if provenance is represented in this manner. Because of this, the example has given rise to significant confusion.

To rescue the example requires an intermediate entity to represent the stating of a triple, as in

<<:bob foaf:age 23>> ex:stating [ dct:creator <http://example.com/crawlers#c1> ;
                                  dct:source <http://example.net/listing.html> ] .

This corrected example shows that embedded triples can require more complex solutions than using RDF reification directly.
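To make the pairing problem concrete, here is a minimal Python sketch, not a description of any RDF* implementation. Triples are modelled as tuples, so two embedded triples with the same three terms are the same value. The second crawler (c2), the second page (other.html), and the node labels _:s0 and _:s1 are hypothetical additions for illustration.

```python
# Terms are plain strings; a triple is a tuple, so equal triples are one value.
embedded = (":bob", "foaf:age", "23")

# Hypothetical data: two statings of the same triple, each with its own source.
statings = [
    ("http://example.com/crawlers#c1", "http://example.net/listing.html"),
    ("http://example.com/crawlers#c2", "http://example.net/other.html"),
]

# Direct modelling: the embedded triple itself is the subject of the metadata.
direct = set()
for creator, source in statings:
    direct.add((embedded, "dct:creator", creator))
    direct.add((embedded, "dct:source", source))

# Indirect modelling: one intermediate node per stating (labels are made up).
indirect = set()
for i, (creator, source) in enumerate(statings):
    node = f"_:s{i}"
    indirect.add((embedded, "ex:stating", node))
    indirect.add((node, "dct:creator", creator))
    indirect.add((node, "dct:source", source))

def pairs(graph):
    """Creator/source pairs that share a subject in the graph."""
    creators = {(s, o) for s, p, o in graph if p == "dct:creator"}
    sources = {(s, o) for s, p, o in graph if p == "dct:source"}
    return {(c, src) for s1, c in creators for s2, src in sources if s1 == s2}

print(len(pairs(direct)))    # every creator is paired with every source
print(len(pairs(indirect)))  # only the pairs that were actually stated
```

With direct modelling, all four metadata triples share one subject, so nothing in the graph says which source belongs to which creator. The intermediate nodes keep the two statings apart.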

pfps commented 3 years ago

@rat10 I agree that the seminal example only works if there are multiple entities for an embedded triple, but the semantics in the document had a single entity for an embedded triple and the newer semantics have also worked this way. So no change to the meaning of RDF*, just an incorrect and misleading example. (Well, using "just" here is really downplaying the seriousness of the situation. Misleading examples can do vast amounts of damage.)

TallTed commented 3 years ago

I remain concerned that the merge conflicts on this PR may be concealing text that ought to be highlighted during our review and consideration of this PR. Please, can those be addressed soon?

pchampin commented 3 years ago

This was discussed during today's call https://w3c.github.io/rdf-star/Minutes/2021-01-15.html#item02

pchampin commented 3 years ago

@TallTed conflict resolved

hartig commented 3 years ago

Thanks for the input @pfps and @rat10, and apologies for not participating in yesterday's call (it was my daughter's birthday and I couldn't sneak out of the activities).

Some responses to what you write above and what was mentioned in some emails on the list:

I concur with Peter's comment that there is no change from the original paper to our current draft in terms of the meaning of RDF* and, instead, it is only the original example that has been wrong and misleading. I appreciated your continued input to make sure we get the examples in the draft right.

I realize now that there are actually two separate mistakes that I made when writing this example.

One of them is related to the distinction between the notion of an RDF triple as a single entity (whose identity is defined entirely by the three RDF terms it consists of) and the notion of an occurrence of such a triple. I simply didn't make this distinction. I think that the new Section 2.1 in the current version of this PR here does a good job raising the readers' awareness of the need to make this distinction.

As a side note related to this mistake, I actually don't think that anywhere in the original paper the examples explicitly refer to occurrences of triples; I mean, nowhere in the text of the paper do I say that the example is about providing metadata for a specific occurrence of the triple about Bob's age. However, I can see now that readers may interpret the data in the example in this way, and that's one of the reasons the example is badly chosen as I understand now (and another reason is that I should have made the distinction as mentioned above).

The other mistake I made in the example is that the modeling of the metadata is insufficient. Instead of using the embedded triple about Bob's age directly as the subject of the two metadata triples, I should have introduced a separate entity that captures the creation of this triple (or, more precisely, as I know now, the creation of an occurrence of the triple) and then represent the information about this creation (i.e., creator and source) as triples with this separate entity as their subject.

A side note related to this mistake: I realize now that the given metadata triple with predicate dct:source may be interpreted to state where the embedded triple occurs. The current version of Section A.2 in this PR seems to say that this interpretation may have been implied by the example. However, that was not the intention of having this metadata triple in the example. In contrast, the intention was that this metadata triple specifies the HTML file that was the source used for the creation by the crawler.

Now, to address these mistakes, the RDF graph of the example can be changed as follows (written in Turtle, prefix declarations omitted).

_:a  :occurrenceOf  <<:bob foaf:age 23>> .
_:a  :creation  _:b .
_:b  dct:creator  <http://example.com/crawlers#c1> .
_:b  dct:source  <http://example.net/listing.html> .

Notice that the blank node labeled as _:a represents the occurrence of the triple about Bob's age, and the blank node labeled as _:b represents the entity that captures the creation of that triple occurrence.

As a final remark here, after these fixes, I would say that this example is not suitable anymore as the main first example to present the basic idea of RDF*. On the other hand, I think it is suitable now as an example that demonstrates that RDF* can be used as a building block to capture provenance use cases.

hartig commented 3 years ago

@pfps in your comment above you write:

To rescue the example requires an intermediate entity to represent the stating of a triple, as in

<<:bob foaf:age 23>> ex:stating [ dct:creator <http://example.com/crawlers#c1> ;
                                  dct:source <http://example.net/listing.html> ] .

This corrected example shows that embedded triples can require more complex solutions than using RDF reification directly.

While I am fine with the corrected example (it is a variation of the fix in my previous comment but without capturing the creation as a separate entity), I don't understand your remark about the "more complex solutions."

What you write in your corrected example are essentially three triples and one of them contains an embedded triple. So, it's four triples overall if we count the embedded triple independently. Let's compare this to RDF reification. While I am not entirely sure what you mean by "using RDF reification directly," I assume you mean to represent the example as follows (correct me if I am wrong).

_:x  rdf:type  rdf:Statement .
_:x  rdf:subject  :bob .
_:x  rdf:predicate  foaf:age .
_:x  rdf:object  23 .
_:x  dct:creator <http://example.com/crawlers#c1> .
_:x  dct:source <http://example.net/listing.html> .

In this representation, I am counting six triples.

Can you explain what you mean by "more complex"?
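The triple counts in this comparison can be checked with a short Python sketch. The tuple encoding and the blank node label _:s are assumptions made for illustration; the terms themselves come from the two examples in the thread.

```python
# Terms are plain strings; a triple is a 3-tuple (illustrative encoding only).
EMBEDDED = (":bob", "foaf:age", "23")
CREATOR = "<http://example.com/crawlers#c1>"
SOURCE = "<http://example.net/listing.html>"

# RDF* version with one intermediate node (the ex:stating variant).
rdf_star = [
    (EMBEDDED, "ex:stating", "_:s"),
    ("_:s", "dct:creator", CREATOR),
    ("_:s", "dct:source", SOURCE),
]

# Standard RDF reification of the same information.
reification = [
    ("_:x", "rdf:type", "rdf:Statement"),
    ("_:x", "rdf:subject", ":bob"),
    ("_:x", "rdf:predicate", "foaf:age"),
    ("_:x", "rdf:object", "23"),
    ("_:x", "dct:creator", CREATOR),
    ("_:x", "dct:source", SOURCE),
]

def count_with_embedded(graph):
    """Asserted triples plus distinct embedded triples used as subject or object."""
    embedded = {t for s, _, o in graph for t in (s, o) if isinstance(t, tuple)}
    return len(graph) + len(embedded)

print(len(rdf_star), count_with_embedded(rdf_star))  # asserted, asserted + embedded
print(len(reification))                              # asserted
```

This only counts triples. It says nothing about how many kinds of entities a modeller has to reason about, which is a separate measure of complexity.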

pfps commented 3 years ago

RDF reification has statements, which are underspecified. If one desires, one can use statements as, essentially, triple occurrences. No extra kinds of entities are required. RDF* has embedded triples. In many cases there is also a need for triple occurrences. So two new kinds of entities. As two is greater than one, I view the solution in RDF* as more complex.

hartig commented 3 years ago

Interesting perspective! Without you mentioning it, I may never have looked at it this way. Probably that is because the number of entities is typically not so relevant for systems (in contrast to the number of triples that need to be stored and processed).

pchampin commented 3 years ago

@pfps

What is an embedded triple? Is it something like a literal or something like an IRI? The new section is firmly on the side of literal. This has consequences. My view is that an embedded triple does not need to be something like a literal.

As you know, my position is that it is something like a literal. But I agree with you that this section does not need to be "tainted" by this assumption. In the latest commit (d0d948f), I slightly rephrased §2.1 to make it neutral w.r.t. this question.

pchampin commented 3 years ago

@pfps

As two is greater than one, I view the solution in RDF* as more complex.

The way I see it, the solutions in RDF* are less complex, because the language is more expressive. We trade additional complexity in the language itself (adding an extra kind of terms) for less complexity in the graphs using that language (fewer triples) and the serializations encoding these graphs (thanks to <<..>> and {|...|}).

w3c / rdf-star

Add material in the spec about the triple-occurrence distinction #75