Make Dataset#equals strict

rubensworks commented 5 years ago

Originally brought up in https://github.com/rdfjs/dataset-spec/issues/18#issuecomment-450806395.

I suggest to make DatasetCore#equals not test for isomorphism, but instead test strict equality. DatasetCore or Dataset could then have an additional isomorphic method.

As mentioned by @bergos, equals could be implemented as simple as result.difference(initial).size === 0.

The main reason I have for changing this is to be consistent with Term#equals. Furthermore, being isomorphic is not the same thing as being equal, so we should not confuse these things IMO.

blake-regalia commented 5 years ago

This actually affects all other operations too. If you only had an isomorphic #equals and all other operations are strict, you could end up with a situation like this:

a.equals(b);  // true

a.difference(b).size === 0;  // false
a.contains(b);  // false
b.contains(a);  // false
let u = a.union(b);
u.size === a.size;  // false
u.size === b.size;  // false
let i = a.intersection(b);
i.size === a.size;  // false
i.size === b.size;  // false

If anything, I would think that if you wanted isomorphism to hold under each operation, you would need to first canonicalize each dataset anyway.

vhf commented 5 years ago

Has this been fixed here https://github.com/rdfjs/dataset-spec/issues/18#issuecomment-453071846 or is it still open?

awwright commented 5 years ago

I think there's good reason to rename Dataset#equals, but I don't think "strict" equality is a useful concept. It's not meaningful to ask if two bnodes are the "same bnode" between graphs except to say they're conveying the same information, and mathematically this means testing if the graphs are isomorphic.

ericprud commented 5 years ago

Bnodes are scoped to a dataset rather than a graph, so it's possible for two graphs in a dataset to be tested for equivalence beyond isomorphism. Practically, it's handy to see if a copy of a graph has changed from the original (user changed their profile, librarian updated a PREMIS record, clinical record is an exact duplicate). Theoretically, the means you may be drawing the dataset boundary around your business process.

Regardless of whether rdfjs includes a .equals() to support such use cases, I agree that it should not be confused with either isomorphism or entailment. Users can supply their own .equals(), so long as the Dataset spec promises not to rewrite bnode labels.

ericprud commented 5 years ago

I created a strawman to describe the diff between equals and isomorphicTo.

(BTW, those anchors have "dom-" in them. I wonder if that's intended to convey that it applies to the DOM spec, or if it's just that the anchors are created in (our) DOM.

awwright commented 5 years ago

First, IMO, I don't think two graphs can share a bnode even if RDF Semantics says they can, as a matter of mathematics, because that would defeat the point of having bnodes being an existential quantifier: you can't conditions to an existential quantifier after the fact. But that's a quibble for that specification. Maybe I'm misunderstanding something, or maybe it's not a problem for some reason I haven't considered.

More relevant to this is, what case do you need to compare bnodes for "strict" equality because isomorphism checking doesn't suffice?

ericprud commented 5 years ago

Many use cases for this violate your position on the scope of bnodes as they count on some external reference to a bnode, either from another part of the dataset or from some program state. In order to duck this discussion, I'll use a subgraph use case; the results of a .getQuads() operation:

_:alice foaf:knows _:bob ;
  foaf:mbox <mailto:alice@example.com> .
_:bob foaf:mbox <mailto:bob@example.com> .
_:claire foaf:mbox <mailto:claire@example.com> .

let knows = f.namedNode(foaf:knows)
let alice = d.getQuads(null, knows, f.namedNode(<mailto:alice@example.com>))
let known1 = d.getQuads(alice, knows)
// UI process allows user to update `_:alice`'s `foaf:knows` to be `_:claire`
let known2 = d.getQuads(alice, knows)
console.log(known1.equals(known2)) // false
console.log(known1.isomorphicTo(known2)) // true

Having a .equals() isn't strictly necessary as the user can always implement it; but it is:

handy for testing and change detection
much cheaper than .isomorphicTo().

awwright commented 5 years ago

@ericprud Can you elaborate the example? As I take your example, nothing changed because the graph conveys the same information before and after: Before, "there exists a person who knows Alice" And after, "there exists a person who knows Alice"

ericprud commented 5 years ago

The part of the graph we're examining started out saying:

There exists someone with the mbox mailto:alice@example.com . There exists someone with the mbox mailto:bob@example.com . There exists someone with the mbox mailto:claire@example.com . The person with mbox mailto:alice@example.com knows the person with mbox mailto:bob@example.com.

The use case says that the user changed the graph to say:

There exists someone with the mbox mailto:alice@example.com . There exists someone with the mbox mailto:bob@example.com . There exists someone with the mbox mailto:claire@example.com . The person with mbox mailto:alice@example.com knows the person with mbox mailto:claire@example.com.

The way the program detected that those particular triples changed (many other things may have changed in the graph as well, but we specifically saved state capturing who alice knew), was by asking again for who alice knows and comparing the results with .equals().

vhf commented 5 years ago

I think @ericprud's definition is clear (and I really like the other WebIDL changes).

Here is the example in code:

const assert = require('assert')
const rdf = require('some dataset-spec compliant lib')

const first = (dataset) => dataset.toArray()[0]

const knows = rdf.namedNode('http://xmlns.com/foaf/0.1/knows')
const mbox = rdf.namedNode('http://xmlns.com/foaf/0.1/mbox')
const alice = rdf.blankNode('alice')
const bob = rdf.blankNode('bob')
const claire = rdf.blankNode('claire')
const mailto = (address) => rdf.namedNode(`mailto:${address}`)

const dataset = rdf.dataset([
  rdf.quad(alice, knows, bob),
  rdf.quad(alice, mbox, mailto('alice@example.com')),
  rdf.quad(bob, mbox, mailto('bob@example.com')),
  rdf.quad(claire, mbox, mailto('claire@example.com'))
])

const alice2 = dataset.match(null, mbox, mailto('alice@example.com')).toArray()
assert(alice2[0].subject.equals(alice))

const known1 = dataset.match(alice, knows)
assert(first(known1).object.equals(bob))

dataset.delete(rdf.quad(alice, knows, bob))
dataset.add(rdf.quad(alice, knows, claire))

const known2 = dataset.match(alice, knows)
assert(first(known2).object.equals(claire))

assert(!known1.equals(known2))
assert(known1.isomorphicTo(known2))

Here's how I had implemented .equals: https://github.com/rdf-ext/rdf-dataset-indexed/blob/ff76073bb73c96eea77b3c7c64917733a399b15a/Dataset.js#L36-L46 , it works in the above example case although it's a bit different than the suggested implementation.

A suggestion to move forward with this:

Merge @ericprud's work without the .isomorphicTo part
Create a new issue here to discuss .isomorphicTo (do we want it to be part of the spec? is it a feature that would not be used that often and requires a ~big implementation so we leave it out? etc)

If most of you agree with this: @ericprud do you want to take care of creating this PR without .isomorphicTo, otherwise can I cherry-pick from your repo and open it?

rubensworks commented 5 years ago

Merge @ericprud's work without the .isomorphicTo part

👍

Create a new issue here to discuss .isomorphicTo

Implementation of this is indeed not that trivial (I have one here). From a user-perspective, this would definitely be useful to have. However, this would probably not be an operation that is needed for most cases, so I currently lean towards not including it in the spec, to lower the burden of implementors. Furthermore, there are different ways to implement .isomorphicTo, so it would make sense to give developers control over this, so that an implementation can be chosen that is optimal for the use-case.

awwright commented 5 years ago

I still have to revisit this, but having a user select between one bnode or another bnode seems like the wrong way to implement a program. Blank nodes are existential quantifiers, not identifiers. It fundamentally doesn't make sense to me for a user to change a node in a statement from one bnode to another one. If you want this functionality, use an identifier (URI/IRI).

Even using the example for the sake of argument, you can still use isomorphism here. What I want to know is, what does a strict check do that isomorphism cannot?

Performance shouldn't be that much of an issue. Let n be the number of statements; if the two graphs contain the same bnodes, or are found in the same internal sort order, the performance will be O(n), same as strict equality. Only if the bnodes are different addresses in memory, and they're in a different internal sort order, will the performance approach O(n!).

to lower the burden of implementors

On the contrary, I think it's important to standardize the things that are most difficult to implement. The whole point of having a library is to remove burden from developers. It wouldn't be much of a library if all the difficult stuff always got shoveled down the road.

awwright commented 5 years ago

Also, here's my simple implementation for reference, which additionally returns the mapping of bnodes if such a mapping is found: https://github.com/awwright/node-rdf/blob/master/lib/GraphEquals.js

I haven't done much work between the differences in the algorithms; mine is a simple subgraph matching algorithm instead of e.g. Jeremy Carroll's algorithm; but as I understand, the pathological case is always O(n!) and the best we can do is going to be for certain subsets of cases, like preserved sort order, and such.

ericprud commented 5 years ago

So where is consensus at this point? My reading of the above is that most folks want to merge in my strict equals() definition but that @awwright remains unconvinced that it's a practical or proper way to manipulate RDF data.

awwright commented 5 years ago

Well I wouldn't say unconvinced, but I'm looking for persuading information. If there's an application of this, where it needs to know some Datasets have the same statements pointing to the same bnodes, and mere isomorphism is not allowed, then I can consider that.

With such an example in mind, there'd be some follow-up questions. Is the name appropriate? The name "strict" sounds like the === operator which is usually preferable in ECMAScript; but for us, according to RDF Semantics, it's unpreferable.

Finally, in general, there's just a bunch of different ways to test for equality:

Are two datasets the same instance in memory?
Do two different instances of Dataset point to the same instances of statements? (this could be especially important if Quad is mutable!)
Are two Datasets different if they have different bnodes with the same label?
Do the Datasets share the same set of named graphs, and are each pair isomorphic?

The only definition of "equals" that ensures two graphs encode the same data as each other is isomorphism; developers who need something different might be better suited by specifically writing what they're looking for. Isomorphism is the only hard test on that list, anyways.

And again, performance isn't too much of an issue. If two graphs don't use any bnodes, or store statements in the same internal sort order, or point to the same bnode instance, or use the same bnode label, those cases can all be optimized down to O(n).

rdfjs / dataset-spec

Make Dataset#equals strict #23