sweetrdf / rdfInterface

MIT License
6 stars 2 forks source link

Expected behavior of non-string literal comparison #15

Open zozlak opened 3 years ago

zozlak commented 3 years ago

We should decide and include in the test suite tests for:

$l1 = DataFactory::literal('1');
$l2 = DataFactory::literal(1);
$l3 = DataFactory::literal('1', null, 'http://www.w3.org/2001/XMLSchema#int');
$l4 = DataFactory::literal(1, null, 'http://www.w3.org/2001/XMLSchema#int');
$l5 = DataFactory::literal('01', null, 'http://www.w3.org/2001/XMLSchema#int');

$l1 == $l2; // true or false?
$l2 == $l3; // true or false?
$l3 == $l4; // true or false?
$l3 == $l5; // true or false?
$l4 == $l5; // true or false?

From the RDF specification perspective (https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal):

According to that (language tags skipped as they are always missing here):

comparison left term lexical form right term lexical form left term datatype right term datatype result
$l1 == $l2 "1" "1" xsd:string xsd:string true
$l2 == $l3 "1" "1" xsd:string xsd:int false
$l3 == $l4 "1" "1" xsd:int xsd:int true
$l3 == $l5 "1" "1" xsd:int xsd:int true
$l4 == $l5 "1" "01" xsd:int xsd:int false
zozlak commented 3 years ago

@k00ni what do you think? Should we strictly follow RDF specification or not?

All in all there might be many solutions like:

k00ni commented 3 years ago

Saying that all values are strings and their type is determine by $datatype is clean.

So in a check-scenario we cast value (if necessary) and do a ===?

zozlak commented 3 years ago

Yes, Literal::equals(), should (according to the RDF spec) look more or less like that:

public function equals(\rdfInterface\Term $term): bool {
   return $term instanceof \rdfInterface\Literal && 
        (string) $this->getValue() === (string) $term->getValue() && 
        $this->getDatatype() === $term->getDatatype() &&
        $this->getLang() === $term->getLang()
}

Anyway here the question is more about "what we think the Literal constructor should do"?

  1. Should it try to set datatype depending on the value's PHP type? (like in RDF-turtle for some numeric data types) And if so, which PHP types should be recognized? This would e.g. make $l2 == $l3 in the example in the description.
    • We may see it as convenient for PHP programmers but we can also see it as overengineering .
  2. Or should it cast the value to the type (possibly implicitly) determined by $lang and $datatype parameters? (making e.g. $l4 == $l5)
    • We may see it intuitive for programmers and therefore desired but we can also argue it breaks point 2.b of this part of the RDF spec "Otherwise, the literal is ill-typed and no literal value can be associated with the literal. Such a case produces a semantic inconsistency but is not syntactically ill-formed. Implementations MUST accept ill-typed literals and produce RDF graphs from them. Implementations MAY produce warnings when encountering ill-typed literals."
    • Requesting it would make rdfInterface implementation more troublesome as while the list of allowed RDF literal types isn't very long, it's definitely not short (see https://www.w3.org/TR/rdf11-concepts/#xsd-datatypes). On the other hand this can be extracted to a separate library and be just reused by everyone.
  3. Or should it just leave things as they are? (making it work like in the table in this issue description).
    • We may argue it's as closest as possible to the RDF spec cited in the previous point which we value more than programmers convenience brought by 1. or 2. (they are mutually exclusive).

Which options do you like the most @k00ni ?

k00ni commented 3 years ago

If I look from a store perspective, I favor the everything-is-a-string approach. Serializers may also benefit.

Regarding "what we think the Literal constructor should do":

Keeping bool/int/... values temporarily in an instance is good, but if you wanna persist them you may run into problems. I can't save bool in a SQLite table for instance.

We could extend getValue with a parameter like $castValue. getValue may return a string per default, but if $castValue is true it returns the value based on given $datatype. For instance bool if $datatype == xsd:boolean.

zozlak commented 3 years ago

I would only allow string and Stringable values

But in practice there is no difference between int/float/bool and Stringable... In all contexts which implicitly cast to string (e.g. concatenation with a string) both int/float/bool and Stringable produce a string. And for contexts without implicit cast to a string (e.g. comparison) you have to add explicit (string) cast both to int/float/bool and Stringable.

Does it mean you would prefer to only allow string to be passed as the value in the constructor?

We could extend getValue with a parameter like $castValue. getValue may return a string per default, but if $castValue is true it returns the value based on given $datatype.

Yes, this sound like a good idea to me.

zozlak commented 3 years ago

All in all I would propose to:

k00ni commented 3 years ago

Let me think about it. I will respond probably next week.

zozlak commented 3 years ago

Experimenting with the Dataset implementation I run into following trouble:

$l1 = DataFactory::literal(1);
$l2 = DataFactory::literal('1');

// we could expect (and it's easy to achieve it on the plain term level)
assert($l1->getValue(CAST_LEXICAL_FORM) === '1');
assert($l2->getValue(CAST_LEXICAL_FORM) === '1');
assert($l1->equals($l2));
assert($l1->getValue(CAST_NONE) === 1);
assert($l2->getValue(CAST_NONE) === '1');

// but things go complicated once we reach the Dataset
$d = new Dataset();
$d->add($l1);
$d->add($l2);
assert(count($d) === 1); // $l1 equals $l2 so they can't create separate edges
assert(current($d)->getValue(CAST_LEXICAL_FORM) === '1'); // pretty obviously
assert(current($d)->getValue(CAST_NONE) === ???); // either 1 or '1' - no one can tell

And once we a) don't know what to expect from CAST_NONE when a literal is inside a dataset b) datasets are the main use case (I don't expect literal to be widely used out of the dataset context), I don't see much sense in requiring implementations to provide this kind of value cast.

Another (admittedly very subjective) issue is that requesting implementations to be able to distinguish RDF-non-equal literals makes it impossible to implement immediate literals comparison with the help of singleton literals - an optimization I definitely wanted to use.

A solution to both these problems would be to:

Such a change doesn't affect Literal interface API but affects implementations but changing some tests.

Anyway it's an important design decision and it should be made carefully so please take your time (and try to experiment with various variants when you have doubts). We are not in a hurry.

k00ni commented 3 years ago

Maybe I didn't fully understand your points but it seems over engineered.


I created a branch called issue-15-all-values-strings and added basic implementations of our interfaces (a few from my in-memory store and DataSet from simpleRdf) to have something to test. It may act as a playing ground and we can discuss things using testable examples. The following test https://github.com/sweetrdf/rdfInterface/blob/feature/issue-15-all-values-strings/tests/PlaygroundTest.php contains (most of your) test-code from above.

When running this test (using vendor/bin/phpunit tests/) it fails with:

1) rdfInterface\tests\PlaygroundTest::test
Failed asserting that 2 matches expected 1.

/var/www/html/rdfInterface/tests/PlaygroundTest.php:32

At that line the number of quads in DataSet instance is checked:

$this->assertEquals(1, count($d));


So far so good. In my implementations only string and Stringable are allowed as Literal values. After testing a little I think we have to deal with issues on 2 levels:

  1. Quad level
  2. DataSet level

For (1): Literal values are saved as a string in PHP. User may provide a datatype. When receiving a value from a Literal instance one can decide if raw string is enough or a cast is required (string => bool). Latter is based on a previously given datatype URI (e.g. xsd:boolean). If none was given, it should return value as it is (=string). If I understand the specification correctly it should be lexical form approach using an Unicode string.

For (2): At first, your DataSet implementation is very complex and there are a lot of functions which I see rather optional and use-case dependent. As far as I can tell, you have an iterator implementation mixed with matching functions, map-reduce and Store-like functions. DataSet interface as it is right now forces this kind of implementation, but I am not sure if that is even needed.

When following the Literal-values-are-strings: A DataSet implementation should consider '1' and 1 as equal, if no type is given (because both will be casted to string). If types are given, it should cast them based on type and compare by ===. This way $this->assertTrue(count($d) === 1); is solved IMHO.

Last two cases...

$this->assertTrue(current($d)->getObject()->getValue() === '1');
$this->assertTrue(current($d)->getObject()->getValue() === 1);

... can't be an issue if data set ignores one of the quads because it has it already. Or you have to provide datatypes too and require it when calling getValue like getValue(true) (to require a cast based on datatype).


I am not sure how to "solve" this issue. IMHO we should "solve" existing "issues" on triple/quad level first, like connecting different parser implementations for instance. Both generate a QuadIterator instance (how they do that is up to them) and we compare. Also parser from vendor A generates a QuadIterator and serializer from vendor B serializes it. Your quickRdf IO seems to be a good practice range for that.

zozlak commented 3 years ago

Ad 1 I agree on the getValue() behavior. But I would like to keep ability to pass PHP int/float/bool with automatic type recognition as described here. Btw the type autodetection code can be placed into RdfHelpers so people don't need to always copy-paste it (or write from scratch).

When following the Literal-values-are-strings: A DataSet implementation should consider '1' and 1 as equal, if no type is given (because both will be casted to string). If types are given, it should cast them based on type and compare by ===. This way $this->assertTrue(count($d) === 1); is solved IMHO.

My proposal above also solves it, just in a little different way. As Literal(1) will be automatically promoted to Literal("1", null, "xsd:integer"), it will be clearly different from Literal("1") (being automatically promoted to Literal("1", null, "xsd:string)). Of course then we expect 2 in this test.

Summing up it seems we agree on the "literal value output and literal compare side" and we only need to agree if we want or not the "automatic int/bool/float promotion" behavior. I would like to keep it as a) I like the way it works in EasyRdf b) I expect automatic type detection to pop up sooner or later for Stringables (especially for date/time ones) and having autodetection for scalars would keep it consistent.

Ad 2


Side issue - please don't include any implementation in this repository. Please just create a repository for your implementation in a same way it works for simpleRdf and quickRdf. SimpleRdf and quickRdf show also:

k00ni commented 3 years ago

Quick note: I added these implementations temporarily to have example code to test and discuss about, because simpleRdf depends on rdfInterface already. Because implementations depend on interfaces I had to change code in this repo too (to reflect parameters only being string and Stringable) not just implementation details. As far as I know that is only possible if you have all in one place (otherwise its to much overhead when doing it in two or more repositories).

zozlak commented 3 years ago

Just create your own branch and use https://getcomposer.org/doc/articles/versions.md#branches while specifying your implementation package dependency on the rdfInterface. That's what I do now while experimenting with quite major (but probably not affecting you much) changes in branch termCompare.

(btw, wow, composer 2 checks out dev-branchName immediately while in composer 1 it took minutes)

k00ni commented 3 years ago

I am fine with either way. My wish would be to have a simple way of data exchange between libraries (e.g. store, parser, serializer). That seems to be solved on quad level as of now. Parsers etc. are out of my reach currently (no active projects) and I may only contribute on a theoretical level.

You wrote:

I would like to keep it as a) I like the way it works in EasyRdf b) I expect automatic type detection to pop up sooner or later for Stringables (especially for date/time ones) and having autodetection for scalars would keep it consistent.

Things like type detection etc. might be to detailed because it forces/assume certain implementation details. I have to check EasyRdf approach, but maybe you can make a rough PR which reflects your thoughts?

zozlak commented 3 years ago

By the way, if your library transforms something to Terms:

k00ni commented 3 years ago

It is useless to implement Terms (BlankNode/NamedNode/Literal/DefaultGraph/Quad) on your own.

At the time I needed these Term implementations there wasn't a release of simpleRdf available. I didn't know if you will made drastic changes at that time so I decided to implement it myself. Also saw it as a good practice to see if our interfaces work as expected. I might switch to simpleRdf, but isn't it the beauty of rdfInterface that everyone can have its own implementation but is still compatible to each other?

EDIT: Just saw that quickRdf also has Term implementations (thought it was only for parsers etc.). Only looked in simpleRdf.

zozlak commented 3 years ago

but isn't it the beauty of rdfInterface that everyone can have its own implementation but is still compatible to each other?

Yes it is. But the beauty of rdfInterface is also you can reuse others work and everything stays compatible :-)

At the time I needed these Term implementations there wasn't a release of simpleRdf available. I didn't know if you will made drastic changes at that time so I decided to implement it myself. Also saw it as a good practice to see if our interfaces work as expected.

Fair points. I hope everything should stabilize soon. I'm slowly running out of new ideas ;-)

Just saw that quickRdf also has Term implementations

quickRdf is my state of the art implementation of terms and the dataset. In fact my motivation to develop the simpleRdf was mainly to have another implementation of terms and dataset so I can check its interoperability with quickRdf :-) (quickRdf uses some tricks to speed things up and it was a good question if it will strill work once "foreign" Terms are dropped on it)

I hope I will shortly prepare a performance comparison with EasyRdf. Initial tests suggest it's significantly faster in simple filtering by subject/predicate/object and consumes significantly less memory.

zozlak commented 3 years ago

Side note - your implementation of the equals() method in all terms classes is wrong as it will return true only if the compared term is of the same class as your object (and has same property values) while you should be able to correctly compare against term from other rdfInterface implementations as well.

Long story short - when you implement an interface methods you must only rely on methods guaranteed by this interface and you can't depend on objects internal structure (as you have no impact on internal structure of other interface implementations).

Yet another reason not to implement terms by yourself ;-)