Expected behavior of non-string literal comparison

sweetrdf / rdfInterface

MIT License

7 stars 2 forks source link

Expected behavior of non-string literal comparison #15

Open zozlak opened 3 years ago

zozlak commented 3 years ago

We should decide and include in the test suite tests for:

$l1 = DataFactory::literal('1');
$l2 = DataFactory::literal(1);
$l3 = DataFactory::literal('1', null, 'http://www.w3.org/2001/XMLSchema#int');
$l4 = DataFactory::literal(1, null, 'http://www.w3.org/2001/XMLSchema#int');
$l5 = DataFactory::literal('01', null, 'http://www.w3.org/2001/XMLSchema#int');

$l1 == $l2; // true or false?
$l2 == $l3; // true or false?
$l3 == $l4; // true or false?
$l3 == $l5; // true or false?
$l4 == $l5; // true or false?

From the RDF specification perspective (https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal):

Literal values are always strings no matter their datatype
- There is a distinction between a lexical form - the string value of a literal - and a value - a lexical form mapped to the datatype value space (e.g. a literal with datatype xsd:boolean and lexical form of 1 has value of true)
All literals have data type
- If they have a lang tag, it's implicitly http://www.w3.org/1999/02/22-rdf-syntax-ns#langString
- If they don't have a lang tag and it's not specified explicitly, it's implicitly http://www.w3.org/2001/XMLSchema#string
Literal of type http://www.w3.org/1999/02/22-rdf-syntax-ns#langString have a lang tag, literals of other type don't
Literals are equal only if their lang tags, datatypes and lexical forms match. So pretty counterintuitively for a PHP programmer "1"^^xsd:int and "01"^^xsd:int aren't equal in RDF terms.

According to that (language tags skipped as they are always missing here):

comparison	left term lexical form	right term lexical form	left term datatype	right term datatype	result
`$l1 == $l2`	`"1"`	`"1"`	`xsd:string`	`xsd:string`	`true`
`$l2 == $l3`	`"1"`	`"1"`	`xsd:string`	`xsd:int`	`false`
`$l3 == $l4`	`"1"`	`"1"`	`xsd:int`	`xsd:int`	`true`
`$l3 == $l5`	`"1"`	`"1"`	`xsd:int`	`xsd:int`	`true`
`$l4 == $l5`	`"1"`	`"01"`	`xsd:int`	`xsd:int`	`false`

zozlak commented 3 years ago

@k00ni what do you think? Should we strictly follow RDF specification or not?

All in all there might be many solutions like:

We assume Literal constructor should recognize PHP scalar types and assign corresponding datatype automatically (making $l1 != $l2 and $l2 == $l3 and $l2 == $l4.
- It's worth noting that such a feature is built into RDF-turtle serialization - see https://www.w3.org/TR/turtle/#abbrev. In RDF-turtle it's defined for xsd:int, xsd:decimal and xsd:double. In PHP we can easily recognize int=>xsd:int, float=>xsd:double/xsd:decimal, bool=>xsd:boolean
We assume Literal::getValue() should map the value to the datatype set making $l4 == $l5.

k00ni commented 3 years ago

Saying that all values are strings and their type is determine by $datatype is clean.

So in a check-scenario we cast value (if necessary) and do a ===?

zozlak commented 3 years ago

Yes, Literal::equals(), should (according to the RDF spec) look more or less like that:

public function equals(\rdfInterface\Term $term): bool {
   return $term instanceof \rdfInterface\Literal && 
        (string) $this->getValue() === (string) $term->getValue() && 
        $this->getDatatype() === $term->getDatatype() &&
        $this->getLang() === $term->getLang()
}

Anyway here the question is more about "what we think the Literal constructor should do"?

Should it try to set datatype depending on the value's PHP type? (like in RDF-turtle for some numeric data types) And if so, which PHP types should be recognized? This would e.g. make $l2 == $l3 in the example in the description.
- We may see it as convenient for PHP programmers but we can also see it as overengineering .
Or should it cast the value to the type (possibly implicitly) determined by $lang and $datatype parameters? (making e.g. $l4 == $l5)
- We may see it intuitive for programmers and therefore desired but we can also argue it breaks point 2.b of this part of the RDF spec "Otherwise, the literal is ill-typed and no literal value can be associated with the literal. Such a case produces a semantic inconsistency but is not syntactically ill-formed. Implementations MUST accept ill-typed literals and produce RDF graphs from them. Implementations MAY produce warnings when encountering ill-typed literals."
- Requesting it would make rdfInterface implementation more troublesome as while the list of allowed RDF literal types isn't very long, it's definitely not short (see https://www.w3.org/TR/rdf11-concepts/#xsd-datatypes). On the other hand this can be extracted to a separate library and be just reused by everyone.
Or should it just leave things as they are? (making it work like in the table in this issue description).
- We may argue it's as closest as possible to the RDF spec cited in the previous point which we value more than programmers convenience brought by 1. or 2. (they are mutually exclusive).

Which options do you like the most @k00ni ?

k00ni commented 3 years ago

If I look from a store perspective, I favor the everything-is-a-string approach. Serializers may also benefit.

Regarding "what we think the Literal constructor should do":

I would only allow string and Stringable values
bool, int, float have to be coupled with an appropriate $datatype ala xsd:integer etc.

Keeping bool/int/... values temporarily in an instance is good, but if you wanna persist them you may run into problems. I can't save bool in a SQLite table for instance.

We could extend getValue with a parameter like $castValue. getValue may return a string per default, but if $castValue is true it returns the value based on given $datatype. For instance bool if $datatype == xsd:boolean.

zozlak commented 3 years ago

I would only allow string and Stringable values

But in practice there is no difference between int/float/bool and Stringable... In all contexts which implicitly cast to string (e.g. concatenation with a string) both int/float/bool and Stringable produce a string. And for contexts without implicit cast to a string (e.g. comparison) you have to add explicit (string) cast both to int/float/bool and Stringable.

Does it mean you would prefer to only allow string to be passed as the value in the constructor?

We could extend getValue with a parameter like $castValue. getValue may return a string per default, but if $castValue is true it returns the value based on given $datatype.

Yes, this sound like a good idea to me.

zozlak commented 3 years ago

All in all I would propose to:

leave the Literal constructor as it is\ (just set the default $datatype to null because it's far more complex in practice - see the __construct() docblock in https://github.com/sweetrdf/rdfInterface/blob/master/src/rdfInterface/Literal.php)
Introduce three constants: Literal::CAST_LEXICAL_FORM = 1, Literal::CAST_NONE = 2, Literal::CAST_DATATYPE = 3.
Change Literal::getValue() signature to getValue(int $cast = Literal::CAST_LEXICAL_FORM): mixed. The return value should be:
- for CAST_LEXICAL_FORM the lexical form of the literal as specified here. Which is to be used in all contexts where RDF compliance is needed (comparing literals, serialization, etc.)
- for CAST_NONE just a value passed to the constructor is returned which allows to preserve complex objects, e.g. a DateTime wrapper class (a wrapper class would be needed for PHP's DateTime as it isn't Stringable)
- what is returned for CAST_DATATYPE is up to the implementation

k00ni commented 3 years ago

Let me think about it. I will respond probably next week.

zozlak commented 3 years ago

Experimenting with the Dataset implementation I run into following trouble:

$l1 = DataFactory::literal(1);
$l2 = DataFactory::literal('1');

// we could expect (and it's easy to achieve it on the plain term level)
assert($l1->getValue(CAST_LEXICAL_FORM) === '1');
assert($l2->getValue(CAST_LEXICAL_FORM) === '1');
assert($l1->equals($l2));
assert($l1->getValue(CAST_NONE) === 1);
assert($l2->getValue(CAST_NONE) === '1');

// but things go complicated once we reach the Dataset
$d = new Dataset();
$d->add($l1);
$d->add($l2);
assert(count($d) === 1); // $l1 equals $l2 so they can't create separate edges
assert(current($d)->getValue(CAST_LEXICAL_FORM) === '1'); // pretty obviously
assert(current($d)->getValue(CAST_NONE) === ???); // either 1 or '1' - no one can tell

And once we a) don't know what to expect from CAST_NONE when a literal is inside a dataset b) datasets are the main use case (I don't expect literal to be widely used out of the dataset context), I don't see much sense in requiring implementations to provide this kind of value cast.

Another (admittedly very subjective) issue is that requesting implementations to be able to distinguish RDF-non-equal literals makes it impossible to implement immediate literals comparison with the help of singleton literals - an optimization I definitely wanted to use.

A solution to both these problems would be to:

request the Literal constructor to recognize PHP scalar types (int/float/bool) and set the datatype accordingly (until the datatype is not set explicitly; by the way it's what EasyRdf Resource::addLiteral() does).
Leave the Literal::CAST_DATATYPE up to implementations (as it's described now).
Drop the Literal::CAST_NONE cast type as we don't have any clear idea about how it should work now. Btw I would leave the $cast parameter as int allowing implementations to provide their own "cast types" if they want.

Such a change doesn't affect Literal interface API but affects implementations but changing some tests.

Anyway it's an important design decision and it should be made carefully so please take your time (and try to experiment with various variants when you have doubts). We are not in a hurry.

k00ni commented 3 years ago

Maybe I didn't fully understand your points but it seems over engineered.

I created a branch called issue-15-all-values-strings and added basic implementations of our interfaces (a few from my in-memory store and DataSet from simpleRdf) to have something to test. It may act as a playing ground and we can discuss things using testable examples. The following test https://github.com/sweetrdf/rdfInterface/blob/feature/issue-15-all-values-strings/tests/PlaygroundTest.php contains (most of your) test-code from above.

When running this test (using vendor/bin/phpunit tests/) it fails with:

1) rdfInterface\tests\PlaygroundTest::test
Failed asserting that 2 matches expected 1.

/var/www/html/rdfInterface/tests/PlaygroundTest.php:32

At that line the number of quads in DataSet instance is checked:

$this->assertEquals(1, count($d));

So far so good. In my implementations only string and Stringable are allowed as Literal values. After testing a little I think we have to deal with issues on 2 levels:

Quad level
DataSet level

For (1): Literal values are saved as a string in PHP. User may provide a datatype. When receiving a value from a Literal instance one can decide if raw string is enough or a cast is required (string => bool). Latter is based on a previously given datatype URI (e.g. xsd:boolean). If none was given, it should return value as it is (=string). If I understand the specification correctly it should be lexical form approach using an Unicode string.

For (2): At first, your DataSet implementation is very complex and there are a lot of functions which I see rather optional and use-case dependent. As far as I can tell, you have an iterator implementation mixed with matching functions, map-reduce and Store-like functions. DataSet interface as it is right now forces this kind of implementation, but I am not sure if that is even needed.

When following the Literal-values-are-strings: A DataSet implementation should consider '1' and 1 as equal, if no type is given (because both will be casted to string). If types are given, it should cast them based on type and compare by ===. This way $this->assertTrue(count($d) === 1); is solved IMHO.

Last two cases...

$this->assertTrue(current($d)->getObject()->getValue() === '1');
$this->assertTrue(current($d)->getObject()->getValue() === 1);

... can't be an issue if data set ignores one of the quads because it has it already. Or you have to provide datatypes too and require it when calling getValue like getValue(true) (to require a cast based on datatype).

I am not sure how to "solve" this issue. IMHO we should "solve" existing "issues" on triple/quad level first, like connecting different parser implementations for instance. Both generate a QuadIterator instance (how they do that is up to them) and we compare. Also parser from vendor A generates a QuadIterator and serializer from vendor B serializes it. Your quickRdf IO seems to be a good practice range for that.

zozlak commented 3 years ago

Ad 1 I agree on the getValue() behavior. But I would like to keep ability to pass PHP int/float/bool with automatic type recognition as described here. Btw the type autodetection code can be placed into RdfHelpers so people don't need to always copy-paste it (or write from scratch).

When following the Literal-values-are-strings: A DataSet implementation should consider '1' and 1 as equal, if no type is given (because both will be casted to string). If types are given, it should cast them based on type and compare by ===. This way $this->assertTrue(count($d) === 1); is solved IMHO.

My proposal above also solves it, just in a little different way. As Literal(1) will be automatically promoted to Literal("1", null, "xsd:integer"), it will be clearly different from Literal("1") (being automatically promoted to Literal("1", null, "xsd:string)). Of course then we expect 2 in this test.

Summing up it seems we agree on the "literal value output and literal compare side" and we only need to agree if we want or not the "automatic int/bool/float promotion" behavior. I would like to keep it as a) I like the way it works in EasyRdf b) I expect automatic type detection to pop up sooner or later for Stringables (especially for date/time ones) and having autodetection for scalars would keep it consistent.

Ad 2

Yes, it is complex but:
- All in all I would consider it's API simpler comparing to e.g. EasyRdf Graph (while allowing much more operations)
- If someone needs just an iterable over Quads, there is QuadIterator.
- Some methods already have separate interfaces (DatasetCompare, DatasetMapReduce) making them optional and we can consider extracting a few other to separate interface(s). Please propose something
- Last but not least you don't have to implement a Dataset on your own. The beauty of having a common interface is you can just reuse a Dataset from an existing implementation (like simpleRdf or quickRdf). And just to be sure - not by copy-pasting :-)
- I think it's a good topic for a separate issue.

Side issue - please don't include any implementation in this repository. Please just create a repository for your implementation in a same way it works for simpleRdf and quickRdf. SimpleRdf and quickRdf show also:

How to reuse tests from rdfInterface without copy-pasting.
How to reuse implementation of a given term from an external library without copy-pasting (e.g. they use DefaultGraph implementation provided by rdfHelpers).

k00ni commented 3 years ago

Quick note: I added these implementations temporarily to have example code to test and discuss about, because simpleRdf depends on rdfInterface already. Because implementations depend on interfaces I had to change code in this repo too (to reflect parameters only being string and Stringable) not just implementation details. As far as I know that is only possible if you have all in one place (otherwise its to much overhead when doing it in two or more repositories).

zozlak commented 3 years ago

Just create your own branch and use https://getcomposer.org/doc/articles/versions.md#branches while specifying your implementation package dependency on the rdfInterface. That's what I do now while experimenting with quite major (but probably not affecting you much) changes in branch termCompare.

(btw, wow, composer 2 checks out dev-branchName immediately while in composer 1 it took minutes)

k00ni commented 3 years ago

I am fine with either way. My wish would be to have a simple way of data exchange between libraries (e.g. store, parser, serializer). That seems to be solved on quad level as of now. Parsers etc. are out of my reach currently (no active projects) and I may only contribute on a theoretical level.

You wrote:

I would like to keep it as a) I like the way it works in EasyRdf b) I expect automatic type detection to pop up sooner or later for Stringables (especially for date/time ones) and having autodetection for scalars would keep it consistent.

Things like type detection etc. might be to detailed because it forces/assume certain implementation details. I have to check EasyRdf approach, but maybe you can make a rough PR which reflects your thoughts?

zozlak commented 3 years ago

By the way, if your library transforms something to Terms:

It is useless to implement Terms (BlankNode/NamedNode/Literal/DefaultfGraph/Quad) on your own. We already have simpleRdf and quickRdf for that.

It would be nice if you don't enforce a particular Terms implementation but let the user choose (see quickRdfIo). Of course you can still provide a default implementation (see example below; of course it's fake but I hope it's clear).

use rdfInterface\DataFactory;

class mySparqlExecutor {
  private DataFactory $dataFactory;

  public function __construct(DataFactory $dataFactory = null) {
      $this->dataFactory = $dataFactory ?? new \simpleRdf\DataFactory();
  }

  public function executeSparql(string $query) {
      $rawResults = (... gather them somehow ...);
      foreach ($rawResults as $row) {
          foreach ($row as $var) {
              if ($var['type'] === 'literal') {
                 $term = $this->dataFactory::literal($var['value'], $var['lang'], $var['datatype']);
              } (...)
          }
      }
      (...)
  }
}

k00ni commented 3 years ago

It is useless to implement Terms (BlankNode/NamedNode/Literal/DefaultGraph/Quad) on your own.

At the time I needed these Term implementations there wasn't a release of simpleRdf available. I didn't know if you will made drastic changes at that time so I decided to implement it myself. Also saw it as a good practice to see if our interfaces work as expected. I might switch to simpleRdf, but isn't it the beauty of rdfInterface that everyone can have its own implementation but is still compatible to each other?

EDIT: Just saw that quickRdf also has Term implementations (thought it was only for parsers etc.). Only looked in simpleRdf.

zozlak commented 3 years ago

but isn't it the beauty of rdfInterface that everyone can have its own implementation but is still compatible to each other?

Yes it is. But the beauty of rdfInterface is also you can reuse others work and everything stays compatible :-)

At the time I needed these Term implementations there wasn't a release of simpleRdf available. I didn't know if you will made drastic changes at that time so I decided to implement it myself. Also saw it as a good practice to see if our interfaces work as expected.

Fair points. I hope everything should stabilize soon. I'm slowly running out of new ideas ;-)

Just saw that quickRdf also has Term implementations

quickRdf is my state of the art implementation of terms and the dataset. In fact my motivation to develop the simpleRdf was mainly to have another implementation of terms and dataset so I can check its interoperability with quickRdf :-) (quickRdf uses some tricks to speed things up and it was a good question if it will strill work once "foreign" Terms are dropped on it)

I hope I will shortly prepare a performance comparison with EasyRdf. Initial tests suggest it's significantly faster in simple filtering by subject/predicate/object and consumes significantly less memory.

zozlak commented 3 years ago

Side note - your implementation of the equals() method in all terms classes is wrong as it will return true only if the compared term is of the same class as your object (and has same property values) while you should be able to correctly compare against term from other rdfInterface implementations as well.

Long story short - when you implement an interface methods you must only rely on methods guaranteed by this interface and you can't depend on objects internal structure (as you have no impact on internal structure of other interface implementations).

Yet another reason not to implement terms by yourself ;-)