Open zozlak opened 3 years ago
@k00ni what do you think? Should we strictly follow RDF specification or not?
All in all there might be many solutions like:
- Literal constructor should recognize PHP scalar types and assign the corresponding datatype automatically (making $l1 != $l2 and $l2 == $l3 and $l2 == $l4).
- Literal::getValue() should map the value to the datatype set, making $l4 == $l5.
Saying that all values are strings and their type is determined by $datatype is clean.
So in a check-scenario we cast value (if necessary) and do a ===?
Yes, Literal::equals() should (according to the RDF spec) look more or less like that:
    public function equals(\rdfInterface\Term $term): bool {
        // RDF term equality: same lexical form, same datatype, same language tag
        return $term instanceof \rdfInterface\Literal &&
            (string) $this->getValue() === (string) $term->getValue() &&
            $this->getDatatype() === $term->getDatatype() &&
            $this->getLang() === $term->getLang();
    }
Anyway here the question is more about "what we think the Literal constructor should do"?
- $l2 == $l3 in the example in the description.
- $lang and $datatype parameters? (making e.g. $l4 == $l5)
Which options do you like the most @k00ni ?
If I look from a store perspective, I favor the everything-is-a-string approach. Serializers may also benefit.
Regarding "what we think the Literal constructor should do":
- string and Stringable values
- $datatype ala xsd:integer etc.
Keeping bool/int/... values temporarily in an instance is good, but if you wanna persist them you may run into problems. I can't save bool in a SQLite table for instance.
We could extend getValue with a parameter like $castValue. getValue may return a string per default, but if $castValue is true it returns the value based on given $datatype. For instance bool if $datatype == xsd:boolean.
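A minimal sketch of the $castValue idea, assuming PHP 8.0 or newer. The class name SketchLiteral and the three datatype mappings are illustrative assumptions and are not part of rdfInterface:

```php
<?php
declare(strict_types=1);

// Hypothetical class used only for illustration; not part of rdfInterface.
final class SketchLiteral
{
    private const XSD = 'http://www.w3.org/2001/XMLSchema#';

    public function __construct(
        private string $value,
        private ?string $datatype = null
    ) {
    }

    public function getValue(bool $castValue = false): string|int|float|bool
    {
        if (!$castValue) {
            return $this->value; // default: the stored string, untouched
        }
        // Assumed mapping; a real implementation would cover more datatypes.
        return match ($this->datatype) {
            self::XSD . 'boolean' => in_array($this->value, ['true', '1'], true),
            self::XSD . 'integer' => (int) $this->value,
            self::XSD . 'double' => (float) $this->value,
            default => $this->value, // unknown or missing datatype: no cast
        };
    }
}

$l = new SketchLiteral('1', 'http://www.w3.org/2001/XMLSchema#boolean');
var_dump($l->getValue());     // string(1) "1"
var_dump($l->getValue(true)); // bool(true)
```

With this shape a store or serializer only ever calls getValue() and gets a string, while application code can opt in to the cast.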
I would only allow string and Stringable values
But in practice there is no difference between int/float/bool and Stringable... In all contexts which implicitly cast to string (e.g. concatenation with a string) both int/float/bool and Stringable produce a string. And for contexts without implicit cast to a string (e.g. comparison) you have to add explicit (string) cast both to int/float/bool and Stringable.
Does it mean you would prefer to only allow string to be passed as the value in the constructor?
We could extend getValue with a parameter like $castValue. getValue may return a string per default, but if $castValue is true it returns the value based on given $datatype.
Yes, this sounds like a good idea to me.
All in all I would propose to:
- Keep the Literal constructor as it is (just set the default $datatype to null because it's far more complex in practice - see the __construct() docblock in https://github.com/sweetrdf/rdfInterface/blob/master/src/rdfInterface/Literal.php)
- Add constants Literal::CAST_LEXICAL_FORM = 1, Literal::CAST_NONE = 2, Literal::CAST_DATATYPE = 3.
- Change the Literal::getValue() signature to getValue(int $cast = Literal::CAST_LEXICAL_FORM): mixed. The return value should be:
  - CAST_LEXICAL_FORM: the lexical form of the literal as specified here. It is to be used in all contexts where RDF compliance is needed (comparing literals, serialization, etc.)
  - CAST_NONE: just the value passed to the constructor is returned, which allows preserving complex objects, e.g. a DateTime wrapper class (a wrapper class would be needed for PHP's DateTime as it isn't Stringable)
  - CAST_DATATYPE: is up to the implementation
Let me think about it. I will respond probably next week.
Experimenting with the Dataset implementation I ran into the following trouble:
$l1 = DataFactory::literal(1);
$l2 = DataFactory::literal('1');
// we could expect (and it's easy to achieve it on the plain term level)
assert($l1->getValue(CAST_LEXICAL_FORM) === '1');
assert($l2->getValue(CAST_LEXICAL_FORM) === '1');
assert($l1->equals($l2));
assert($l1->getValue(CAST_NONE) === 1);
assert($l2->getValue(CAST_NONE) === '1');
// but things go complicated once we reach the Dataset
$d = new Dataset();
$d->add($l1);
$d->add($l2);
assert(count($d) === 1); // $l1 equals $l2 so they can't create separate edges
assert(current($d)->getValue(CAST_LEXICAL_FORM) === '1'); // pretty obviously
assert(current($d)->getValue(CAST_NONE) === ???); // either 1 or '1' - no one can tell
And since we a) don't know what to expect from CAST_NONE when a literal is inside a dataset and b) datasets are the main use case (I don't expect literals to be widely used outside the dataset context), I don't see much sense in requiring implementations to provide this kind of value cast.
Another (admittedly very subjective) issue is that requesting implementations to be able to distinguish RDF-non-equal literals makes it impossible to implement immediate literals comparison with the help of singleton literals - an optimization I definitely wanted to use.
A solution to both these problems would be to:
- Allow the Literal constructor to recognize PHP scalar types (int/float/bool) and set the datatype accordingly (unless the datatype is set explicitly; by the way it's what EasyRdf Resource::addLiteral() does).
- Leave Literal::CAST_DATATYPE up to implementations (as it's described now).
- Drop the Literal::CAST_NONE cast type as we don't have any clear idea about how it should work now. Btw I would leave the $cast parameter as int allowing implementations to provide their own "cast types" if they want.
Such a change doesn't affect the Literal interface API but affects implementations by changing some tests.
Anyway it's an important design decision and it should be made carefully so please take your time (and try to experiment with various variants when you have doubts). We are not in a hurry.
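A rough sketch of what automatic scalar type recognition could look like, assuming PHP 8.0 or newer. The helper name guessLiteral() and the exact PHP-type-to-XSD mapping are assumptions made for illustration; they are not taken from rdfInterface, rdfHelpers or EasyRdf:

```php
<?php
declare(strict_types=1);

const XSD = 'http://www.w3.org/2001/XMLSchema#';

/**
 * Hypothetical helper: returns [lexical form, datatype IRI] for a PHP value.
 * An explicitly passed $datatype always wins over the detected one.
 */
function guessLiteral(string|int|float|bool|\Stringable $value, ?string $datatype = null): array
{
    if ($datatype === null) {
        // Assumed mapping of PHP scalar types to XSD datatypes.
        $datatype = match (true) {
            is_bool($value) => XSD . 'boolean',
            is_int($value) => XSD . 'integer',
            is_float($value) => XSD . 'double',
            default => XSD . 'string',
        };
    }
    // (string) true is "1" and (string) false is "", so booleans are spelled out.
    $lexical = is_bool($value) ? ($value ? 'true' : 'false') : (string) $value;
    return [$lexical, $datatype];
}

[$lex1, $dt1] = guessLiteral(1);
[$lex2, $dt2] = guessLiteral('1');
var_dump($lex1 === $lex2); // bool(true): same lexical form "1"
var_dump($dt1 === $dt2);   // bool(false): xsd:integer vs xsd:string
```

Under this approach a literal built from 1 and a literal built from '1' share a lexical form but differ in datatype, so they are not equal in RDF terms.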
Maybe I didn't fully understand your points but it seems over-engineered.
I created a branch called issue-15-all-values-strings and added basic implementations of our interfaces (a few from my in-memory store and DataSet from simpleRdf) to have something to test. It may act as a playground and we can discuss things using testable examples. The following test https://github.com/sweetrdf/rdfInterface/blob/feature/issue-15-all-values-strings/tests/PlaygroundTest.php contains (most of your) test-code from above.
When running this test (using vendor/bin/phpunit tests/) it fails with:
1) rdfInterface\tests\PlaygroundTest::test
Failed asserting that 2 matches expected 1.
/var/www/html/rdfInterface/tests/PlaygroundTest.php:32
At that line the number of quads in DataSet instance is checked:
$this->assertEquals(1, count($d));
So far so good. In my implementations only string and Stringable are allowed as Literal values. After testing a little I think we have to deal with issues on 2 levels:
For (1): Literal values are saved as a string in PHP. The user may provide a datatype. When receiving a value from a Literal instance one can decide if the raw string is enough or a cast is required (string => bool). The latter is based on a previously given datatype URI (e.g. xsd:boolean). If none was given, it should return the value as it is (=string). If I understand the specification correctly it should be the lexical form approach using a Unicode string.
For (2): At first, your DataSet implementation is very complex and there are a lot of functions which I see rather optional and use-case dependent. As far as I can tell, you have an iterator implementation mixed with matching functions, map-reduce and Store-like functions. DataSet interface as it is right now forces this kind of implementation, but I am not sure if that is even needed.
When following the Literal-values-are-strings: A DataSet implementation should consider '1' and 1 as equal, if no type is given (because both will be casted to string). If types are given, it should cast them based on type and compare by ===. This way $this->assertTrue(count($d) === 1); is solved IMHO.
Last two cases...
$this->assertTrue(current($d)->getObject()->getValue() === '1');
$this->assertTrue(current($d)->getObject()->getValue() === 1);
... can't be an issue if data set ignores one of the quads because it has it already. Or you have to provide datatypes too and require it when calling getValue like getValue(true) (to require a cast based on datatype).
I am not sure how to "solve" this issue. IMHO we should "solve" existing "issues" on triple/quad level first, like connecting different parser implementations for instance. Both generate a QuadIterator instance (how they do that is up to them) and we compare. Also parser from vendor A generates a QuadIterator and serializer from vendor B serializes it. Your quickRdf IO seems to be a good practice range for that.
Ad 1 I agree on the getValue() behavior. But I would like to keep the ability to pass PHP int/float/bool with automatic type recognition as described here. Btw the type autodetection code can be placed into RdfHelpers so people don't need to always copy-paste it (or write from scratch).
When following the Literal-values-are-strings: A DataSet implementation should consider '1' and 1 as equal, if no type is given (because both will be casted to string). If types are given, it should cast them based on type and compare by ===. This way $this->assertTrue(count($d) === 1); is solved IMHO.
My proposal above also solves it, just in a little different way. As Literal(1) will be automatically promoted to Literal("1", null, "xsd:integer"), it will be clearly different from Literal("1") (being automatically promoted to Literal("1", null, "xsd:string")). Of course then we expect 2 in this test.
Summing up it seems we agree on the "literal value output and literal compare side" and we only need to agree if we want or not the "automatic int/bool/float promotion" behavior. I would like to keep it as a) I like the way it works in EasyRdf b) I expect automatic type detection to pop up sooner or later for Stringables (especially for date/time ones) and having autodetection for scalars would keep it consistent.
Ad 2
- QuadIterator.
- DatasetCompare, DatasetMapReduce) making them optional and we can consider extracting a few others to separate interface(s). Please propose something.
Side issue - please don't include any implementation in this repository. Please just create a repository for your implementation in the same way it works for simpleRdf and quickRdf. SimpleRdf and quickRdf show also:
- DefaultGraph implementation provided by rdfHelpers).
Quick note: I added these implementations temporarily to have example code to test and discuss, because simpleRdf depends on rdfInterface already. Because implementations depend on interfaces I had to change code in this repo too (to reflect parameters only being string and Stringable), not just implementation details. As far as I know that is only possible if you have all in one place (otherwise it's too much overhead when doing it in two or more repositories).
Just create your own branch and use https://getcomposer.org/doc/articles/versions.md#branches while specifying your implementation package dependency on the rdfInterface. That's what I do now while experimenting with quite major (but probably not affecting you much) changes in branch termCompare.
(btw, wow, composer 2 checks out dev-branchName immediately while in composer 1 it took minutes)
I am fine with either way. My wish would be to have a simple way of data exchange between libraries (e.g. store, parser, serializer). That seems to be solved on quad level as of now. Parsers etc. are out of my reach currently (no active projects) and I may only contribute on a theoretical level.
You wrote:
I would like to keep it as a) I like the way it works in EasyRdf b) I expect automatic type detection to pop up sooner or later for Stringables (especially for date/time ones) and having autodetection for scalars would keep it consistent.
Things like type detection etc. might be too detailed because they force/assume certain implementation details. I have to check the EasyRdf approach, but maybe you can make a rough PR which reflects your thoughts?
By the way, if your library transforms something to Terms:
It would be nice if you don't enforce a particular Terms implementation but let the user choose (see quickRdfIo). Of course you can still provide a default implementation (see example below; of course it's fake but I hope it's clear).
    use rdfInterface\DataFactory;

    class mySparqlExecutor {
        private DataFactory $dataFactory;

        public function __construct(?DataFactory $dataFactory = null) {
            // fall back to a default implementation when none is injected
            $this->dataFactory = $dataFactory ?? new \simpleRdf\DataFactory();
        }

        public function executeSparql(string $query) {
            $rawResults = []; // (... gather them somehow ...)
            foreach ($rawResults as $row) {
                foreach ($row as $var) {
                    if ($var['type'] === 'literal') {
                        // terms are created by whichever DataFactory the user chose
                        $term = $this->dataFactory::literal($var['value'], $var['lang'], $var['datatype']);
                    } // (...)
                }
            }
            // (...)
        }
    }
It is useless to implement Terms (BlankNode/NamedNode/Literal/DefaultGraph/Quad) on your own.
At the time I needed these Term implementations there wasn't a release of simpleRdf available. I didn't know if you would make drastic changes at that time so I decided to implement it myself. Also saw it as a good practice to see if our interfaces work as expected. I might switch to simpleRdf, but isn't it the beauty of rdfInterface that everyone can have its own implementation but is still compatible to each other?
EDIT: Just saw that quickRdf also has Term implementations (thought it was only for parsers etc.). Only looked in simpleRdf.
but isn't it the beauty of rdfInterface that everyone can have its own implementation but is still compatible to each other?
Yes it is. But the beauty of rdfInterface is also you can reuse others work and everything stays compatible :-)
At the time I needed these Term implementations there wasn't a release of simpleRdf available. I didn't know if you would make drastic changes at that time so I decided to implement it myself. Also saw it as a good practice to see if our interfaces work as expected.
Fair points. I hope everything should stabilize soon. I'm slowly running out of new ideas ;-)
Just saw that quickRdf also has Term implementations
quickRdf is my state of the art implementation of terms and the dataset. In fact my motivation to develop simpleRdf was mainly to have another implementation of terms and dataset so I can check its interoperability with quickRdf :-) (quickRdf uses some tricks to speed things up and it was a good question if it will still work once "foreign" Terms are dropped on it)
I hope I will shortly prepare a performance comparison with EasyRdf. Initial tests suggest it's significantly faster in simple filtering by subject/predicate/object and consumes significantly less memory.
Side note - your implementation of the equals() method in all terms classes is wrong as it will return true only if the compared term is of the same class as your object (and has same property values) while you should be able to correctly compare against term from other rdfInterface implementations as well.
Long story short - when you implement interface methods you must rely only on methods guaranteed by this interface and you can't depend on an object's internal structure (as you have no impact on the internal structure of other interface implementations).
Yet another reason not to implement terms by yourself ;-)
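To illustrate the point about relying only on interface methods, here is a small self-contained sketch, assuming PHP 8.0 or newer. LiteralLike is a hypothetical, cut-down stand-in for rdfInterface\Literal, and the two "vendor" classes are made up; the only thing they share is the interface:

```php
<?php
declare(strict_types=1);

// Hypothetical, cut-down stand-in for rdfInterface\Literal.
interface LiteralLike
{
    public function getValue(): string;
    public function getDatatype(): string;
    public function getLang(): ?string;
    public function equals(LiteralLike $other): bool;
}

// Shared comparison that calls only the getters guaranteed by the interface.
trait EqualsViaGetters
{
    public function equals(LiteralLike $other): bool
    {
        return $this->getValue() === $other->getValue()
            && $this->getDatatype() === $other->getDatatype()
            && $this->getLang() === $other->getLang();
    }
}

// "Vendor A" keeps one property per field.
final class VendorALiteral implements LiteralLike
{
    use EqualsViaGetters;

    public function __construct(
        private string $value,
        private string $datatype,
        private ?string $lang = null
    ) {
    }

    public function getValue(): string { return $this->value; }
    public function getDatatype(): string { return $this->datatype; }
    public function getLang(): ?string { return $this->lang; }
}

// "Vendor B" keeps everything in a single array.
final class VendorBLiteral implements LiteralLike
{
    use EqualsViaGetters;

    /** @var array{0: string, 1: string, 2: ?string} */
    private array $fields;

    public function __construct(string $value, string $datatype, ?string $lang = null)
    {
        $this->fields = [$value, $datatype, $lang];
    }

    public function getValue(): string { return $this->fields[0]; }
    public function getDatatype(): string { return $this->fields[1]; }
    public function getLang(): ?string { return $this->fields[2]; }
}

$xsdInt = 'http://www.w3.org/2001/XMLSchema#int';
$a = new VendorALiteral('1', $xsdInt);
$b = new VendorBLiteral('1', $xsdInt);
var_dump($a->equals($b)); // bool(true)
var_dump($b->equals($a)); // bool(true)
```

A comparison written as $other instanceof self, or one that reads $other's private properties, would report these two as different even though they denote the same RDF term.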
We should decide and include in the test suite tests for:
From the RDF specification perspective (https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal):
- xsd:boolean and lexical form of 1 has value of true)
- http://www.w3.org/1999/02/22-rdf-syntax-ns#langString
- http://www.w3.org/2001/XMLSchema#string
- http://www.w3.org/1999/02/22-rdf-syntax-ns#langString have a lang tag, literals of other type don't
- "1"^^xsd:int and "01"^^xsd:int aren't equal in RDF terms.
According to that (language tags skipped as they are always missing here):
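| comparison | left lexical form | right lexical form | left datatype | right datatype | equal |
|---|---|---|---|---|---|
| $l1 == $l2 | "1" | "1" | xsd:string | xsd:string | true |
| $l2 == $l3 | "1" | "1" | xsd:string | xsd:int | false |
| $l3 == $l4 | "1" | "1" | xsd:int | xsd:int | true |
| $l3 == $l5 | "1" | "1" | xsd:int | xsd:int | true |
| $l4 == $l5 | "1" | "01" | xsd:int | xsd:int | false |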
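The comparison rule behind these results can be sketched in a few lines of PHP. The function name literalsEqual() is hypothetical, and language tags are left out just like in the listing above:

```php
<?php
declare(strict_types=1);

const XSD_NS = 'http://www.w3.org/2001/XMLSchema#';

// RDF term equality without language tags: both the lexical forms and the
// datatype IRIs must match character by character. No value-space
// normalisation is done, so "1" and "01" stay different.
function literalsEqual(string $lex1, string $dt1, string $lex2, string $dt2): bool
{
    return $lex1 === $lex2 && $dt1 === $dt2;
}

// same lexical form, same datatype
var_dump(literalsEqual('1', XSD_NS . 'string', '1', XSD_NS . 'string')); // bool(true)
// same lexical form, different datatype
var_dump(literalsEqual('1', XSD_NS . 'string', '1', XSD_NS . 'int'));    // bool(false)
// same datatype, different lexical form
var_dump(literalsEqual('1', XSD_NS . 'int', '01', XSD_NS . 'int'));      // bool(false)
```

Tests in the suite could follow the same pattern, with one assertion per combination of lexical form and datatype.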