linkeddata / rdflib.js

Linked Data API for JavaScript
http://linkeddata.github.io/rdflib.js/doc/

Merging graphs with blank nodes #405

Open th1j5 opened 4 years ago

th1j5 commented 4 years ago

I'm using rdflib in a project to update an existing graph with new data. The new data carries some overhead: a given device, such as 'device1', may already exist in the old graph, but because both graphs use the same IRIs for it, the statements refer to the same thing after merging. We use blank nodes to represent the individual measurements. Old graph:

@prefix ns0: <https://florsanders.inrupt.net/public/ontologies/omalwm2m.ttl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://basisLeshan.com/device1/3303/0>
  a ns0:ObjectInstance;
  ns0:consistsOf [
    a ns0:5700, ns0:ResourceInstance ;
    ns0:hasTimeStamp "2020-04-16T08:24:03.755Z"^^xsd:dateTime ;
    ns0:hasValue "-2.5"^^xsd:float ;
    ns0:organizedInto <http://basisLeshan.com/device1/3303/0>
  ] ; 
  ns0:containedBy <http://basisLeshan.com/device1> .

<http://basisLeshan.com/device1>
  a ns0:Device ;
  ns0:contains <http://basisLeshan.com/device1/3303/0> .

New graph:

@prefix ns0: <https://florsanders.inrupt.net/public/ontologies/omalwm2m.ttl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://basisLeshan.com/device1/3303/0>
  a ns0:ObjectInstance;
  ns0:consistsOf [
    a ns0:5700, ns0:ResourceInstance ;
    ns0:hasTimeStamp "2020-04-16T08:26:32.988Z"^^xsd:dateTime ;
    ns0:hasValue "-5.1"^^xsd:float ;
    ns0:organizedInto <http://basisLeshan.com/device1/3303/0>
  ] ; 
  ns0:containedBy <http://basisLeshan.com/device1> .

<http://basisLeshan.com/device1>
  a ns0:Device ;
  ns0:contains <http://basisLeshan.com/device1/3303/0> .

(The only differences are the value and the timestamp.)

When I merge them, I would expect these 2 blank nodes to be kept separate, because nothing suggests they are the same node. This is also the behavior of rdflib in Python.

Thus:

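#!/usr/bin/env python3
# https://rdflib.readthedocs.io/en/stable/merging.html
from rdflib import Graph

g = Graph()

# Parsing both files into the same Graph merges them; rdflib gives the
# blank nodes of each parse call their own identifiers.
g.parse('old_graph.ttl', format='turtle')
g.parse('new_graph.ttl', format='turtle')
g.serialize('out.ttl', format='turtle')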

gives:

@prefix ns0: <https://florsanders.inrupt.net/public/ontologies/omalwm2m.ttl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://basisLeshan.com/device1> a ns0:Device ;
    ns0:contains <http://basisLeshan.com/device1/3303/0> .

<http://basisLeshan.com/device1/3303/0> a ns0:ObjectInstance ;
    ns0:consistsOf [ a <https://florsanders.inrupt.net/public/ontologies/omalwm2m.ttl#5700>,
                ns0:ResourceInstance ;
            ns0:hasTimeStamp "2020-04-16T08:24:03.755000+00:00"^^xsd:dateTime ;
            ns0:hasValue "-2.5"^^xsd:float ;
            ns0:organizedInto <http://basisLeshan.com/device1/3303/0> ],
        [ a <https://florsanders.inrupt.net/public/ontologies/omalwm2m.ttl#5700>,
                ns0:ResourceInstance ;
            ns0:hasTimeStamp "2020-04-16T08:26:32.988000+00:00"^^xsd:dateTime ;
            ns0:hasValue "-5.1"^^xsd:float ;
            ns0:organizedInto <http://basisLeshan.com/device1/3303/0> ] ; 
    ns0:containedBy <http://basisLeshan.com/device1> .

This yields 2 separate blank nodes, as expected, and in line with the specs (if I understand them correctly):

Implementations that handle blank node identifiers in concrete syntaxes need to be careful not to create the same blank node from multiple occurrences of the same blank node identifier except in situations where this is supported by the syntax.
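To make the expected behavior concrete, here is a minimal, library-independent sketch of a merge that renames blank node labels per source document before taking the union. It does not use rdflib.js; the triple representation (arrays of strings, with "_:" marking a blank node) and the names standardizeApart and mergeGraphs are assumptions made up for illustration.

```javascript
// Library-independent sketch of an RDF merge. Terms are plain strings;
// by convention here, a string starting with "_:" is a blank node label.
// Nothing below is rdflib.js API; all names are hypothetical.

let nextSourceId = 0;

// Return a copy of `triples` whose blank node labels are made unique to this
// source, so equal labels from different documents cannot collide.
function standardizeApart(triples) {
  const sourceId = nextSourceId++;
  const rename = (term) =>
    term.startsWith('_:') ? `_:s${sourceId}_${term.slice(2)}` : term;
  return triples.map(([s, p, o]) => [rename(s), p, rename(o)]);
}

// Merge any number of graphs: rename blank nodes per graph, then take the
// set union (identical triples are kept once).
function mergeGraphs(...graphs) {
  const seen = new Map();
  for (const graph of graphs) {
    for (const triple of standardizeApart(graph)) {
      seen.set(JSON.stringify(triple), triple);
    }
  }
  return [...seen.values()];
}

const instance = 'http://basisLeshan.com/device1/3303/0';
const oldGraph = [
  [instance, 'ns0:consistsOf', '_:b0'],
  ['_:b0', 'ns0:hasValue', '"-2.5"'],
];
const newGraph = [
  [instance, 'ns0:consistsOf', '_:b0'], // same label, different document
  ['_:b0', 'ns0:hasValue', '"-5.1"'],
];

const merged = mergeGraphs(oldGraph, newGraph);
const blankSubjects = new Set(
  merged.map(([s]) => s).filter((s) => s.startsWith('_:'))
);
console.log(blankSubjects.size); // 2
```

Even though both inputs use the label _:b0, the merged result keeps two distinct measurement nodes, which is the outcome the Python output above shows.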

When I use rdflib.js, however, these blank nodes get merged into 1:

const $rdf = require('rdflib');
const store = $rdf.graph();
// old_graph and new_graph hold the two Turtle documents above as strings
$rdf.parse(old_graph, store, 'https://www.example.com/', 'text/turtle');
$rdf.parse(new_graph, store, 'https://www.example.com/', 'text/turtle');
console.log($rdf.serialize(null, store, 'https://www.example.com/', 'text/turtle'));

As you can see here:

@prefix : <#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix b: <http://basisLeshan.com/>.
@prefix om: <https://florsanders.inrupt.net/public/ontologies/omalwm2m.ttl#>.
@prefix n0: <http://basisLeshan.com/device1/3303/>.

b:thijs-Galago-Pro
a om:Device; om:contains n0:0.
n0:0
    a om:ObjectInstance;
    om:consistsOf
            [
                a om:5700, om:ResourceInstance;
                om:hasTimeStamp
                    "2020-04-16T08:24:03.755Z"^^xsd:dateTime,
                    "2020-04-16T08:26:32.988Z"^^xsd:dateTime;
                om:hasValue "-2.5"^^xsd:float, "-5.1"^^xsd:float;
                om:organizedInto n0:0
            ];
    om:containedBy b:device1.

To my understanding, this does not conform to the RDF specs. So, is this a bug, or is there another way this should be done? If anything would help clarify my question, just ask!

th1j5 commented 4 years ago

Maybe worse: if I say the old and new graphs come from different documents (see code):

const $rdf = require('rdflib');
const store = $rdf.graph();
$rdf.parse(old_graph, store, 'https://www.example.com/1', 'text/turtle');
$rdf.parse(new_graph, store, 'https://www.example.com/2', 'text/turtle');
console.log($rdf.serialize(null, store, 'https://www.example.com/', 'text/turtle'));

I get this:

@prefix : </#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix b: <http://basisLeshan.com/>.
@prefix om: <https://florsanders.inrupt.net/public/ontologies/omalwm2m.ttl#>.
@prefix n0: <http://basisLeshan.com/device1/3303/>.

b:device1
    a om:Device, om:Device;
    om:contains n0:0, n0:0.
n0:0
    a om:ObjectInstance, om:ObjectInstance;
    om:consistsOf _:_g_L5C354, _:_g_L5C354;
    om:containedBy b:device1, b:device1.
_:_g_L5C354
    a om:5700, om:5700, om:ResourceInstance, om:ResourceInstance;
    om:hasTimeStamp
        "2020-04-16T08:24:03.755Z"^^xsd:dateTime,
        "2020-04-16T08:26:32.988Z"^^xsd:dateTime;
    om:hasValue "-2.5"^^xsd:float, "-5.1"^^xsd:float;
    om:organizedInto n0:0, n0:0.

This means that all triples are doubled (see, for example, b:device1 a om:Device, om:Device;).

But the blank node is still merged! I don't think this is how it should behave.
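One possible reading of the doubled triples, offered as an assumption rather than something confirmed in this thread: if the store records which document each statement came from, the same triple parsed under two base URIs is held as two entries, and a serializer that ignores the document prints both. The sketch below uses plain objects, not rdflib.js statements, and the helper name uniqueTriples is hypothetical; it only shows how dropping the document component collapses such duplicates.

```javascript
// Sketch only: quads are plain objects, not rdflib.js statements.
// `graph` stands in for the document a triple was parsed from.
function uniqueTriples(quads) {
  const seen = new Set();
  const result = [];
  for (const { subject, predicate, object } of quads) {
    // Key on subject, predicate and object only; the source document is ignored.
    const key = JSON.stringify([subject, predicate, object]);
    if (!seen.has(key)) {
      seen.add(key);
      result.push({ subject, predicate, object });
    }
  }
  return result;
}

const quads = [
  { subject: 'b:device1', predicate: 'rdf:type', object: 'om:Device', graph: 'https://www.example.com/1' },
  { subject: 'b:device1', predicate: 'rdf:type', object: 'om:Device', graph: 'https://www.example.com/2' },
  { subject: 'b:device1', predicate: 'om:contains', object: 'n0:0', graph: 'https://www.example.com/1' },
];

const triples = uniqueTriples(quads);
console.log(triples.length); // 2
```

This would only tidy up the repeated statements; it does nothing about the two measurements sharing a single blank node.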

RinkeHoekstra commented 2 years ago

I'm hitting the same problem. What is the best way to work around this?

RinkeHoekstra commented 2 years ago

This update in BlankNode appears to be the culprit:

https://github.com/linkeddata/rdflib.js/blob/c14dfd57d5159ad5ac1ee2523cc7924968e24f53/src/blank-node.ts#L35

I think that the abstract nextId counter is not synchronised across class instances when data is loaded asynchronously?
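As a way to reason about where the collision might come from, the label _:_g_L5C354 in the serialized output above looks as if it could encode a line and column. That is a guess, not something verified against the rdflib.js source. If labels were derived from position alone, two documents with the same layout would produce the same label, whereas a counter shared by all parse runs would not. The two generators below are hypothetical and exist only to contrast the two schemes.

```javascript
// Hypothetical label generators; neither is taken from the rdflib.js source.

// Position-derived: the label depends only on where the blank node sits in
// the text, so two documents with the same layout produce the same label.
function positionLabel(line, column) {
  return `_:_g_L${line}C${column}`;
}

// Counter-derived: one counter shared by every parse run, so each call
// yields a label that has not been handed out before.
function makeCounterLabeller() {
  let nextId = 0;
  return () => `_:n${nextId++}`;
}

const fromOld = positionLabel(5, 354);
const fromNew = positionLabel(5, 354);
console.log(fromOld === fromNew); // true

const freshLabel = makeCounterLabeller();
const first = freshLabel();
const second = freshLabel();
console.log(first === second); // false
```

Under the position-based scheme, the old and new graphs in this issue would collide simply because their blank nodes sit at the same place in both documents.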