Graph set-operations and bnodes

When discussing Issue 200 we came across this: 

In theory, bnode IDs are only valid inside a single graph. That is, any merging 
of graphs, serialization/parsing, or SPARQL round-tripping should at least: 
1. make sure all bnode IDs are unique, and maybe 2. canonicalize the bnode IDs 
and solve the fun graph-isomorphism problem. 

This would affect the set-theoretic operations on graphs (__add__, __iadd__, 
etc.), and probably other things.
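
To make the concern concrete, here is a minimal sketch using a toy model rather than rdflib itself: graphs are plain Python sets of (subject, predicate, object) tuples, and any string starting with "_:" is treated as a bnode label. The helper names (is_bnode, standardize_apart) are made up for illustration and are not part of rdflib.

```python
import uuid


def is_bnode(term):
    # Toy convention (an assumption): a blank node is a string starting with "_:".
    return isinstance(term, str) and term.startswith("_:")


def standardize_apart(graph):
    """Return a copy of graph with every bnode label replaced by a fresh one."""
    fresh = {}  # old label -> new label, scoped to this one graph

    def rename(term):
        if not is_bnode(term):
            return term
        if term not in fresh:
            fresh[term] = "_:" + uuid.uuid4().hex
        return fresh[term]

    return {(rename(s), p, rename(o)) for s, p, o in graph}


# Two unrelated graphs that happen to use the same bnode label.
g1 = {("_:b1", "name", "Alice")}
g2 = {("_:b1", "name", "Bob")}

# Naive set union: both statements end up attached to one node.
naive = g1 | g2
naive_subjects = {s for s, _, _ in naive}

# Renaming per graph before the union keeps the two nodes distinct.
merged = standardize_apart(g1) | standardize_apart(g2)
merged_subjects = {s for s, _, _ in merged}

print(len(naive_subjects))   # 1
print(len(merged_subjects))  # 2
```

The naive union is exactly what you want if both graphs really are talking about the same node, which is the trade-off discussed next.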

On the other hand - the current behaviour is also useful in many settings. If 
you have an application that throws graphs around, you probably DO want bnode 
IDs to be stable and remain the same in all graphs. 
Also, what exactly is the identity of a graph, i.e. when would we want to 
trigger the "make sure all bnode IDs are unique" code? If two graphs are part 
of the same ConjunctiveGraph they are probably NOT different "enough". 

A semi-related issue is bnode IDs in SPARQL, where they did it "correctly" and 
bnode IDs are only valid inside a single result set, i.e. there is no way to 
do one query, get some bnodes back, then query for more information about them 
(although many proprietary SPARQL extensions for this exist). 

I vote we change nothing, but document the issue in the docstrings for 
__add__ etc., and add a warning that bnodes are handled "naively".

Original issue reported on code.google.com by gromgull on 27 Jan 2012 at 9:34

This is a slightly tangled issue and this is something of a long-ish post, 
sorry about that.

Initially concentrating on issue200:

"When rdflib is used in a application that fork the current Python process, for 
example when using flup.server.*_fork, BNode's value generation in these 
processes: share the same _prefix and use independent serial number generators 
that start with the same value"

What's "wrong" about that?

1. It violated at least two users' expectations (the OP's and, I find, mine 
too).

2. bnode identifiers themselves are outside the RDF spec and Gunnar is correct 
to identify a documentation issue here - although some preparations have 
already been made:

i) http://rdflib.readthedocs.org/en/latest/howto.html#merging-graphs
ii) http://rdfextras.readthedocs.org/en/latest/store/bnode_drama.html
iii) http://rdflib.readthedocs.org/en/latest/graphs_bnodes.html

The last of these includes URLs for two relevant and highly illuminating posts 
by Pat Hayes:

 http://www.ihmc.us/users/phayes/RDFGraphSyntax.html
 http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0153.html

Drifting slightly away to the consequences of two graphs having the same bnode 
id but for different statements...

Gunnar observes:

+ the current behaviour is also useful in many settings. If you have 
+ an application that throws graphs around, you probably DO want 
+ bnode IDs to be stable and remain the same in all graphs. 
+ Also, what exactly is the identity of a graph, i.e. when would we 
+ want to trigger the "make sure all bnode IDs are unique" code? If 
+ two graphs are part of the same ConjunctiveGraph they are probably 
+ NOT different "enough".

Just for info, I'll observe that bnode ids in rdflib-parsed graphs are 
"standardized apart" by default. So it's just the set-theoretic operators which 
are characterisable as "naive", however ...

There is a solution which seems to tick all the boxes - at least for graphs 
generated and serialized by rdflib. 

The extant code [1] for generating bnode ids dates back to 2005, prior to the 
introduction of the uuid module (in Python 2.5). Given that the extant code 
attempts to generate a "(hopefully) unique prefix", we might usefully switch to 
using uuid (and faking one for Python 2.4).

Using uuid.uuid4() to generate bnode ids would enormously reduce the 
probability of bnode id collisions between (rdflib-generated) graphs and they 
could be confidently processed by naive set-theoretic operators, no 
Skolemization required.

There'd be an associated cost in terms of an increase in storage space 
requirements but I feel that's worth the gain in robustness.
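
As a rough sketch of what I have in mind (the helper name new_bnode_id and the "N" prefix are assumptions for illustration, not the extant rdflib code), generating ids from uuid.uuid4() could look like this. It also makes the storage cost visible: each label is 33 characters.

```python
import uuid


def new_bnode_id():
    # Hypothetical helper. The "N" prefix is an assumption, added so the
    # label never starts with a digit; the rest is the 32-char hex of a
    # random (version 4) UUID.
    return "N" + uuid.uuid4().hex


# Generate a batch of ids and check for collisions within the batch.
ids = [new_bnode_id() for _ in range(10000)]
unique_ids = set(ids)

print(len(ids[0]))                  # 33
print(len(unique_ids) == len(ids))  # True, barring an astronomically unlikely collision
```

Because every process draws its ids from the same huge random space, graphs built independently can be combined with plain set operations without the labels realistically clashing.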

Returning to issue200:

Using bnode ids based on uuid.uuid4() would also obviate any need to go 
mucking about with re-seeding the random number generator in forked processes, 
as it fixes issue200 as a side-effect. This was demonstrated in the 
investigative tests that I've been repeatedly committing, apologies for that. 
I discovered that tests of my putative solution were all passing on 32-bit 
machines but I was seeing some failures on 64-bit machines.
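
For anyone who wants to see the issue200 failure mode without actually forking, here is a small simulation. It is a sketch under stated assumptions: SerialBNodeIds is a made-up stand-in for a "prefix plus serial number" generator, and copy.deepcopy stands in for the way fork() duplicates in-process state into each child.

```python
import copy
import uuid


class SerialBNodeIds:
    """Toy generator (not rdflib's): a fixed prefix plus an incrementing serial."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.serial = 0

    def next_id(self):
        self.serial += 1
        return "%s%d" % (self.prefix, self.serial)


# The parent process creates the generator and uses it once.
parent = SerialBNodeIds("abcXYZ")
parent.next_id()

# Simulated fork: each child starts from an identical copy of the state.
child_a = copy.deepcopy(parent)
child_b = copy.deepcopy(parent)

serial_a = [child_a.next_id() for _ in range(3)]
serial_b = [child_b.next_id() for _ in range(3)]
print(serial_a == serial_b)  # True: both children hand out the same ids

# uuid4() takes its randomness from the operating system (os.urandom),
# so there is no copied in-process counter or seed to repeat.
uuid_a = [uuid.uuid4().hex for _ in range(3)]
uuid_b = [uuid.uuid4().hex for _ in range(3)]
print(set(uuid_a).isdisjoint(uuid_b))  # True
```

This only mimics the copied state; it says nothing about the 32-bit vs 64-bit failures mentioned above.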

[1] http://code.google.com/p/rdflib/source/browse/rdflib/term.py#180

My vote is: yes it is a documentation issue but we could ameliorate some of the 
practical issues and at the same time improve the codebase by replacing the 
bnode id generating code with the uuid module from the standard library.

Cheers,

Graham

Original comment by gjhigg...@gmail.com on 27 Jan 2012 at 8:21
