Open GoogleCodeExporter opened 8 years ago
This is a slightly tangled issue and a long-ish post, sorry about that.
Initially concentrating on issue200:
"When rdflib is used in an application that forks the current Python process, for
example when using flup.server.*_fork, BNode's value generation in these
processes: share the same _prefix and use independent serial number generators
that start with the same value"
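A minimal sketch of the behaviour quoted above, assuming a POSIX system where os.fork() is available. PREFIX, next_id() and first_id_in_child() are hypothetical stand-ins that only mimic "a prefix computed once plus a serial counter"; they are not rdflib's actual code.

```python
import itertools
import os
import random

# Hypothetical stand-in for a module-level "prefix + serial number" scheme.
# This is NOT rdflib's actual code.
PREFIX = "b%08x" % random.getrandbits(32)   # computed once, before any fork
_serial = itertools.count(1)


def next_id():
    """Return the next identifier built from the prefix and the counter."""
    return "%s%d" % (PREFIX, next(_serial))


def first_id_in_child():
    """Fork, let the child generate one id, and hand it back to the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()                          # POSIX only
    if pid == 0:
        # Child: it inherited a copy of PREFIX and of the counter state.
        try:
            os.close(read_fd)
            os.write(write_fd, next_id().encode("ascii"))
            os.close(write_fd)
        finally:
            os._exit(0)                      # never fall through into parent code
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_id = pipe.read()
    os.waitpid(pid, 0)
    return child_id


child_id = first_id_in_child()
parent_id = next_id()
# Both processes start from the same prefix and the same counter value,
# so they hand out the same identifier.
print(parent_id == child_id)
```

The child's counter is a copy of the parent's, so advancing it in the child leaves the parent's untouched, and both produce the same first id.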
What's "wrong" about that?
1. It violated at least two users' expectations (the OP's and, I find, mine
too).
2. bnode identifiers themselves are outside the RDF spec and Gunnar is correct
to identify a documentation issue here - although some preparations have already
been made:
i) http://rdflib.readthedocs.org/en/latest/howto.html#merging-graphs
ii) http://rdfextras.readthedocs.org/en/latest/store/bnode_drama.html
iii) http://rdflib.readthedocs.org/en/latest/graphs_bnodes.html
the last of which includes URLs for two relevant and highly illuminating posts by Pat
Hayes:
http://www.ihmc.us/users/phayes/RDFGraphSyntax.html
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0153.html
Drifting slightly away to the consequences of two graphs having the same bnode
id but for different statements...
Gunnar observes:
+ the current behaviour is also useful in many settings. If you have
+ an application that throws graphs around, you probably DO want
+ bnode IDs to be stable and remain the same in all graphs.
+ Also, what exactly is the identity of a graph, i.e. when would we
+ want to trigger the "make sure all bnode IDs are unique code". If
+ two graphs are part of the same ConjunctiveGraph they are probably
+ NOT different "enough".
Just for info, I'll observe that bnode ids in rdflib-parsed graphs are
"standardized apart" by default. So it's just the set-theoretic operators which
are characterisable as "naive", however ...
There is a solution which seems to tick all the boxes - at least for graphs
generated and serialized by rdflib.
The extant code [1] for generating bnode ids dates back to 2005, prior to the
introduction of the uuid module (in Python 2.5). Given that the extant code
attempts to generate a "(hopefully) unique prefix", we might usefully switch to
using uuid (and faking one for Python 2.4).
Using uuid.uuid4() to generate bnode ids would enormously reduce the
probability of bnode id collisions between (rdflib-generated) graphs and they
could be confidently processed by naive set-theoretic operators, no
Skolemization required.
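A minimal sketch of what uuid-based bnode ids could look like. The helper name make_bnode_id() and the leading "N" are assumptions made for illustration, not rdflib's API; each "graph" is reduced to the set of its bnode ids so that the naive union can be shown directly.

```python
import uuid


def make_bnode_id():
    """Hypothetical helper: a bnode id built from a random (version 4) UUID."""
    # The leading letter is an assumption, added so the id never starts
    # with a digit; the 32 hex characters come from uuid.uuid4().
    return "N" + uuid.uuid4().hex


# Two independently generated "graphs", reduced to their sets of bnode ids.
ids_a = {make_bnode_id() for _ in range(1000)}
ids_b = {make_bnode_id() for _ in range(1000)}

# A naive union keeps every id apart unless two random UUIDs collide.
print(len(ids_a | ids_b))
print(ids_a.isdisjoint(ids_b))
```

Each id in this sketch is 33 characters long, which is where the extra storage cost comes from.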
There'd be an associated cost in terms of an increase in storage space
requirements but I feel that's worth the gain in robustness.
Returning to issue200:
Using bnode ids based on uuid.uuid4() would also obviate any necessity to go
mucking about with re-seeding the random seed in forked processes as it fixes
issue200 as a side-effect. This was demonstrated in the investigative tests
that I've been repeatedly committing, apologies for that. I discovered that
tests of my putative solution were all passing on 32-bit machines, but
I was seeing some failures on 64-bit machines.
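A minimal sketch of the fork case with uuid-based ids, assuming a POSIX system and a current CPython, where uuid.uuid4() draws its bytes from os.urandom() rather than from state inherited across the fork. uuid_ids_from_children() is a hypothetical helper, and the sketch does not attempt to reproduce the 32-bit/64-bit difference mentioned above.

```python
import os
import uuid


def uuid_ids_from_children(n_children=3):
    """Fork n_children processes; each reports one uuid4-based id."""
    ids = []
    for _ in range(n_children):
        read_fd, write_fd = os.pipe()
        pid = os.fork()                      # POSIX only
        if pid == 0:
            # Child: generate one id, send it to the parent, then exit.
            try:
                os.close(read_fd)
                os.write(write_fd, uuid.uuid4().hex.encode("ascii"))
                os.close(write_fd)
            finally:
                os._exit(0)
        os.close(write_fd)
        with os.fdopen(read_fd) as pipe:
            ids.append(pipe.read())
        os.waitpid(pid, 0)
    return ids


child_ids = uuid_ids_from_children()
all_ids = child_ids + [uuid.uuid4().hex]     # add one id from the parent
# True when no id was repeated across the parent and its children.
print(len(set(all_ids)) == len(all_ids))
```

No re-seeding step appears anywhere in the sketch: the children do nothing special after the fork.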
[1] http://code.google.com/p/rdflib/source/browse/rdflib/term.py#180
My vote is: yes, it is a documentation issue, but we could ameliorate some of the
practical issues and at the same time improve the codebase by replacing the
bnode id generating code with the uuid module from the standard library.
Cheers,
Graham
Original comment by gjhigg...@gmail.com
on 27 Jan 2012 at 8:21
Original issue reported on code.google.com by
gromgull
on 27 Jan 2012 at 9:34