pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data

BNode scope when loading triples/quads into graph/dataset #1

Closed pchampin closed 4 years ago

pchampin commented 5 years ago

Problem

Currently, when inserting bnode identifiers in a graph, the bnode identifier is kept as is.

For example, loading this file into a graph:

_:b1 <tag:p> "foo".

then, loading this file into the same graph:

_:b1 <tag:q> "bar".

will result in the following graph (in Turtle):

  [] <tag:p> "foo"; <tag:q> "bar".

while it should be

  [] <tag:p> "foo".
  [] <tag:q> "bar".

i.e. two different subjects, because the bnode identifiers in the two different files have two different scopes.

NB: it is important for the developer to be able to handle bnodes consistently, so at the lowest level (e.g. Graph::insert), the API should consider bnode identifiers as stable. But on the other hand, the default behaviour when loading a file should be the correct one.
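
To make the problem concrete, here is a minimal sketch on a toy model. The `Term` enum, `load_raw` and `demo_raw_merge` are made up for illustration and are not Sophia's API; the only point is that keeping labels as they are merges the two `_:b1` nodes into a single subject.

```rust
use std::collections::BTreeSet;

/// Toy RDF term; a stand-in for illustration, not Sophia's term type.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Term {
    Iri(String),
    BNode(String),
    Literal(String),
}

type Triple = (Term, Term, Term);

/// Insert triples as they are, keeping blank node labels unchanged.
fn load_raw(graph: &mut BTreeSet<Triple>, file: Vec<Triple>) {
    graph.extend(file);
}

/// Count the distinct subjects in the graph.
fn distinct_subjects(graph: &BTreeSet<Triple>) -> usize {
    graph.iter().map(|(s, _, _)| s).collect::<BTreeSet<_>>().len()
}

/// Load the two example files into one graph and return the subject count.
fn demo_raw_merge() -> usize {
    let file1 = vec![(
        Term::BNode("b1".into()),
        Term::Iri("tag:p".into()),
        Term::Literal("foo".into()),
    )];
    let file2 = vec![(
        Term::BNode("b1".into()),
        Term::Iri("tag:q".into()),
        Term::Literal("bar".into()),
    )];
    let mut graph = BTreeSet::new();
    load_raw(&mut graph, file1);
    load_raw(&mut graph, file2);
    distinct_subjects(&graph)
}
```

In this toy model `demo_raw_merge()` yields a single subject, where the expected behaviour when loading two files would be two.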

Proposed solution

The methods TripleSource.in_graph and QuadSource.in_dataset are the preferred way of loading a stream of triples/quads (such as the one coming from a parser) into a graph/dataset.

The proposed solution is to change the semantics of these methods, and make them rename the bnodes they receive to avoid name clashes with existing bnodes in the graph/dataset. Whether this should be done by generating UUIDs or by inspecting the target graph/dataset for existing names, I'm not sure yet...

New methods in_graph_raw and in_dataset_raw (better name?) should probably be added, which would have the current semantics of in_graph and in_dataset.
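
As a rough sketch of the "inspect the target graph" option, assuming the same kind of toy model (none of these names exist in Sophia; `load_scoped` and `used_labels` are hypothetical helpers): each load collects the labels already present in the target, then maps every incoming label to one that is not yet used. The mapping lives for one load only, so a label repeated inside a file still denotes a single node.

```rust
use std::collections::{BTreeSet, HashMap};

// Toy model for illustration only; Sophia's real term and graph types differ.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Term {
    Iri(String),
    BNode(String),
    Literal(String),
}

type Triple = (Term, Term, Term);
type Graph = BTreeSet<Triple>;

/// Collect every blank node label already used in the graph.
fn used_labels(graph: &Graph) -> BTreeSet<String> {
    let mut labels = BTreeSet::new();
    for (s, p, o) in graph {
        for t in [s, p, o] {
            if let Term::BNode(label) = t {
                labels.insert(label.clone());
            }
        }
    }
    labels
}

/// Load one file, renaming each incoming blank node to a label
/// that is not yet used in the target graph (hypothetical helper).
fn load_scoped(graph: &mut Graph, file: Vec<Triple>) {
    let mut used = used_labels(graph);
    // Old label -> fresh label, valid for this load only.
    let mut renamed: HashMap<String, String> = HashMap::new();
    let mut counter = 0usize;
    let mut rename = |t: Term| match t {
        Term::BNode(old) => {
            let new = renamed.entry(old).or_insert_with(|| loop {
                let candidate = format!("b{}", counter);
                counter += 1;
                // `insert` returns true only if the label was not used yet.
                if used.insert(candidate.clone()) {
                    break candidate;
                }
            });
            Term::BNode(new.clone())
        }
        other => other,
    };
    for (s, p, o) in file {
        let triple = (rename(s), rename(p), rename(o));
        graph.insert(triple);
    }
}

/// Load the two example files and return the number of distinct subjects.
fn demo_scoped() -> usize {
    let file = |pred: &str, lit: &str| {
        vec![(
            Term::BNode("b1".into()),
            Term::Iri(pred.into()),
            Term::Literal(lit.into()),
        )]
    };
    let mut graph = Graph::new();
    load_scoped(&mut graph, file("tag:p", "foo"));
    load_scoped(&mut graph, file("tag:q", "bar"));
    graph.iter().map(|(s, _, _)| s).collect::<BTreeSet<_>>().len()
}
```

The trade-off in this sketch is that every load scans the whole target graph for labels; the UUID option would avoid the scan at the price of longer, opaque labels.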

pchampin commented 4 years ago

Proposed solution (revamped)

Thinking about it a little more, I think that the right place to fix this is in graph::Inserter and dataset::Inserter respectively.

There should actually be two implementations for each, RawInserter (current implementation) and ScopedInserter (which ensures that all blank nodes passed to the inserter will be created as fresh blank nodes, renaming them if needed). The methods Graph::inserter and Dataset::inserter should also be replaced accordingly by methods named raw_inserter and scoped_inserter.
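
A minimal sketch of what the two inserters could look like, on a toy model rather than Sophia's actual traits. The `Inserter` trait, the struct fields and the `load` helper are assumptions made for illustration. Here the scoped variant prefixes labels with a per-inserter number taken from a global counter, as a dependency-free stand-in for the UUID option.

```rust
use std::collections::BTreeSet;
use std::sync::atomic::{AtomicUsize, Ordering};

// Toy model for illustration only; Sophia's real term and graph types differ.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Term {
    Iri(String),
    BNode(String),
    Literal(String),
}

type Triple = (Term, Term, Term);
type Graph = BTreeSet<Triple>;

/// Hands out a distinct number to each scoped inserter (stand-in for a UUID).
static NEXT_SCOPE: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical common interface of both inserters.
trait Inserter {
    fn insert(&mut self, triple: Triple);
}

/// Keeps blank node labels as they are.
struct RawInserter<'a> {
    graph: &'a mut Graph,
}

/// Renames every blank node it receives into its own scope.
struct ScopedInserter<'a> {
    graph: &'a mut Graph,
    scope: usize,
}

impl<'a> ScopedInserter<'a> {
    fn new(graph: &'a mut Graph) -> Self {
        let scope = NEXT_SCOPE.fetch_add(1, Ordering::Relaxed);
        ScopedInserter { graph, scope }
    }

    /// Prefix blank node labels with the scope number; leave other terms alone.
    fn rename(&self, t: Term) -> Term {
        match t {
            Term::BNode(label) => Term::BNode(format!("s{}_{}", self.scope, label)),
            other => other,
        }
    }
}

impl Inserter for RawInserter<'_> {
    fn insert(&mut self, triple: Triple) {
        self.graph.insert(triple);
    }
}

impl Inserter for ScopedInserter<'_> {
    fn insert(&mut self, triple: Triple) {
        let (s, p, o) = triple;
        let renamed = (self.rename(s), self.rename(p), self.rename(o));
        self.graph.insert(renamed);
    }
}

/// Feed one file into one inserter (one inserter per file).
fn load(mut inserter: impl Inserter, file: Vec<Triple>) {
    for triple in file {
        inserter.insert(triple);
    }
}

fn distinct_subjects(graph: &Graph) -> usize {
    graph.iter().map(|(s, _, _)| s).collect::<BTreeSet<_>>().len()
}

/// Returns (subjects with raw inserters, subjects with scoped inserters).
fn demo_inserters() -> (usize, usize) {
    let file = |pred: &str, lit: &str| {
        vec![(
            Term::BNode("b1".into()),
            Term::Iri(pred.into()),
            Term::Literal(lit.into()),
        )]
    };
    let mut raw_graph = Graph::new();
    load(RawInserter { graph: &mut raw_graph }, file("tag:p", "foo"));
    load(RawInserter { graph: &mut raw_graph }, file("tag:q", "bar"));

    let mut scoped_graph = Graph::new();
    load(ScopedInserter::new(&mut scoped_graph), file("tag:p", "foo"));
    load(ScopedInserter::new(&mut scoped_graph), file("tag:q", "bar"));

    (distinct_subjects(&raw_graph), distinct_subjects(&scoped_graph))
}
```

Note that a prefix scheme like this only separates scoped inserters from each other; it could still clash with a label such as `s0_b1` inserted through a raw inserter, which is one reason real UUIDs or an inspection of the target graph may be preferable.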

The in_graph/in_dataset methods of sinks would use scoped inserters (as this is the most common use case). And I don't think (anymore) that a *_raw variant of these methods is required. If someone really wants to do that, they can still write:

    let inserted = my_source.in_sink(&mut my_graph.raw_inserter());

pchampin commented 4 years ago

Proposed solution (the revenge)

Now that TripleSink and QuadSink have disappeared, the previous proposed solution is moot.

Here are some design considerations:

As a consequence, here is the solution that I think is best:

This solution comes with an additional cost, though:

The first option imposes an effort on all implementors of Graph/Dataset. The second option imposes that effort only on those implementors who want to support collect_graph/collect_dataset, but users may in turn be surprised if not all Graph/Dataset implementations support that. I'm still not sure which one I prefer.

pchampin commented 4 years ago

Solved by d7cb8c3.