pchampin commented 5 years ago

Problem

Currently, when inserting bnode identifiers in a graph, the bnode identifier is kept as is.

For example, loading this file into a graph:

_:b1 <tag:p> "foo".

then, loading this file into the same graph:

_:b1 <tag:q> "bar".

will result in the following graph (in Turtle):

  [] <tag:p> "foo"; <tag:q> "bar".

while it should be

  [] <tag:p> "foo".
  [] <tag:q> "bar".

i.e. two different subjects, because the bnode identifiers in the two different files have two different scopes.
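The conflation described above can be reproduced with a toy model. This is only a sketch: `ToyGraph`, its `insert` method and `load_both_files_raw` are hypothetical stand-ins for illustration, not sophia's actual API, and triples are reduced to three strings.

```rust
use std::collections::BTreeSet;

/// Toy triple: (subject bnode label, predicate IRI, literal value).
/// A simplified stand-in, not sophia's actual term model.
type Triple = (String, String, String);

/// Toy graph that keeps bnode labels exactly as given.
#[derive(Default)]
struct ToyGraph {
    triples: BTreeSet<Triple>,
}

impl ToyGraph {
    /// Low-level insert: the label is treated as stable and stored as is.
    fn insert(&mut self, s: &str, p: &str, o: &str) {
        self.triples
            .insert((s.to_string(), p.to_string(), o.to_string()));
    }

    /// Number of distinct subjects currently in the graph.
    fn subject_count(&self) -> usize {
        self.triples
            .iter()
            .map(|(s, _, _)| s.as_str())
            .collect::<BTreeSet<_>>()
            .len()
    }
}

/// Loads the two example "files" into one graph without any renaming.
fn load_both_files_raw() -> ToyGraph {
    let mut g = ToyGraph::default();
    g.insert("b1", "tag:p", "foo"); // first file
    g.insert("b1", "tag:q", "bar"); // second file: same label, different scope
    g
}
```

With this model, `load_both_files_raw().subject_count()` is 1 although the two files describe two distinct blank nodes, which is exactly the unwanted merge.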

NB: it is important for the developer to be able to handle bnodes consistently, so at the lowest level (e.g. Graph::insert), the API should consider bnode identifiers as stable. But on the other hand, the default behaviour when loading a file should be the correct one.

Proposed solution

The methods TripleSource.in_graph and QuadSource.in_dataset are the preferred way of loading a stream of triples/quads (such as the one coming from a parser) into a graph/dataset.

The proposed solution is to change the semantics of these methods, and make them rename the bnodes they receive to avoid name clashes with existing bnodes in the graph/dataset. Whether this should be done by generating UUIDs or by inspecting the target graph/dataset for existing names, I'm not sure yet...

New methods in_graph_raw and in_dataset_raw (better name?) should probably be added, which would have the current semantics of in_graph and in_dataset.

pchampin commented 4 years ago

Proposed solution (revamped)

Thinking about it a little more, I think that the right place to fix this is in graph::Inserter and dataset::Inserter respectively.

There should actually be two implementations for each, RawInserter (current implementation) and ScopedInserter (which ensures that all blank nodes passed to the inserter will be created as fresh blank nodes, renaming them if needed). The methods Graph::inserter and Dataset::inserter should also be replaced accordingly by methods named raw_inserter and scoped_inserter.
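To make the distinction concrete, here is a minimal sketch of the two inserter flavours. It assumes a graph modelled as a plain `Vec` of string triples; the `feed` method, the `fresh_label` helper and the `scopedN` naming scheme are invented for illustration and are not taken from sophia. Fresh labels are found by inspecting the target, which is only one of the two renaming strategies mentioned earlier.

```rust
use std::collections::HashMap;

/// Toy triple: (subject bnode label, predicate, object).
type Triple = (String, String, String);

/// Hypothetical inserter that stores bnode labels unchanged.
struct RawInserter<'a> {
    target: &'a mut Vec<Triple>,
}

impl<'a> RawInserter<'a> {
    fn feed(&mut self, s: &str, p: &str, o: &str) {
        self.target.push((s.into(), p.into(), o.into()));
    }
}

/// Hypothetical inserter that maps every incoming label to a fresh one.
/// One instance corresponds to one scope (e.g. one parsed file).
struct ScopedInserter<'a> {
    target: &'a mut Vec<Triple>,
    renamed: HashMap<String, String>,
}

impl<'a> ScopedInserter<'a> {
    fn new(target: &'a mut Vec<Triple>) -> Self {
        ScopedInserter { target, renamed: HashMap::new() }
    }

    /// Finds a label used neither in the target nor in this scope so far.
    fn fresh_label(&self) -> String {
        let mut n = 0usize;
        loop {
            let candidate = format!("scoped{}", n);
            let in_target = self.target.iter().any(|(s, _, _)| *s == candidate);
            let in_scope = self.renamed.values().any(|v| *v == candidate);
            if !in_target && !in_scope {
                return candidate;
            }
            n += 1;
        }
    }

    fn feed(&mut self, s: &str, p: &str, o: &str) {
        // Reuse the mapping within a scope so that a label seen twice
        // in the same file still denotes the same blank node.
        let existing = self.renamed.get(s).cloned();
        let label = match existing {
            Some(l) => l,
            None => {
                let l = self.fresh_label();
                self.renamed.insert(s.to_string(), l.clone());
                l
            }
        };
        self.target.push((label, p.into(), o.into()));
    }
}

/// Loads the two example files, each through its own scoped inserter.
fn load_two_files_scoped() -> Vec<Triple> {
    let mut graph = Vec::new();
    {
        let mut first = ScopedInserter::new(&mut graph);
        first.feed("b1", "tag:p", "foo");
    }
    {
        let mut second = ScopedInserter::new(&mut graph);
        second.feed("b1", "tag:q", "bar");
    }
    graph
}
```

In this sketch the two triples returned by `load_two_files_scoped` have different subjects, while two triples fed through the same `ScopedInserter` with the same label keep a common subject.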

The in_graph/in_dataset methods of sinks would use scoped inserters (as this is the most common use case). And I don't think (anymore) that a *_raw variant of these methods is required. If someone really wants to do that, they can still write:

    let inserted = my_source.in_sink(&mut my_graph.raw_inserter());

pchampin commented 4 years ago

Proposed solution (the revenge)

Now that TripleSink and QuadSink have disappeared, the previous proposed solution is moot.

Here are some design considerations:

Following the principle of least surprise, Graph::insert_all and Dataset::insert_all should insert the triples as is.
Following the same principle, the most obvious (and documented) way of loading a graph from a file should "do the right thing", i.e. not conflate bnodes from different sources if any.
That being said, the most common use case for loading triples from a file is to create a fresh graph containing these triples, rather than adding them to an existing graph possibly already containing triples...

As a consequence, here is the solution that I think is best:

remove the methods TripleSource::in_graph and QuadSource::in_dataset;
replace them by methods TripleSource::collect_graph<G> and QuadSource::collect_dataset<D>. These new methods are idiomatic (they mimic the Iterator::collect<T> method).

This solution comes with an additional cost, though:

either the Graph and Dataset traits have to specify a way to construct an empty Graph/Dataset,
or the G/D parameters require an additional trait bound (Default? a new FromTripleSource trait?) in addition to Graph/Dataset...

The first option induces an effort on all implementors of Graph/Dataset. The second option induces that effort only on those implementors who want to support collect_graph/collect_dataset, but users may in turn be surprised if not all Graph/Dataset implementations support that. I'm still not sure which one I prefer.
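For illustration, the second option could look roughly like the sketch below, using `Default` as the additional bound. The traits `ToyGraph` and `ToyTripleSource`, the `triple_count` method and the `parse_example` function are hypothetical simplifications, not the signatures that ended up in sophia; the source is modelled as a plain iterator of string triples.

```rust
/// Toy triple: (subject, predicate, object).
type Triple = (String, String, String);

/// Minimal stand-in for a mutable graph trait.
trait ToyGraph {
    fn insert(&mut self, t: Triple);
    fn triple_count(&self) -> usize;
}

impl ToyGraph for Vec<Triple> {
    fn insert(&mut self, t: Triple) {
        self.push(t);
    }
    fn triple_count(&self) -> usize {
        self.len()
    }
}

/// Minimal stand-in for a triple source: any iterator of triples.
trait ToyTripleSource: Iterator<Item = Triple> + Sized {
    /// Builds a fresh graph of type G from this source, in the spirit of
    /// `Iterator::collect`. The extra `Default` bound sits on G only,
    /// so graph types without `Default` simply cannot be collected into.
    fn collect_graph<G: ToyGraph + Default>(self) -> G {
        let mut g = G::default();
        for t in self {
            g.insert(t);
        }
        g
    }
}

// Every iterator of triples is a source in this toy model.
impl<I: Iterator<Item = Triple>> ToyTripleSource for I {}

/// Collects a one-triple source into a fresh graph.
fn parse_example() -> Vec<Triple> {
    let source = vec![(
        "b1".to_string(),
        "tag:p".to_string(),
        "foo".to_string(),
    )]
    .into_iter();
    source.collect_graph::<Vec<Triple>>()
}
```

Because the graph is always created fresh by `collect_graph`, no bnode from the source can clash with a pre-existing one, which is why this design sidesteps the renaming question for the common case.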

pchampin commented 4 years ago

Solved by d7cb8c3.

pchampin / sophia_rs

BNode scope when loading triples/quads into graph/dataset #1
