Serialize isomorphic graphs to identical Turtle

GordianDziwis commented 3 months ago

I have a project where I export diagrams to RDF. How can I influence the order of statements when I serialize a sophia graph to turtle?

pchampin commented 3 months ago

@GordianDziwis thanks for your interest in Sophia.

The short answer is: you cannot guarantee that the Turtle you produce is 100% identical to the source Turtle.

Longer answer

What you can do to preserve order

the results of parsers are triple sources, which are ordered;
you need to store the triples in a structure that preserves that order, which is not the case for all Graph implementations (because RDF graphs are inherently sets, hence unordered);
an example of a Graph implementation that preserves order is Vec<T> where T: Triple;
serializers have a serialize_triples method which expects a triple source, and is more likely (see below) to preserve that order in the serialization.
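
To illustrate the point about order-preserving storage, here is a minimal sketch that uses only the Rust standard library. `SimpleTriple` and `write_ntriples_in_order` are hypothetical names made up for this example; they are not part of Sophia, and the triples are assumed to contain IRIs only. The sketch only shows that a `Vec` keeps insertion order, so a writer that walks it sequentially emits the statements in that same order.

```rust
// Hypothetical, simplified triple: subject, predicate and object as IRI strings.
// Sophia's own Triple trait is richer; this only illustrates ordering.
type SimpleTriple<'a> = [&'a str; 3];

/// Writes the triples as N-Triples lines, in exactly the order they are given.
/// Assumes every term is an IRI (no blank nodes, no literals).
fn write_ntriples_in_order(triples: &[SimpleTriple<'_>]) -> String {
    let mut out = String::new();
    for [s, p, o] in triples {
        out.push_str(&format!("<{s}> <{p}> <{o}> .\n"));
    }
    out
}
```

Storing the same triples in a `HashSet` instead of a `Vec` would lose this property, since its iteration order is unspecified.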

Limitations

The contract of [Serializer::triple_source] does not guarantee, in general, that the order of the triples in the source will be preserved in the serialization.

the N-Triples serializer will preserve order, because that's the simplest thing to do;
the Turtle serializer will not preserve order when the pretty option is on, because it prioritizes conciseness and, well, prettiness;
the Turtle serializer without the pretty option will, currently, preserve order, but I can't guarantee that future implementations will (see rationale below).

More generally, there are many issues, beyond triple order, that make it practically impossible to preserve the Turtle representation between parsing and serializing. Take, in particular, prefix declarations:

currently, Sophia does not provide a means to retrieve the prefix declarations of a given Turtle source (although this is planned for the near future, see #45);
even when it does, it will not account for prefix overriding, as in the following example:
```
@prefix : <https://example.org/1/>.
:s :p :o.
@prefix : <https://example.org/2/>.
:s :p :o.
```
and the Sophia turtle serializer will never generate something like that.

Another issue would be heterogeneity with respect to "prettiness". Consider the following:

```
@prefix : <https://example.org/>.
:s :p1 [ :a :b ].
:s :p2 _:b.
_:b :c :d.
```

This Turtle is a mix of pretty and non-pretty. There is no way to serialize it back as is with Sophia.

GordianDziwis commented 2 months ago

Thank you for your detailed answer!

I do not care so much about the order of the triples or that TurtleIn == TurtleOut for TurtleIn => Graph => TurtleOut, but that the same graph always results in the same Turtle document.

The first use case is version control, I build a graph programmatically and have the serialized graph in git.

And as you said a graph is a set, but the triple source for a serializer is ordered and the order can influence the serialization.

For me, it would be enough if the Turtle serializer with pretty would produce the same Turtle for the same ordered triples from a triple source (is this already the case?). Getting the same output for any order would be even nicer.

pchampin commented 2 months ago

> Thank you for your detailed answer!
>
> The first use case is version control, I build a graph programmatically and have the serialized graph in git.

got it

> And as you said a graph is a set, but the triple source for a serializer is ordered and the order can influence the serialization.

indeed

> For me, it would be enough if the Turtle serializer with pretty would produce the same Turtle for the same ordered triples from a triple source (is this already the case?).

I don't know, off the top of my head, if that's already the case, but even if it is currently, I would not rely on it, because I would not consider it to be a design goal of the Turtle serializer.

What you want is a canonical representation of the RDF graph. The good news is that there is a brand new standard for that (https://www.w3.org/TR/rdf-canon/) and that it is implemented in Sophia (https://docs.rs/sophia_c14n/latest/sophia_c14n/rdfc10/index.html). The not-so-good news is that this canonical representation is based on N-Quads (N-Triples if you have a single graph), so it is much more verbose than "pretty" Turtle. But note that N-Triples is a subset of Turtle, so you can still feed it to your Turtle parser and it will work.
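
To make the idea of an order-independent serialization concrete, here is a minimal sketch using only the Rust standard library. It is not RDFC-1.0 and does not use the sophia_c14n API: `sorted_ntriples` is a hypothetical helper, and it assumes a graph made of IRIs only (no blank nodes, no literals). Under that assumption, sorting and deduplicating the N-Triples lines is enough to get the same document for the same graph, whatever order the triples arrive in. As soon as blank nodes are involved, this shortcut breaks down, because their labels are arbitrary, and that is the hard part the canonicalization standard addresses.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper, not part of Sophia: serializes ground triples
/// (IRIs only; no blank nodes, no literals) as N-Triples lines.
/// The BTreeSet removes duplicate lines and iterates in sorted order,
/// so the output does not depend on the input order.
fn sorted_ntriples(triples: &[[&str; 3]]) -> String {
    let lines: BTreeSet<String> = triples
        .iter()
        .map(|[s, p, o]| format!("<{s}> <{p}> <{o}> .\n"))
        .collect();
    lines.into_iter().collect()
}
```

With a helper like this, two runs that build the same blank-node-free graph in different orders would produce byte-identical files, which is what keeps diffs in git meaningful.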

GordianDziwis commented 2 months ago

Yeah this is what I want, thanks!

pchampin / sophia_rs