pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data

New serializer API #37

Closed pchampin closed 4 years ago

pchampin commented 4 years ago

with TripleSerializer and QuadSerializer traits, and TripleStringifier and QuadStringifier specializations. This is meant to replace the "duck-typing" style Config objects.

In the process, I removed some duplicated code between the nt and nq serializers.

pchampin commented 4 years ago

@MattesWhite @Tpt would you mind having a look at this PR? This is the long-awaited refactoring of the serializer module, along the lines of what I did with the parsers. What do you think?

pchampin commented 4 years ago

In a subcomment above, I promised another proposal, and I meant it, but I actually came back to my original design :-/. Let me try to explain why.

Currently, the parser library works like this:

// create a Parser -- this is not specified by the Parser trait
let p = TurtleParser::new().with_default_prefixes(&prefix_map);
// use it to parse file f1 in graph g1
p.parse(f1).in_graph(&mut g1);
// alternatively, use the triple stream API to parse file f2 in graph g2
let parsed_triples = p.parse(f2); // returns a TripleSource
g2.insert_all(parsed_triples);

I find this API intuitive, and easy to read and write.

If we try to imagine a perfectly symmetrical API for serializers, it would look like

// create a Serializer -- this is not specified by the Serializer trait
let s = TurtleSerializer::new().with_prefixes(&prefix_map);
// use it to serialize graph g1 into file f1
s.serialize_graph(&g1).in_(f1);
// alternatively, use the triple stream API to serialize graph g2 in file f2
let turtle_file = s.sink(f2); // returns a TripleSink
g2.triples().in_sink(turtle_file);

This looks (IMO) just as nice as the parser example, but it raises the question of what the serialize_graph method returns, and furthermore what use the returned object would have, other than calling its "in_" method (granted, the name is not good, but that is not the point here). So I came to the conclusion that this intermediate object should probably not exist, and I came back to the original design.

// create a Serializer -- this is not specified by the Serializer trait
let s = TurtleSerializer::new().with_prefixes(&prefix_map);
// use it to serialize graph g2 into file f2
s.serialize_graph_in(&g1, f1);
// alternatively, use the triple stream API to serialize graph g2 in file f2
let turtle_file = s.sink(f2); // returns a TripleSink
g2.triples().in_sink(turtle_file);

Does it make more sense?

MattesWhite commented 4 years ago

I'm not saying that your proposal doesn't make sense. The serializer trait you suggest is perfectly aligned with the source/sink model you follow.

I have a different model in mind for serializers. For me, a serializer's purpose is to serialize data to one target. I agree that this means there are two steps of initialization: first, create some configuration that you may be able to share across threads and so on. The difference in my thinking is that the concrete serializer is initialized with that config and a target, so each serializer owns exactly one target. Throughout the life of the serializer, I can add one or more graphs, triples and so on. Eventually, the serializer is consumed and the target is returned (therefore fn finish(self) -> Result<Self::Outcome>).
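The model described above can be sketched with toy types (NtConfig and NtSerializer are illustrative names, not sophia's actual API): a cloneable config builds a serializer bound to one target, and finish(self) consumes the serializer and hands the target back.

```rust
use std::io::Write;

// Shareable configuration; cheap to clone, e.g. one clone per thread.
#[derive(Clone, Default)]
struct NtConfig {
    ascii_only: bool, // example option; a real config would have more
}

impl NtConfig {
    // Build a serializer that owns exactly one target.
    fn serializer<W: Write>(&self, target: W) -> NtSerializer<W> {
        NtSerializer { config: self.clone(), target }
    }
}

struct NtSerializer<W: Write> {
    config: NtConfig,
    target: W,
}

impl<W: Write> NtSerializer<W> {
    // Triples can be added throughout the serializer's life.
    fn serialize_triple(&mut self, s: &str, p: &str, o: &str) -> std::io::Result<()> {
        let _ = self.config.ascii_only; // a real impl would consult the config here
        writeln!(self.target, "<{}> <{}> <{}> .", s, p, o)
    }

    // Consuming the serializer returns the target (the "Outcome").
    fn finish(self) -> std::io::Result<W> {
        Ok(self.target)
    }
}

fn main() -> std::io::Result<()> {
    let config = NtConfig::default();
    let mut ser = config.serializer(Vec::new()); // in-memory target
    ser.serialize_triple("s", "p", "o")?;
    let out = ser.finish()?;
    println!("{}", String::from_utf8_lossy(&out));
    Ok(())
}
```

Because the config alone is shared, each thread can build its own serializer on demand, each owning its own target.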

and furthermore what would be the use of the returned object,

Therefore, I suggested returning Result<()>: until serialization is finish()ed, nothing is returned, but errors are reported.

To sum up, I prefer sharing a config and building serializers on demand, e.g. per thread. As far as I understand, your solution shares one serializer that is responsible for serializing to several targets. I think that can become a problem when the serializer is used in different threads.

MattesWhite commented 4 years ago

Another point in favour of one serializer per target is streaming Turtle serialization. In order to be able to produce predicate and object lists, the serializer needs to store the last subject and predicate it serialized. This state can be used to continue serialization over two streams. When one serializer is used for several targets, this would not work. Worse, it could produce incorrect serializations, e.g. continue the serialization in another target.
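A toy sketch (not sophia's actual implementation) of why that state matters: remembering the last (subject, predicate) pair is what lets a streaming serializer decide between a full new triple, a ";" predicate list, or a "," object list.

```rust
// Minimal Turtle-like writer that keeps the last (subject, predicate)
// written, so serialization can continue correctly across several streams.
struct TurtleState {
    last: Option<(String, String)>, // (subject, predicate) of previous triple
    out: String,
}

impl TurtleState {
    fn new() -> Self {
        TurtleState { last: None, out: String::new() }
    }

    fn write_triple(&mut self, s: &str, p: &str, o: &str) {
        match &self.last {
            // same subject and predicate: extend the object list with ","
            Some((ls, lp)) if ls.as_str() == s && lp.as_str() == p => {
                self.out.push_str(&format!(", {}", o));
            }
            // same subject only: extend the predicate list with ";"
            Some((ls, _)) if ls.as_str() == s => {
                self.out.push_str(&format!(" ;\n    {} {}", p, o));
            }
            // new subject: terminate the previous block with "." first
            Some(_) => {
                self.out.push_str(&format!(" .\n{} {} {}", s, p, o));
            }
            None => self.out.push_str(&format!("{} {} {}", s, p, o)),
        }
        self.last = Some((s.to_string(), p.to_string()));
    }

    fn finish(mut self) -> String {
        if self.last.is_some() {
            self.out.push_str(" .\n");
        }
        self.out
    }
}

fn main() {
    let mut t = TurtleState::new();
    t.write_triple(":s", ":p", ":a");
    t.write_triple(":s", ":p", ":b"); // becomes an object list
    t.write_triple(":s", ":q", ":c"); // becomes a predicate list
    println!("{}", t.finish());
}
```

If this state were shared between two unrelated targets, a triple intended for one document could silently be emitted as a "," continuation in the other, which is exactly the incorrect-serialization risk described above.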

pchampin commented 4 years ago

@MattesWhite Both approaches have their pros and cons. They both have their risks/limitations (for example, it does not always make sense to serialize two graphs into the same file -- with RDF/XML, that would produce an invalid document), but they also both have their valid use cases.

So my conclusion is that we should probably support both approaches. I will amend the PR in that sense in the next few days.

Thanks a lot for your critical eye, and for challenging me into improving the design of Sophia :-)

MattesWhite commented 4 years ago

Indeed, I have thought about this problem too, and I'm curious about your solution. I suggest solving it by consulting the RDF spec:

An RDF document is a document that encodes an RDF graph or RDF dataset in a concrete RDF syntax, [...].

Consequently, serializing a Graph would also finalize the serialization process - unlike a triple serializer, as you proposed. Maybe the best solution really is to provide two serializer traits: the first for streaming triples, capable of consuming several TripleSources before finishing; the second takes one complete Graph and finishes immediately. The streaming quads version should also be able to take Graphs.

To sum up:

trait TripleSerializer {
    type Outcome;
    type Error: std::error::Error;

    fn serialize<TS: TripleSource>(&mut self, ts: TS)
        -> Result<(), StreamError<TS::Error, Self::Error>>;
    fn finish(self) -> Result<Self::Outcome, Self::Error>;
}

trait GraphSerializer {
    type Outcome;
    type Error: std::error::Error;

    fn serialize<G: Graph>(self, g: &G)
        -> Result<Self::Outcome, StreamError<G::Error, Self::Error>>;
}

trait QuadSerializer {
    type Outcome;
    type Error: std::error::Error;

    fn serialize_stream<QS: QuadSource>(&mut self, qs: QS)
        -> Result<(), StreamError<QS::Error, Self::Error>>;
    fn serialize_graph<G: Graph>(&mut self, g: &G, name: Option<&Term>) 
        -> Result<(), StreamError<G::Error, Self::Error>>;
    fn finish(self) -> Result<Self::Outcome, Self::Error>;
}

trait DatasetSerializer {
    type Outcome;
    type Error: std::error::Error;

    fn serialize<D: Dataset>(self, d: &D) 
        -> Result<Self::Outcome, StreamError<D::Error, Self::Error>>;
}

Thanks a lot for your critical eye, and for challenging me into improving the design of Sophia :-)

You're welcome. I really enjoy our discussions as well, and I would say the results we found are pretty good. Also, I'm constantly a little bit afraid of going a step too far and starting to annoy you. So if this is ever the case, please feel free to stop me ;-)

pchampin commented 4 years ago

Here is a new version of the serializer API. It is still very similar to my original proposal, but the serializing methods are now clearly carried by the sink (TripleSerializingSink and QuadSerializingSink, respectively).

For the sake of symmetry with the parser API, the use pattern that I described above is still possible, with the serialize_triples_in (etc.) methods. But those are mostly shortcuts for generating a sink and calling the corresponding method on it.

Another use pattern, closer to the one you were advocating, is now possible:

// serializing graph g1 into file f1
let ttl_file = TurtleSerializer::new().with_prefixes(&prefix_map).to(f1);
//                                                 this is new -> ^^^^^^^
ttl_file.serialize_graph(&g1);

Note that I didn't follow your latest suggestion to have separate traits for serializing sources and serializing graphs. I would rather leave it to implementors to decide what they allow and what they disallow. It makes sense in Turtle to serialize two graphs into the same file; it does not in RDF/XML, so that implementation may choose to close the underlying file at the end of serialize_graph. Still, I can imagine some more exotic targets where serializing several graphs in RDF/XML would make sense...

MattesWhite commented 4 years ago

Note that I didn't follow your latest suggestion to have separate traits for serializing sources and serializing graphs. I would rather leave it to implementors to decide what they allow and what they disallow.

Okay, you are right. It is fine to handle this by returning an error when a second graph shouldn't be serialized.
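That error-on-second-graph behaviour can be sketched with a toy one-document serializer (XmlLikeSerializer and SerError are hypothetical names, not sophia API): the serializer tracks whether a document has already been written and refuses a second graph.

```rust
// Possible error type for a format whose document encodes exactly one graph.
#[derive(Debug, PartialEq)]
enum SerError {
    DocumentAlreadyWritten,
}

struct XmlLikeSerializer {
    done: bool,
    out: String,
}

impl XmlLikeSerializer {
    fn new() -> Self {
        XmlLikeSerializer { done: false, out: String::new() }
    }

    // Serializing a graph completes the document; a second call is an error.
    fn serialize_graph(&mut self, triples: &[(&str, &str, &str)]) -> Result<(), SerError> {
        if self.done {
            return Err(SerError::DocumentAlreadyWritten);
        }
        for (s, p, o) in triples {
            self.out.push_str(&format!("{} {} {} .\n", s, p, o));
        }
        self.done = true;
        Ok(())
    }
}

fn main() {
    let mut ser = XmlLikeSerializer::new();
    assert!(ser.serialize_graph(&[("<s>", "<p>", "<o>")]).is_ok());
    // the document is already complete, so a second graph is refused
    assert_eq!(
        ser.serialize_graph(&[("<s2>", "<p>", "<o>")]),
        Err(SerError::DocumentAlreadyWritten)
    );
}
```

A Turtle implementation of the same trait could simply omit the check and keep appending, which is exactly the per-implementor freedom being discussed.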


For the sake of symmetry with the parser API, ...

I can understand that you stick to the 'serializer + sink' pattern to be symmetric with the parser. But in my view, parsing and serialization are two inherently different things. Therefore, symmetry is not a valid design factor for the serializer API, in my opinion. At first glance, parsing and serializing look symmetric: "go from document to data" and "go from data to document". However, this is already the point where the similarity ends. My claim can be demonstrated by simply parsing a document and immediately serializing it again: I'm pretty sure that for any pair of parser and serializer, the output will not be exactly the same as the input, despite being correct (as long as the input does not come from the serializer itself).

On the implementation level, the parser is right to support lazy evaluation, i.e. an Iterator API, as we might only want to check whether a certain triple is contained, or we might want to parse only a certain subset of triples. Here the Iterator can help us stop early and save resources. On the other hand, when we are serializing data, we are already in control of it, handing the serializer only the triples we want to. This means that a serializer will consume a TripleSource at once.
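The asymmetry can be illustrated with a toy lazy parser (parse_lazily is a hypothetical helper, not sophia API): the consumer of a lazy Iterator can stop at the first hit, whereas a serializer necessarily drains its whole source.

```rust
// Each line of `doc` is a whitespace-separated "s p o" triple;
// parsing happens on demand, one line at a time.
fn parse_lazily<'a>(doc: &'a str) -> impl Iterator<Item = (&'a str, &'a str, &'a str)> + 'a {
    doc.lines().filter_map(|l| {
        let mut it = l.split_whitespace();
        Some((it.next()?, it.next()?, it.next()?))
    })
}

fn main() {
    let doc = "s1 p o1\ns2 p o2\ns3 p o3";

    // Lazy consumption: `find` stops as soon as the triple is located,
    // so later lines are never parsed.
    let found = parse_lazily(doc).find(|t| t.0 == "s2");
    assert_eq!(found, Some(("s2", "p", "o2")));

    // A serializer has no such option: it consumes the whole source.
    let all: Vec<_> = parse_lazily(doc).collect();
    assert_eq!(all.len(), 3);
}
```

This is the core of the argument: early termination is a meaningful optimization on the parsing side only, so the two APIs need not mirror each other.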

To sum up, I advocate breaking the symmetry between parser and serializer. For me, the TripleSerializingSink is all the API we need for a serializer in a core crate.


BTW, the asymmetry of parsing and serializing is an example of why I'm not a fan of the Sink trait in general. It adds complexity where there doesn't have to be any. For example, the Inserter and Remover sinks could be removed by implementing the same thing directly in Graph's methods. Maybe a more convincing point is that there is no special sink type in Rust: I can think of no example, neither in the std-lib nor in any popular crate. Of course, I will change my mind if you bring any counterevidence.

pchampin commented 4 years ago

About parsing a file and re-serializing it: the fact that it does not always round-trip does not mean that the general process is not symmetric... We are considering different levels.

About sinks, you have a point, though...

pchampin commented 4 years ago

Superseded by 9a97dae