pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data
Other
214 stars 23 forks source link

Do we still need `TripleSink` and `QuadSink`? #38

Closed pchampin closed 4 years ago

pchampin commented 4 years ago

In #37 @MattesWhite points out that sinks are not very idiomatic.

I introduced them to be able to consume sources, and I introduced sources because iterators had limitation (see the stream module documentation).

But we could reuse the notion of streaming mode, introduced in the Graph and Dataset traits, to make sources consumable in a more idiomatic way, e.g. with a for_each_triple method (resp. for_each_quad)...

Let's investigate that...

MattesWhite commented 4 years ago

I introduced them to be able to consume sources, and I introduced sources because iterators had limitation [...].

As TripleSource has no longer a lifetime those limits are now mitigated. Furthermore, TripleSources as fallible Iterators are well handled by the resiter crate (even if I'm a bit concerned about its license, maybe fallible-iterator can be a replacement). In addition, we developed StreamError to handle situations where both source and sink could fail.

But we could reuse the notion of streaming mode, introduced in the Graph and Dataset traits, to make sources consumable in a more idiomatic way, e.g. with a for_each_triple method (resp. for_each_quad)...

I don't realy understand what you want to express here. As stated in my quoted comment, I think that all current uses of Sink can be replaced by direct implementations. The only method I miss for convienience is an adapted version of Iterator::collect(). I would do something like:

trait FromTripleSource {
    type Error;

    fn from_triple_source<TS: TripleSource>(ts: TS) -> Result<Self, StreamError<TS::Error, Self::Error>>;
}

// and this completed by
trait TripleSource {
    ...

    fn collect_triples<FTS: FromTripleSource>(self) -> Result<FTS, StreamError<Self::Error, FTS::Error>> {
        FTS::from_triple_source(self)
    }
}
pchampin commented 4 years ago

From your comments above, I gather that my explanation (in the tripe::stream module) of why we need TripleSource is not clear enough. But thanks for your question about collect, that's a perfect example of the (too) strong assumption that Iterator makes on its items.

Here is a proposal to replace (or complement) the current doc in triple::stream. Is it clearer for you?


Consider a parser using 3 buffers to store the subject, predicate, and object of the triples it parses. Each time it extracts a triple from the data, it yields it (as 3 references to its internal buffers) through the for_each_triple method. Then, it reuses the buffers to store the data for the next triple, and yields the new triple, as 3 references to the same buffers.

An iterator can not be implemented this way, because items yielded by iterators are expected to live (at least) as long as the iterator itself. Consider for example the collect method of iterators, it would not work if the items were as short lived as in the example above.

Because many parsers (as well as other triple sources) will be implemented in a manner similar to that described above, we have to provide a trait with weaker assumptions on its items than Iterator has.

The alternative would be to wrap such parsers with a layer that would copy the data from internal buffers to fresh buffers for each triples, and we do not want to impose that cost on all implementations — especially when many consumers will be happy with short-lived references.

MattesWhite commented 4 years ago

Sorry but I don't see the use case you describe. It don't makes sense to me to have short lived buffer contents.

Here are the use cases (for parsers) I can imagine and roughly how I would implement them:

Parse from String

The parser is implemented in a zero-copy way and gives out RefTerms (TD = &'source str). In this case RefTerms can be stored within the parser without a problem as the lifetime is tight to the source so the references 'content' will outlive the parsing anyway. Note that instead of &'source str you can always return &'static str as it is guaranteed to outlive 'source. This can become handy when some shorthand like a or implicit datatypes are parsed in Turtle. So no need for internal, owning buffers here.

Parse from big document

Lets assume that the RDF-document we parse is to big for our machine's memory. Indeed, this is an interessting case where it could be desireable to only parse one triple (NTriples) or one block (Turtle) at a time and give out references of this temporary read. Here I could think of an architecture of iterator-of-iterators. The outer-iterator reads from the document ('doc) units of interesst ('doc : 'part this means the lifetime 'part is ensured to be shorter than 'doc see here). Those parts then are parsed individually by parsers that hand out RefTerm<'part>. I think this could work.

Read from persistent data structure

Here I think no parsing is needed but Graphs are directly constructed (e.g. HDT format). For this case you found already a very elegant solution.


As you can see in my opinion I can't figure out the problem you are facing.

BTW It is not clear to me what the function for_each_triple() is supposed to do... something like Iterator::for_each()?

And the most important question: What has this to do with sinks?

pchampin commented 4 years ago

Parse Strings

Things are not that simple... Consider the following turtle

@base <http://example.org/foo/bar/>
<../BAZ> <http://schema.org/label> "\u2693"; 
    <http://example.org/other-prop> "\u2744".

The subject of the first triple, can not reference to the inner data of the string (it must contain http://example.org/foo/BAZ), and neither can the object (it must contain ). So you need internal buffers. And once the first triple has been yielded, it makes sense for the parser to reuse the internal object buffer to store the new literal.

Parse big files

What you are suggesting is to have different APIs for parsing strings and files... This is not really convenient for the user.

What does it have to do with sinks

As I put it in the intro of that issue: I introduced sinks to be able to consume sources, and I introduced sources because iterators had limitation. By improving the design of sources (with the for_each_triple method), we could make it possible to consume a source without a dedicated sink object.

And yes, the for_each_triple methods is supposed to work the same way as Iterator::for_each. Again, sources are just iterators with weaker assumptions on their items.

MattesWhite commented 4 years ago

Okay, I see now that there are non trivial cases parsing RDF. So when I understand you correctly in your example you want to return a literal "☃" instead of "\u2744" which means you store the ☃ inside the parser give a reference to it out in this iteration and reuse this buffer in the next iteration, correct? Hm... difficult thing if we don't want to copy each term...

[...] because iterators had limitation.

You mentioned this before but you didn't give any examples. Could you please explain this further and how sinks are able to better consume a source?

In fact, I had real trouble implementing my serializer for TripleSource. For current API, it is not possible to consume a TripleSource that not implements Iterator as there is nothing that points out what elements are to expect. Therefore, I had to write my trait bound for Iterator<Item=Result<Tri, E>> instead of TripleSource

pchampin commented 4 years ago

By "iterator had limitations", I meant "Iterator makes too strong assumptions on the lifetime of their items", which you can rephrase "Iterators do not support short-lived items (i.e. items living not longer than a single iteration)".

I just pushed a branch (still WIP) where I improved the TripeSource trait as suggested above (plus filter, map and map_filter as a bonus). Also, a TripleSource now indicates the kind of triples it yields (reusing the trick we used for Graph's iterators to avoid the lifetime issues).

It makes it possible to consume a TripleSource using for_each_triple or try_for_each_triple, without requiring to implement TripleSink, which can be simply removed.

MattesWhite commented 4 years ago

This looks awesome now I understand what you are talking about. However, it could be hard but a next_triple() method would be nice.

pchampin commented 4 years ago

This looks awesome now I understand what you are talking about.

Thanks :-)

However, it could be hard but a next_triple() method would be nice.

Yeah, I though about it... It could look like that:

    fn next_triple<'s>(&'s mut self) -> StreamedTriple<'s, Self::Triple>;

and the bound on lifetime 's would prevent to call next_triple again while the returned streamed triple is alive (as this would cause two mutable borrows on self).

But... I can not implement this method for Rio parsers, so I ruled it out -- with regrets :cry:

On the one hand, it is a shame do make an API design decision based on one particular implementation. But on the other hand, the UnsafeTriple / StreamedTriple pattern (which is required to make the next_triple method possible) is a quite complex one (you know how hard it was to make it right... you were there :wink:), so we can't expect all implementatiosn to have a similar mechanism. Hence my decision.

Of course, when Rust finally supports GAT, things will be much easier :smile:

MattesWhite commented 4 years ago

I guess you're still trying hard to implement the new TripleSource? Writing a Turtle serializer and a Notation 3 parser, I got a lot of improvements / features pending for sophia. However, I don't want to commit PRs while you change an integral part of the crate. So a short update per week or so would be nice to have :wink: .

BTW I don't depend on a new release. My crate will not be ready for release for a long time. In fact, it is more a playground for my PhD than a serious project. Therefore, I wouldn't mind if your not spending to much effort on releasing a new version.

pchampin commented 4 years ago

I guess you're still trying hard to implement the new TripleSource?

No, I'm just buried in other stuff, much less fun... :-( Sorry to keep you waiting. I'll try to keep things moving, though. I'm very interested in your N3 parser :-)

pchampin commented 4 years ago

Solved by 1e0836ace5cf