pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data
Other
214 stars 23 forks source link

Make 'Term' a trait #55

Closed MattesWhite closed 4 years ago

MattesWhite commented 4 years ago

Proposed first in #35 I suggest to introduce a trait Term.

Objective

sophia's aim is to become a central API for RDF in Rust (#23). As such it would be nice if third party terms could smoothly interoperate with sophia's ecosystem, e.g. rio's terms could be stored in sophia's HashGraph, N3 terms from metis could use the Serializer interface or another crate could provide compatible, ASCII-based terms that are based on &[u8] (maybe even going towards no_std-RDF :blush:).

Required changes

With the latest effort to turn each kind of term (IRI, literal, blank node and variable) into an own type with an aligned API the step to create a trait is no longer that much work. A problem could be that this could turn advanced traits, such as Graph, in a mess of trait bounds. I'd like to commit a PoC-PR to see how it goes.

Suggestion

Here a suggestion how a trait may look like:

trait Term: Clone + Hash + Debug + Display 
    + PartialEq + PartialEq<Iri> + PartialEq<Literal> + PartialEq<BlankNode> + Eq
where
    Iri: TryFrom<Self>,
    Literal: TryFrom<Self>,
    BlankNode: TryFrom<Self>,
{
    type TermData: TermData;

    fn value(&self) -> String;
}

The focus should be to keep the Term trait as small as possible while other behaviour is kept separate in other traits, e.g. Resolve.

pchampin commented 4 years ago

I'm waiting to see your POC, but I'm not sure exactly what benefit you would gain from such a trait, especially if Iri, Literal and BlankNode are still constrained to be concrete types...

Note that nothing forces implementations of Sophia's traits (Graph, Dataset, Parser, Serializer...) to use Term internally. For example, Rio parsers do not, but still interoperate with Sophia. All that is required is to provide easy ways to convert internal "terms" to and from Sophia's Term.

For that matter, thanks to the TermData parameter, Term is already rather flexible:

That being said, as I wrote, I'm curious to see your POC, and I remain open to discussion on this issue.

pchampin commented 4 years ago

A part of this comment from @MattesWhite on #62 is more relavant to this topic:

Ideally, I could reuse the prefix-parser from Turtle for N3. Which means that I would like to return an Iri as Turtle and N3 use different terms which both implement From<Iri>. In fact, my N3Term looks something like:

enum N3Term<TD> {
    Iri(Iri<TD>),
    Literal(Literal<TD>),
    Existential(BlankNode<TD>),
    Universal(Variable<TD>),
    Formula(Formula<TD>), // some kind of `Graph<N3Term>`
    List(Vec<N3Term<TD>>),
}

As you can see I would like to reuse a Graph implementation of sophia which I can't do because it's fixed to sophia_term::Term.

Thanks, this helps me understand the benefit you expect from a Term trait.

I had already given some thoughts (although not as much as you) to N3 and Sophia, and my idea was that N3 could be encoded in the generalized model provided by Sophia. Namely:

So rather than forcing some N3Term type into a plain Sophia Graph, I would have gone for using a Sophia Dataset as a back-end, and wrapping it in a N3Graph providing a higher level API. I don't know if that makes sense.

May be you have already explored this lead and found it unfeasible or suboptimal... It is very possible that I am overlooking some problems of this approach.

But what you propose seems to extend Sophia's scope much beyond RDF. Granted, the current generalized model already extends it... but this is even much more. For example, what you suggest is that not all implementations of the Term trait would be entirely interoperable. If I get a triple out of an N3Graph, and that triple involved a formula, what happens when I try to insert it in another graph?

MattesWhite commented 4 years ago

Granted your suggestion for a N3Graph could work. But you are right. Such a hacked implementation is far from what I imagine. Given the complexity we already have in returning triples, further abstraction don't make things better. All in all, I think I should be satisfied with the possibility to re-use sophia's Term and its subtypes, implementing N3 on top is still a huge amount of work. However, I'm convinced that an N3-engine in Rust could out-perform existing N3-engines by orders of magnitudes if we look at cwm (Python), ldfu (Java) and EYE (Prolog).

In summary I'll close this issue as this (as you mentioned) exceeds the intended implementation of RDF to far.


Problems with Dataset as N3Graph

A short list of problems of your suggestion for the record:

As mentioned above, I imagine a performance optimized N3-engine. Accordingly, an optimized implementation of a Formula is more practical than re-using sophia's Graphs or Datasets.

pchampin commented 4 years ago

Again, I am thrilled by your project of an N3 implementation in Rust. And I sympathize entirely with your concerns about performances. Actually, thinking a little more about that, I beleive this should be the opposite: N3Graph should have its own implementation, but comply with the Graph traits (and/or possibly the Dataset traits, although that requires some thinking...).

MattesWhite commented 4 years ago

While thinking on a graph implementation and the map() stuff I stumbled upon another use-case for a Term trait:

In a bunch of cases terms are passed as read-only references, e.g. the filter methods of Graph. Currently, reference passing is either done by monomorphization or &RefTerm. Both have disadvantages. Monomorphization blows up code-size and compilation times while &RefTerm requires often the creation of a new Term instance. I think this is one of the rare cases where passing a trait object per &dyn TermTrait would be more sufficient.

The described use-case is similar to TermMatcher and indeed we could re-use or improve it.

I haven't flashed out this thought entirely but I think we could use this to simplify some of sophia's APIs. In addition, it may make sense to move Graph's filter-methods to TripleSource (as well as for Dataset and QuadSource) (#52).

pchampin commented 4 years ago

If I understand trait objects correctly, &dyn TermTrait is not a direct reference to your, say, RcTerm, but a reference to an object, automatically created by the compiler, that contains a pointer to your RcTerm as well as a table of pointers to its methods (the equivalent of C++'s virtual table). My guess is that creating this trait object is at least as expensive as creating a RefTerm...

But yes, we could consider simplifying the API, and requiring RefTerms everywhere the current API accepts Term<T> for any T. Note however, that with the current approach, developers have a choice: reduce compile time and image size by only calling Graph methods with RefTerms, or reduce execution time by passing whatever term they have at hand, and relying on monorphization. In any case, thorough testing of different options is the only way to know which is the best.

In addition, it may make sense to move Graph's filter-methods to TripleSource

This is already the case. Any iterator of Result<Triple, _> is a TripleSource. And as a bonus, itis also an iterator :-) https://github.com/pchampin/sophia_rs/blob/f8f030448efa70c2a0576319ec796c6fe1935ec4/sophia/src/triple/stream/_iterator.rs#L8-L13

MattesWhite commented 4 years ago

Okay I agree with you the use of RefTerm mitigates the usage of a trait-object &dyn TermTrait. In fact, I tried a little bit around to implement a TermTrait that hasn't a <TD> type parameter which is necessary to avoid monomorphization. If it should be usable it requires to be comparable to at leas some kind of &Term<TD> and I didn't find a way of doing it with returning a RefTerm :roll_eyes: ...


If I understand trait objects correctly, &dyn TermTrait is not a direct reference ...

Well, your understanding is mostly true. &dyn Trait is a so-called fat pointer that, as you stated correctly, contains two pointers: The first points to the data passed in and the second points to the datatype's vtable. All in all, the creation of a trait object only requires taking a pointer to the data and the vtable-pointer is probably already defined at compile time which means creating a trait object has nearly the same cost as taking a single normal reference where a RefTerm allocates a whole new Term (see size-table below). The real cost of a trait object is that it's methods can no longer be inlined by the compiler, probably resulting in less efficient/optimized code.

Maybe it is better to show the impact of an trait object when we look at the size of different Terms:

// Results of std::mem::size_of<${type}>():
Size of RefTerm:       72
Size of Term<String>:  96
Size of BoxTerm:       72
Size of MownTerm:      96
Size of RcTerm:        72
Size of ArcTerm:       72
Size of &dyn Error:    16
-------------------------
Size of Term<&str>:    72
Size of Literal<&str>: 64
Size of Iri<&str>:     40
Size of &str:          16

As you can see on my 64bit machine each RefTerm requires the allocation of 72 additional bytes of the stack (9 words!) while a trait object always only has a size of 16 bytes, i.e. 2 words.

pchampin commented 4 years ago

All in all, the creation of a trait object only requires taking a pointer to the data and the vtable-pointer is probably already defined at compile

Makes sense. I stand corrected, then.

The point your raise about the size of Term in general is valid. One question that I still have is: is it worth it two have IRIs split in ns+suffix? How much time and space does it save, compared to the burden of carrying fat Terms around? Unfortunately, investigating this would be very time consuming... This would be easier if Term was a trait, though :-D

Regarding monoporphization of Graph/Dataset methods, it looks as a different (although related) issue to me. Would you mind posting a new issue on that particular subject?

pchampin commented 4 years ago

Thought: a good point about a Term trait would be that types such as Iri or Literal could implement that trait too. This would spare us conversions in many situations.

Other thought: if we want to be able to use &dyn Term, then that puts limitations on the methods of Term. In particular, it can have no "constructor" (method returning Self). This limitation might come back to bite us sometime...

pchampin commented 4 years ago

Yet another thought: currently the Term<T> type implements PartialEq<Term<U>>, meaning that I can write t1 == t2 with any two terms. I don't think we could make it with a trait (especially given the constraints on trait objects).

Of course, we could define a function fn term_eq<T1: Term, T2: Term>(t1: &T1, t2: &T2) -> bool but that would not be as nice. Is this really bad, though...

pchampin commented 4 years ago

@MattesWhite I created a branch term_trait to experiment with this idea.

Have a look at 9e878ad. If that's a kind of things you had in mind, we can try to propagate it into the rest of the API and see how it fits...

MattesWhite commented 4 years ago

I think 9e878ad is a good point to start I added some comments to the commit.

TL;DR:

The idea to downcast TermTrait seems promising. But instead of adding a TermKind I would introduce "KindTraits" that give a &str-only view on the underlying data.

pchampin commented 4 years ago

At last, this is implemented by 91bbb99. This is a big change (including a few unrelated refactoring).

@MattesWhite (or anyone else): if you struggle updating your code to this new version, contact me and I'll to help.

MattesWhite commented 4 years ago

First of all big respect for this major refactoring of the whole crate :clap: . I see the huge amount of work that has gone into it :+1: and the trait really integrates well.

But with such a large change there are some points I'd like to address which could further improve sophia.

str_absolute()

This function currently returns i8 I'd like to suggest to use a dedicated enum. It shouldn't add overhead and makes the code more readable.

term_*() functions

You introduced several term_*() functions to provide generic default implementations for several standard traits, e.g. Display, PartialCmp and so on.

To make better use of Rust's type system I propose to introduce a dedicated type for these implementations:

struct GenericTermImpl<T: TTerm + ?Sized>(T);

// with
impl<T> Deref for GenericTermImpl<T> {
    type Target = T;

    ...
}

// and 
impl<T> From<T> for GenericTermImpl<T> {...}

This new-type clearly shows when a generic implementation is used and that implementors have the opportunity to write optimized implementations for their own types. In addition, this wrapper monomorphizes to nothing so it doesn't add overhead.

raw value

There are a bunch of raw_*() functions. I think there could be better documentation and more readable code by introducing a dedicated type:

pub struct RawValue<'a>(pub &'a str, pub Option<&'a str>);

I'd gladly submit PRs for my proposals if you like. However, with ESWC approaching I have only a limited capacity currently.

pchampin commented 4 years ago

Thanks for the appreciation ;-)

MattesWhite commented 4 years ago

With the latest commit 9464afa the new trait TTerm is now part of the public API and well integrated into the crate. Therefore, I consider this issue successfully done :+1: