pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data

Concerns about performance impact of triples_matching in the new Graph interface in 0.8.0-alpha #130

Closed KonradHoeffner closed 9 months ago

KonradHoeffner commented 1 year ago

I started refactoring HDT to the upcoming new Graph trait, and I really like that I now only have to implement triples() and triples_matching() instead of a huge number of triples_with_xy methods. However, if I understand the new trait correctly, it could be a threat to high-performance querying.

Right now I have it like this (taking your default implementation and replacing the resiter part, as I don't have that dependency):

    fn triples_matching<'s, S, P, O>(&'s self, sm: S, pm: P, om: O) -> GTripleSource<'s, Self>
    where
        S: TermMatcher + 's,
        P: TermMatcher + 's,
        O: TermMatcher + 's,
    {
        Box::new(
            self.triples().filter(|r| r.is_ok()).filter(move |t| {
                t.as_ref().unwrap().matched_by(sm.matcher_ref(), pm.matcher_ref(), om.matcher_ref())
            }),
        )
    }

One of my datasets has 135 million triples. With this implementation, HDT would need to decompress and sequentially iterate through all of them, which would take a huge amount of time. All the usual query strategies that depend on the triple pattern, such as using an index or binary search, could not be used.

A workaround would be to panic or return an error whenever one of the given matchers returns None from its constant() function, but that would be confusing to users and unnecessarily complex.

Much more problematic, however, is that, as far as I'm aware with my limited Rust knowledge, you can't pattern match on traits the way you can on enums. This means I have no way to know whether the given matcher is actually sophia::api::term::matcher::Any.

Suggestions

In my opinion, querying triple patterns is the central operation on a graph, and "matching" should be renamed to "filter" so that it aligns better with SPARQL and users can estimate the performance impact. Here are some suggestions:

Single Optionals

I would prefer this as I think matching multiple values is not used much in practice (and you could still use the old triples_matching for that).

fn triples_with_pattern(sto: Option<&Term>, pto: Option<&Term>, oto: Option<&Term>) -> ...

This is nearly the same as your current impl<T> TermMatcher for Option<T>, with the important distinction that None would match everything, because it does not make sense to use a matcher that matches nothing, as you would never get any results anyway. This would allow a straightforward and optimized implementation like this:

    if let Some(st) = sto {
        // ... use a subject-based index
    } else if let Some(pt) = pto {
        // ... use a predicate-based index
    } else {
        // ... use an object-based index
    }

Slices

While I would prefer this less, it would allow matching multiple (but not all) values. This is nearly the same as your current impl<T> TermMatcher for &[T], again with the distinction that an empty slice would mean all values, not none of them.

fn triples_with_pattern(sts: &[Term], pts: &[Term], ots: &[Term]) -> ...

pchampin commented 1 year ago

I think your concerns about the new API are unfounded; I'll try to explain why.

But you are right: this implementation of triples_matching is suboptimal -- and there is not much point in overriding the default implementation in that case, because this is more or less exactly what the default implementation is doing.

If you are able to optimize at least some cases (e.g. when the subject matcher is a constant), then you should adopt a pattern similar to what the default implementation of triples_matching was doing in v0.7. This is what you started doing here. My suggestion at the end of that issue, of adding a filter_triples for the non-constant matchers, was only meant to support the matchers that you can't optimize. It was not supposed to replace all the smart things you were doing above!

Much more problematic, however, is that, as far as I'm aware with my limited Rust knowledge, you can't pattern match on traits the way you can on enums. This means I have no way to know whether the given matcher is actually sophia::api::term::matcher::Any.

The beauty of it is that you don't need to. The compiler does it for you, thanks to monomorphisation. In a nutshell, when you have a generic function f<S, P, O>(), the compiler compiles several versions of that function, namely one for each combination of concrete types <S, P, O> with which you actually use it.

So when you call triples_matching(some_term, Any, Any), it compiles a version of triples_matching for the types <SimpleTerm, Any, Any>. Since SimpleTerm::constant() always returns Some(...), the compiler will collapse the match or the if that tests if the subject is a constant. Since Any::matches(...) always returns true, the call to Iterator::filter or TripleSource::filter_triples will also be optimized away... So that particular version of triples_matching will compile down to exactly what you would have written in triples_with_s. :grin: Does that make sense?

PS: of course, the correct way of implementing triples_matching, taking advantage of TermMatcher::constant and of monomorphisation, should be thoroughly documented.
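
To make that concrete, here is a toy, self-contained illustration of the pattern (none of this is Sophia code; the trait and all the names below are invented for the example, with constant() playing the role of TermMatcher::constant):

    // Toy stand-ins, invented for illustration; `AnyValue` plays the role of
    // sophia::api::term::matcher::Any, `Exactly` that of a single-term matcher.
    trait Matcher {
        /// The single value this matcher accepts, if it accepts exactly one.
        fn constant(&self) -> Option<&str>;
        fn matches(&self, value: &str) -> bool;
    }

    struct AnyValue;
    impl Matcher for AnyValue {
        fn constant(&self) -> Option<&str> { None }
        fn matches(&self, _: &str) -> bool { true }
    }

    struct Exactly(String);
    impl Matcher for Exactly {
        fn constant(&self) -> Option<&str> { Some(self.0.as_str()) }
        fn matches(&self, value: &str) -> bool { value == self.0.as_str() }
    }

    // One copy of this function is compiled per concrete `M`. After inlining,
    // the optimizer can drop the scan branch for `Exactly` (constant() always
    // returns Some) and the "indexed" branch for `AnyValue` (constant() always
    // returns None), which is the collapsing described above.
    fn lookup<'a, M: Matcher>(
        data: &'a [String],
        m: &'a M,
    ) -> Box<dyn Iterator<Item = &'a String> + 'a> {
        if let Some(c) = m.constant() {
            // "indexed" path: in a real store this would be an index lookup
            Box::new(data.iter().filter(move |v| v.as_str() == c))
        } else {
            // generic path: scan everything and filter
            Box::new(data.iter().filter(move |v| m.matches(v.as_str())))
        }
    }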

pchampin commented 1 year ago

NB: your implementation of triples_matching above is not exactly equivalent to the default one, because it discards the errors that may occur in triples (because of .filter(|r| r.is_ok())), which is questionable.
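
For illustration, one way to keep those errors (a sketch reusing the names from the snippet above, not necessarily what the default implementation does) is to let Err items pass through the filter unchanged:

    Box::new(self.triples().filter(move |res| match res {
        // propagate errors to the consumer instead of silently dropping them
        Err(_) => true,
        // keep only the triples accepted by all three matchers
        Ok(t) => t.matched_by(sm.matcher_ref(), pm.matcher_ref(), om.matcher_ref()),
    }))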

pchampin commented 1 year ago

Another remark: if the subject matcher is a slice of more than one term, you currently have to treat it as a generic matcher, because constant() returns None. In your case, this will fall back to filtering on all triples -- while it would probably be more efficient to retrieve the triples for each term in the slice individually.

So maybe the constant() method should be complemented (or even replaced) with a more generic list() method that would return an exhaustive list of the values accepted by the matcher. More precisely, it would return None if such a list cannot be built, or Some(thing) if it can...
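
For concreteness, such a complement could look roughly like this (a hypothetical sketch with invented names and signatures, not the actual Sophia API):

    // Hypothetical sketch only, not the real TermMatcher trait.
    trait ListableMatcher {
        type Term;
        /// Existing idea: the single term this matcher accepts, if any.
        fn constant(&self) -> Option<&Self::Term>;
        /// Proposed: an exhaustive list of the accepted terms,
        /// or None if no finite list can be built.
        fn list(&self) -> Option<Vec<&Self::Term>> {
            self.constant().map(|t| vec![t])
        }
    }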

KonradHoeffner commented 1 year ago

A list() method would certainly be faster for HDT, because the filtering could then be done on the ID level, whereas otherwise the strings would need to be extracted first. Or, depending on the query type, the size of the list, and the number of subjects/predicates/objects, it could also be faster to execute the query for each item in the list and combine the results.
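
For illustration, the "one query per listed term" strategy could look roughly like this (a toy sketch with a plain HashMap standing in for an index; no HDT or Sophia types involved):

    use std::collections::HashMap;

    // Run one indexed lookup per term in the list and chain the results.
    fn query_each_and_chain<'a>(
        terms: &'a [String],
        index: &'a HashMap<String, Vec<String>>,
    ) -> impl Iterator<Item = &'a String> + 'a {
        terms
            .iter()
            .flat_map(move |t| index.get(t).into_iter().flatten())
    }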

However, I don't use such queries in my own applications, so as long as I don't get feedback from users who have performance problems with non-constant, non-Any matchers, I will just implement the approach you suggested in the other issue, relying on monomorphisation, and document and benchmark it.

Thus, currently I'm fine with the existing solution, and if you do add the list() method, I would prefer that the constant() method remain available.

P.S.: Yes, you are right, I need to improve the error handling.

pchampin commented 9 months ago

Thus, currently I'm fine with the existing solution

I guess we can close this issue, then?

KonradHoeffner commented 9 months ago

Oh yes, sorry, I will close it!