pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data
Other
228 stars 24 forks source link

Please rework TermMatcher #153

Closed KonradHoeffner closed 9 months ago

KonradHoeffner commented 9 months ago

While the TermMatcher trait is simple in a theoretical/mathematical sense, in practice just looping through the whole knowledge base and checking everything with the match method can be too slow to be viable, especially if the knowledge base is only available in compressed form, such as with HDT. This causes workarounds that are cumbersome and error prone, for example in the HDT library I currently treat anything as "Any" where constant doesn't return true. It's also not possible to check for the Sophia Any implementation because the std::any::TypeId::of method only works with a static lifetime like S: TermMatcher + 'static but I can't use that in an implementation., see also https://github.com/pchampin/sophia_rs/issues/150 for a problem the current implementation is causing.

Thus I propose something like the following:

enum TermMatcher {
   Constant(Term)
   Any,
   List(Vec<Term>) // or a set, could help optimization if it is sorted 
   Deterministic(...),  // this one doesn't change, so one could for example map it to a set of database IDs once 
   NonDeterministic(...), // in case the term matcher can return different results on different calls, maybe that is also overkill and it being Deterministic could just be always true and the two could be called "Other"  
}

Would that be possible?

pchampin commented 9 months ago

I am not opposed to improve TermMatcher, but, respectfully, I am not convinced by the solution you propose above, nor by the rationale you have for it (see also my response to #150).

What could be done, however, is to replace the [TermMatcher::constant] by a more complex method (call it optimHint for example), which would be an enum similar to the one you propose above. Different implementations would be able to take advantage of different hints...

Proposal:

enum TermMatcherHint {
    NoHint, // you need to call  the match method...
    Constant(Term), // matches a unique term
    List(Vec[Term]), // matches a pre-determined list of terms
}

I don't think that nondeterministic matcher should be a thing :-)

One thing that might be interesting to add, though, would be Range(Term, Term), (for some determined order on terms), because some implementations (based on ordered arrays or BTrees) might also take advantage of this.


NB: Any would still return NoHint, and that's OK, because calling Any::match, which always returns false, will be optimized away anyway.

KonradHoeffner commented 9 months ago

Yes, you are completely right, since your explanation in #150 I finally understand the monomorphization (at least practically, I still have to read up on the theory behind it) and have no current problem with the TermMatcher trait anymore. I think your proposal sounds great but as I currently don't use, and thus can't benchmark, non-Any non-constant matchers, I'm not sure what would actually be the most useful interface in practice and whether I should close this issue or if you and maybe others want to further develop this idea with the TermMatcherHint. For example, could Constant(Term) and List(Vec[Term]) be unified to something like a Term slice &[Term] so that developers could write a single loop that merges all the results (maybe Rust also optimizes away single element slices?) and that codepath would also take care of the special case of a slice with exactly one element, maybe that would be a minuscule overhead. I would definitely use the TermMatcherHint::List(Vec[Term]) in the HDT library though in case users ever take advantage of it. And as for the Range(Term, Term) I think it would be really useful to have an implementation to test that requires it.

pchampin commented 9 months ago

@KonradHoeffner I'm glad we reached an agreement. I'll open a new issue, in order to spare future readers the trouble of reading through our discussion before getting the core of the proposal.