knowsys / nemo

A fast in-memory rule engine.
https://knowsys.github.io/nemo-doc/
Apache License 2.0

Increasing Join selectivity by directly joining with constant values #484

Closed matzemathics closed 3 months ago

matzemathics commented 3 months ago

This should greatly improve performance on somewhat sizeable tables, because of increased join selectivity.

aannleax commented 3 months ago

Don't we already handle this case in the filter operation? https://github.com/knowsys/nemo/blob/ac0cf3518bba990f1e69b60d167a5b40580791c8/nemo-physical/src/tabular/operations/filter.rs#L102

matzemathics commented 3 months ago

Possibly already handled (see above)

I did benchmark this, and the constant-join build is roughly 3.4x faster on the mean (13.659 s vs. 3.997 s):

@prefix dev: <file:///dev/>.

@import works :- json {resource="works.json"}.

items(?i, ?author_name) :-
    works(_, "items", ?a), works(?a, ?i, ?x),
    works(?x, "title", ?title_array),
    works(?title_array, 0, ?title_id),
    works(?title_id, value, ?title),
    works(?x, "author", ?author_array),
    works(?author_array, 0, ?author_id),
    works(?author_id, "family", ?author),
    works(?author, value, ?author_name).

@export items :- csv { resource="" }.

Benchmark 1: ./nmo-main authors.rls
  Time (mean ± σ):     13.659 s ±  2.245 s    [User: 13.643 s, System: 0.012 s]
  Range (min … max):   10.542 s … 16.577 s    10 runs

Benchmark 2: ./nmo-const-join authors.rls
  Time (mean ± σ):      3.997 s ±  0.599 s    [User: 3.985 s, System: 0.011 s]
  Range (min … max):    2.831 s …  4.370 s    10 runs

The point is that filtering happens too late in the pipeline: the optimisation in the filter code only helps for computed variables, and it is still inefficient for variables that can be joined on, because the join has already materialised every non-matching row by the time the filter runs.
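
To make the difference in intermediate result size concrete, here is a minimal Rust sketch of the two plans for a body like `works(?x, "author", ?arr), works(?arr, ?j, ?id)`. It is not Nemo's actual code: the `Triple` type, the helper `index_by_subject`, and the functions `join_then_filter` and `constant_first` are all made up for illustration, and constants are assumed to be already dictionary-encoded as `u32`.

```rust
use std::collections::HashMap;

/// Hypothetical row shape for the `works` predicate: (subject, key, value).
type Triple = (u32, u32, u32);

/// Build a hash index from subject to all triples with that subject.
fn index_by_subject(works: &[Triple]) -> HashMap<u32, Vec<Triple>> {
    let mut index: HashMap<u32, Vec<Triple>> = HashMap::new();
    for t in works {
        index.entry(t.0).or_default().push(*t);
    }
    index
}

/// Plan A: join first, apply the constant as a filter afterwards.
/// Returns the result rows and the number of joined tuples produced.
fn join_then_filter(works: &[Triple], key: u32) -> (Vec<(u32, u32)>, usize) {
    let index = index_by_subject(works);
    let mut intermediate = 0usize;
    let mut out = Vec::new();
    for &(x, k, arr) in works {
        if let Some(children) = index.get(&arr) {
            for &(_, _, id) in children {
                intermediate += 1; // tuple is produced before the filter sees it
                if k == key {
                    out.push((x, id));
                }
            }
        }
    }
    out.sort();
    (out, intermediate)
}

/// Plan B: treat the constant as part of the join, so rows with a
/// different key never reach the inner loop.
fn constant_first(works: &[Triple], key: u32) -> (Vec<(u32, u32)>, usize) {
    let index = index_by_subject(works);
    let mut intermediate = 0usize;
    let mut out = Vec::new();
    for &(x, k, arr) in works {
        if k != key {
            continue; // pruned before any join work is done
        }
        if let Some(children) = index.get(&arr) {
            for &(_, _, id) in children {
                intermediate += 1;
                out.push((x, id));
            }
        }
    }
    out.sort();
    (out, intermediate)
}
```

Both functions return the same rows; the second never produces more joined tuples than the first, and typically far fewer when only a small fraction of the triples carry the constant key.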