Better selectivity estimation

GoogleCodeExporter commented 9 years ago

Better selectivity estimation, i.e., collect meaningful statistics for, e.g., 
triple patterns and join patterns.

Original issue reported on code.google.com by andreas.josef.wagner on 22 Nov 2013 at 12:21

GoogleCodeExporter commented 9 years ago

Hi Andreas,
what do you think about using JMX for those (and other similar) purposes?

Andrea

Original comment by a.gazzarini@gmail.com on 25 Jan 2014 at 8:59

GoogleCodeExporter commented 9 years ago

Hi Andrea,

thanks for looking into this.

I probably should have made this issue much more specific. In our current 
implementation we only have heuristic-based selectivity estimation [1]. This 
implementation is mainly based on [2] and takes some ideas from the paper in [3].
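
To illustrate what a heuristic-based estimator of this kind might look like, here is a minimal Python sketch. It is not the code in [1] or [2]; the name `heuristic_cardinality` and the numbers in `BOUND_FACTOR` are made up for illustration, and the only idea it shows is that a pattern with more bound positions gets a smaller estimate without looking at the data.

```python
from typing import Optional, Tuple

# A triple pattern as (subject, predicate, object); None marks a variable.
TriplePattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Hypothetical factors: the fraction of triples assumed to survive when the
# given position is bound. These values are illustrative assumptions only.
BOUND_FACTOR = {"s": 0.01, "p": 0.3, "o": 0.05}


def heuristic_cardinality(pattern: TriplePattern, total_triples: int) -> float:
    """Estimate the number of matches using fixed per-position factors."""
    estimate = float(total_triples)
    for position, term in zip("spo", pattern):
        if term is not None:
            # A bound position shrinks the estimate by its fixed factor.
            estimate *= BOUND_FACTOR[position]
    return estimate
```

With these assumed factors, a pattern with a bound subject is ranked as more selective than one with only a bound predicate, regardless of what the store actually contains. That is exactly the weakness real statistics would address.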

Unfortunately, our SPARQL performance is not "too good" - as pointed out by our 
recent benchmark [4]. So, one way to improve this would be to create better 
query plans via a more accurate selectivity estimation.
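
As a rough sketch of how estimates feed into a plan, the Python snippet below orders triple patterns greedily: it always picks the cheapest remaining pattern that shares a variable with the patterns already chosen. The function names and the `estimate` callback are hypothetical, and this is only one simple ordering strategy, not what cumulusRDF or Sesame actually does.

```python
from typing import Callable, List, Set, Tuple

# Terms starting with "?" are variables; everything else is a constant.
Pattern = Tuple[str, str, str]


def variables(pattern: Pattern) -> Set[str]:
    """Return the variable names used in a triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def greedy_join_order(
    patterns: List[Pattern], estimate: Callable[[Pattern], float]
) -> List[Pattern]:
    """Order patterns by estimated cardinality, preferring connected ones."""
    remaining = list(patterns)
    ordered: List[Pattern] = []
    bound: Set[str] = set()
    while remaining:
        # Prefer patterns that join with what is already in the plan,
        # so the plan avoids Cartesian products where possible.
        connected = [p for p in remaining if variables(p) & bound]
        candidates = connected or remaining
        best = min(candidates, key=estimate)
        ordered.append(best)
        remaining.remove(best)
        bound |= variables(best)
    return ordered
```

The quality of the resulting order depends entirely on `estimate`: if it misjudges a pattern, an expensive pattern may be evaluated first.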

In fact, a colleague of mine supervised a master's thesis on this topic, where 
the student implemented a much better estimation for cumulusRDF. However, this 
code is completely untested and done by a student ;) So ... one would have to 
spend some time on it.

The actual problem is how to efficiently create meaningful triple 
pattern (or even join pattern) statistics via Cassandra. There have also been 
some posts on the Cassandra mailing list about this, e.g., [5].
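
To make the idea of triple pattern statistics concrete, here is a small in-memory Python sketch. It keeps one frequency counter per triple position and combines them under an independence assumption. The class `TripleStatistics` and its methods are hypothetical; how to maintain such counts efficiently in Cassandra (for example with counter columns) is exactly the open question and is not addressed here.

```python
from collections import Counter
from typing import Optional, Tuple

Triple = Tuple[str, str, str]
# None marks an unbound (variable) position in a pattern.
TriplePattern = Tuple[Optional[str], Optional[str], Optional[str]]


class TripleStatistics:
    """Per-position term frequencies for subject, predicate and object."""

    def __init__(self) -> None:
        self.total = 0
        self.by_position = (Counter(), Counter(), Counter())

    def add(self, triple: Triple) -> None:
        """Record one inserted triple."""
        self.total += 1
        for counter, term in zip(self.by_position, triple):
            counter[term] += 1

    def estimate(self, pattern: TriplePattern) -> float:
        """Estimate matches, assuming bound positions are independent."""
        if self.total == 0:
            return 0.0
        estimate = float(self.total)
        for counter, term in zip(self.by_position, pattern):
            if term is not None:
                # Multiply by the observed frequency of the bound term.
                estimate *= counter[term] / self.total
        return estimate
```

For a pattern with a single bound position the estimate is exact. With several bound positions it can be off, because the independence assumption ignores correlations between terms; join pattern statistics would need additional information beyond these counters.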

Overall, this is not a trivial problem - however, I think we should target it 
as a long-term goal/issue.

Kind regards
Andreas

[1] edu.kit.aifb.cumulus.store.sel.HeuristicsBasedSelectivityEstimator
[2] org.openrdf.query.algebra.evaluation.impl.EvaluationStatistics
[3] Heuristics-based Query Optimisation for SPARQL
[4] NoSQL Databases for RDF: An Empirical Evaluation
[5] http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Tracking-word-frequencies-td7592285.html

Original comment by andreas.josef.wagner on 26 Jan 2014 at 1:33
