precog / platform

Advanced Analytics Engine for NoSQL Data
http://www.slamdata.com
GNU Affero General Public License v3.0

Indexing needed #533

Open harrisminer opened 10 years ago

harrisminer commented 10 years ago

Our data is approaching 200GB and our query times are increasing substantially. It seems that without indexing it's doing full table scans of the data. This is a pretty high priority for us, since our data will soon be growing at 5GB/day. Thanks.

jdegoes commented 10 years ago

What are the filters you use in queries? Those will determine the kinds of indexing we need to implement in order to accelerate this use case.

harrisminer commented 10 years ago

Mostly we filter by date or job title. Very simple WHERE statements.

For example: where date="11/12/2013" and where jobtitle="Database Administrator"

jdegoes commented 10 years ago

Are "date" and "jobtitle" derived attributes (computed on-the-fly in the Quirrel script) or are they present in the loaded data?

harrisminer commented 10 years ago

They are both present in the data.

djspiewak commented 10 years ago

Working a bit on this right now. It's a lot more complicated than it looks on the surface. The fact that the attributes are present in the data makes it slightly easier, but unfortunately not by a lot. No promises as to when I'll have something working, since I'm essentially doing this in my free time, but I'll try to float at least an experimental branch as soon as I can.

jdegoes commented 10 years ago

@djspiewak I think persistent indexes are required here. Either (a) the ability to dynamically add an index to some jpath(s), which is updated on ingest; or (b) the automatic indexing of all jpath(s) on ingest.

The need for persistence comes from the fact that the above sorts of string literals are hard-coded into the scripts, not produced dynamically; each script skims over only a small fraction of the data, so on-the-fly indexing is not useful here.

Nor is block-level indexing going to be particularly useful, because in general the matching data will be scattered randomly among all blocks. So basically, fine-grained, persistent indexing is needed, which leads to choices (a) and (b) above.

Several recent query platforms (JethroData and Sqrrl) have made the choice to always index on all columns. For append-only data stores like Precog, this is actually pretty straightforward from an architectural standpoint (far simpler than supporting composite indexes and dynamically adding/removing indexes) and reasonably performant.
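
For concreteness, a minimal sketch of what option (b) might look like at the storage layer, using simplified stand-in types (plain string values, Long row ids) rather than the actual yggdrasil/NIHDB structures:

```scala
import scala.collection.immutable.SortedMap

// Illustrative stand-ins only; the real JPath/CValue/block types are richer.
object JPathIndexSketch {
  final case class JPath(nodes: List[String])
  type RowId = Long

  // Option (b): one sorted (value -> row ids) index per JPath, maintained at
  // ingest time. Append-only storage means there are no deletes to handle.
  final class JPathIndexes private (indexes: Map[JPath, SortedMap[String, Set[RowId]]]) {

    // Called for every (path, value) pair of every ingested record.
    def insert(path: JPath, value: String, row: RowId): JPathIndexes = {
      val idx     = indexes.getOrElse(path, SortedMap.empty[String, Set[RowId]])
      val updated = idx.updated(value, idx.getOrElse(value, Set.empty[RowId]) + row)
      new JPathIndexes(indexes.updated(path, updated))
    }

    // where .path = value
    def lookupEq(path: JPath, value: String): Set[RowId] =
      indexes.get(path).flatMap(_.get(value)).getOrElse(Set.empty[RowId])

    // where lo <= .path <= hi (a real version would use the sorted structure's
    // range query rather than filtering every key)
    def lookupRange(path: JPath, lo: String, hi: String): Set[RowId] =
      indexes.get(path).toList
        .flatMap(_.collect { case (k, rows) if k >= lo && k <= hi => rows })
        .flatten
        .toSet
  }

  object JPathIndexes {
    val empty: JPathIndexes = new JPathIndexes(Map.empty)
  }
}
```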

djspiewak commented 10 years ago

@jdegoes

You're focusing on the wrong problem here. Storage-wise, indexing is quite easy. NIHDB already includes support for indexing, and it's essentially just sitting there waiting for us to tell it to store things in different arrangements and with different access patterns. The hard part is parsing out the load predicates in the evaluator. It is very, very, very difficult not only to compute these predicates but also to transform them appropriately as you push them up the DAG. Even just representing this information in a form which a) doesn't tangle unrelated concerns, b) doesn't redundantly represent identical subalgebrae, and c) is even available to the evaluator at the load site is very, very tricky.

So whether we index all columns or none, whether we decide to index at ingest time or make a statistical determination later, or even whether the index is based on a property or a derived value doesn't matter. All of those problems are truly quite trivial relative to just getting the evaluator to infer even simple index bounds.
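
For concreteness, a toy version of the kind of predicate representation the evaluator would need to compute and carry toward loads might look like the sketch below; the names are purely illustrative and don't correspond to anything in the code base:

```scala
// Illustrative only: a toy predicate algebra of the sort the evaluator would
// have to compute and push toward the load site.
object PredicateSketch {
  final case class JPath(nodes: List[String])

  sealed trait Predicate
  final case class Eq(path: JPath, value: String)             extends Predicate
  final case class Range(path: JPath, lo: String, hi: String) extends Predicate
  final case class And(conjuncts: Set[Predicate])             extends Predicate

  // Flattening nested conjunctions into a Set is one cheap way to avoid
  // representing identical subalgebrae more than once.
  def and(ps: Predicate*): Predicate = {
    val flat = ps.toSet.flatMap {
      case And(cs) => cs
      case other   => Set[Predicate](other)
    }
    if (flat.size == 1) flat.head else And(flat)
  }

  // where date = "11/12/2013" and jobtitle = "Database Administrator"
  val ticketExample: Predicate = and(
    Eq(JPath(List("date")), "11/12/2013"),
    Eq(JPath(List("jobtitle")), "Database Administrator")
  )
}
```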

jdegoes commented 10 years ago

@djspiewak I actually think indexing is trickier and would involve more code, but admittedly, that's probably because I have more experience with DAG optimizations.

There's a pretty standard way of handling this problem that relies on defining a commutativity algebra: for two sequentially applied operations A B, establishing (a) whether or not they commute (which is just a boolean relation on all operation types), and (b) commuting them to produce B' A' (which involves defining commutators).

Once you have defined a commutativity algebra, then getting a basic version of this working involves pushing filters up to loads, and then taking advantage of a storage-specific loading mechanism (where possible) for filtering to defined inclusive ranges of JSON values along particular JPaths.

(You can see an implementation of this general approach in an early version of BlueEyes' MockMongo, now gone from this world.)
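
A minimal sketch of that algebra, using toy operation types rather than Precog's actual DAG nodes, might look like this; the commutator answers (a) and (b) at once, returning None when a pair does not commute and Some((b1, a1)) for the rewritten pair B' A':

```scala
// Toy operation types standing in for DAG nodes; illustration only.
object CommutativitySketch {
  sealed trait Op
  final case class Load(dataset: String)               extends Op
  final case class Filter(path: String, value: String) extends Op
  final case class Project(paths: Set[String])         extends Op

  // The only rewrite rule defined here: a Filter commutes backwards past a
  // Project as long as the projection keeps the filtered path. Everything
  // else is conservatively treated as non-commuting.
  def commutator(first: Op, second: Op): Option[(Op, Op)] = (first, second) match {
    case (p @ Project(ps), f @ Filter(path, _)) if ps.contains(path) => Some((f, p))
    case _                                                           => None
  }

  // One left-to-right pass that pulls each Filter toward the Load at the head
  // of a purely linear pipeline (no forks or joins, which is exactly where the
  // real difficulty begins).
  private def onePass(ops: List[Op]): List[Op] = ops match {
    case a :: (f: Filter) :: rest =>
      commutator(a, f) match {
        case Some((b1, a1)) => b1 :: onePass(a1 :: rest)
        case None           => a :: onePass(f :: rest)
      }
    case a :: rest => a :: onePass(rest)
    case Nil       => Nil
  }

  // Iterate to a fixpoint so filters bubble past arbitrarily many operations.
  @annotation.tailrec
  def pushDown(ops: List[Op]): List[Op] = {
    val next = onePass(ops)
    if (next == ops) ops else pushDown(next)
  }

  // pushDown(List(Load("/jobs"), Project(Set("date", "jobtitle")),
  //               Filter("jobtitle", "Database Administrator")))
  // ==> List(Load("/jobs"), Filter("jobtitle", "Database Administrator"),
  //          Project(Set("date", "jobtitle")))
}
```

Note that the single rewrite rule only fires when the projection keeps the filtered path; having to write that guard down explicitly is exactly what a formal commutator table buys you.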

Things become much more complicated if you extend the algebra to support fork and join (that is, if A B can sometimes be combined to C, and a given operation C can sometimes be forked to produce A B), and if you add many goal-driven optimizations (not just "predicate pushdown", as this one is called), in which case meeting one goal may come at the cost of other goals, which requires a non-trivial cost optimizer.

I believe right now we only have projection pushdown. If we had a well-defined commutativity algebra for this, we could use the same framework for both projection pushdown and predicate pushdown (and of course lots of others).

What approach are you taking, or don't you know yet?

djspiewak commented 10 years ago

> Things become much more complicated if you extend the algebra to support fork and join (that is, if A B can sometimes be combined to C, and a given operation C can sometimes be forked to produce A B)

Yes, and this is part 1 of why index inference is very difficult here. I'm obviously aware of commutativity algebrae. We use them fairly extensively in other optimizations and in various parts of yggdrasil. The problem is that forking and joining is what the DAG does non-stop, and to make matters worse, the forking is implicit and cannot be inferred by looking at a single node (or even a pair of nodes). We can't simply ignore this property in order to handle some subset of cases, because any case involving a where clause is definitionally a join, and any join which is transpecable (a requisite for useful predicate inference) and non-constant must have at least one fork above it (the proof for this is trivial).
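
To make the implicit fork concrete, here's this ticket's filter rendered against a toy node type (not the actual dag.* classes; the dataset name is hypothetical):

```scala
// Illustration only. Roughly, in Quirrel:
//
//   jobs := //jobs
//   jobs where jobs.jobtitle = "Database Administrator"
//
// `jobs` is referenced twice: once as the thing being filtered and once inside
// the predicate. That shared reference is the implicit fork, and the `where`
// itself is a join between the data and the boolean predicate.
object ForkJoinSketch {
  sealed trait Node
  final case class Load(path: String)                   extends Node
  final case class Deref(parent: Node, field: String)   extends Node
  final case class Const(value: String)                 extends Node
  final case class Eq(left: Node, right: Node)          extends Node
  final case class FilterJoin(target: Node, pred: Node) extends Node

  val load: Node = Load("/jobs")

  // The fork is invisible from any single node: you only see it by noticing
  // that `load` is reachable along two different paths from the filter.
  val query: Node =
    FilterJoin(
      target = load,
      pred   = Eq(Deref(load, "jobtitle"), Const("Database Administrator"))
    )
}
```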

That's part 1. Part 2 is that we do not have a closed algebra of operations. We have an open algebra where each operation can be classed with restricted and well-defined properties. These properties, in most cases, allow us some flexibility for distribution and commutativity, but not all of the time. We have no particularly good way of restricting this set of operations, and while we can simply drop cases which involve unsupported operations, doing so will so radically reduce the utility of predicate inference as to make it almost not worth the effort. I think you would be very surprised how few queries would be indexable if we restricted to just the fundamental mathematical operators and dereference.

And finally of course, we have the ever-present problem that the predicate representation passed to Table#load must contain all of the information inferred from the DAG, but in a form which is specific to Table (i.e. via TransSpecs), yet the Evaluator cannot make direct use of TransSpec since it doesn't actually contain the logic for determining what is and is not convertible to a TransSpec. Another way of looking at this issue (that reveals even more of the problem) is that Table needs a predicate algebra optimized for compilation and interpretation, whereas the evaluator requires a predicate algebra optimized for composition.

Oh and also there's the lovely problem of the Evaluator not even seeing the filters which derive from a load before it sees the load itself, due to topological ordering.

All in all, it's tricky. And for the record, the projection algebra does not have to define commutativity, since record catenation is a definitionally commutative operation. Predicate commutativity is an entirely different question, since the commutative form of an operation A over some operation B is a function of both operation A and operation B, and may not be defined at all.

What DAG optimizations have you written?

tixxit commented 10 years ago

Miles Sabin wrote a predicate pull-up, right? It was disabled because it broke some tests. Does anyone know why that broke and, if it did work, would that make indexing easier?

djspiewak commented 10 years ago

> Miles Sabin wrote a predicate pull-up, right? It was disabled because it broke some tests. Does anyone know why that broke and, if it did work, would that make indexing easier?

I looked at it. It's very, very basic. He basically just separates conjunctions within solve clauses and moves them out if they don't depend on other variables. Essentially, this results in pre-filtering. It's disabled both because it's broken (can't remember what the bugs are) and because it actually resulted in worse performance than allowing the grouping algorithm to handle the filtering.

I'll eventually have to fix that code and re-enable it as a pre-pass in order to start getting some measure of indexing into solve evaluation, but right now it doesn't really help me. :-(
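
For reference, the conjunct-splitting described above amounts to roughly the following sketch, with a simplified expression type rather than the actual solve-rewrite code:

```scala
// Separate the constraints that mention solve (tic) variables from those that
// don't, so the latter can be applied as a pre-filter before grouping runs.
object SolvePrefilterSketch {
  sealed trait Expr { def freeVars: Set[String] }
  final case class TicVar(name: String)  extends Expr { def freeVars: Set[String] = Set(name) }
  final case class Field(path: String)   extends Expr { def freeVars: Set[String] = Set.empty }
  final case class Const(value: String)  extends Expr { def freeVars: Set[String] = Set.empty }
  final case class Eq(l: Expr, r: Expr)  extends Expr { def freeVars: Set[String] = l.freeVars ++ r.freeVars }
  final case class And(l: Expr, r: Expr) extends Expr { def freeVars: Set[String] = l.freeVars ++ r.freeVars }

  private def conjuncts(e: Expr): List[Expr] = e match {
    case And(l, r) => conjuncts(l) ++ conjuncts(r)
    case other     => List(other)
  }

  /** Returns (conjuncts safe to hoist out as pre-filters, conjuncts that must stay in the solve). */
  def split(constraint: Expr, solveVars: Set[String]): (List[Expr], List[Expr]) =
    conjuncts(constraint).partition(c => c.freeVars.intersect(solveVars).isEmpty)
}
```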

jdegoes commented 10 years ago

> I'm obviously aware of commutativity algebrae. We use them fairly extensively in other optimizations and in various parts of yggdrasil.

Only in a completely ad hoc fashion. There is no general-purpose commutativity framework or machinery anywhere in the code base, and that may be the number one obstacle to indexing, because it is not simply a matter of building the machinery (although that needs to be done), but of actually formalizing and probably even refactoring the existing DAG operations.

> Part 2 is that we do not have a closed algebra of operations.

I don't understand (or maybe disagree with) the terminology you're using, but I understand the point you're trying to make. However, it's really not a novel problem; every RDBMS has the same issue. Take a look at the PostgreSQL code base, or any other RDBMS that takes a similar approach to optimization, to see how they solve it (basically, you just accept the fact that you can't always reach your goals, because some operations simply can't be rearranged).

> Table needs a predicate algebra optimized for compilation and interpretation, whereas the evaluator requires a predicate algebra optimized for composition.

That's interesting, but I think it may be over-engineered (or maybe cut awkwardly). I think Table just needs a load function that accepts JSON ranges for different columns. That's what any data storage back-end that supports indexes will want. It shouldn't be too hard to get there for the simple case that prompted this ticket (conjunctions with static JSON values applied in a filter immediately after loading -- a remarkably straightforward use case for predicate pushdown).
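
Concretely, the constrained load might look something like this hypothetical sketch (none of these types exist in the code base today; it's just the shape of the idea):

```scala
// Hypothetical only: inclusive JSON-value ranges keyed by JPath, which is all
// the storage layer needs for the conjunctive case described in this ticket.
object LoadConstraintSketch {
  final case class JPath(nodes: List[String])

  // An inclusive range along one path; None means unbounded on that side.
  // Equality is just a degenerate range with lo == hi.
  final case class ValueRange(lo: Option[String], hi: Option[String])

  final case class LoadConstraints(ranges: Map[JPath, ValueRange]) {
    def isUnconstrained: Boolean = ranges.isEmpty
  }

  // Not Precog's real Table trait; just the constrained-load shape.
  trait Table {
    def load(dataset: String): Table
    def load(dataset: String, constraints: LoadConstraints): Table
  }

  // where date = "11/12/2013" and jobtitle = "Database Administrator"
  val ticketExample: LoadConstraints = LoadConstraints(Map(
    JPath(List("date"))     -> ValueRange(Some("11/12/2013"), Some("11/12/2013")),
    JPath(List("jobtitle")) -> ValueRange(Some("Database Administrator"), Some("Database Administrator"))
  ))
}
```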

> And for the record, the projection algebra does not have to define commutativity,

That's actually false, or rather, coincidentally true because of the current set of operations supported by Quirrel.

If I introduced, say, a pivot operation, you would be unable to push projection beyond the pivot. A sensibly designed commutativity framework would prevent this error, since you would have to add a commutator for projection and pivoting to make that possible, whereas no doubt the current code will blindly push projection down to the load in all cases, even in those that don't make sense. Formalize, formalize, formalize.

> What DAG optimizations have you written?

None in Precog, but I'm the original author of the query optimizations in MockMongo, I developed a formal commutativity algebra and framework for Una (and am currently working on one for a data processing language), and I've studied most of the open source databases and query engines out there.

That said, you've convinced me that a general-purpose solution to predicate pushdown is a bigger problem than I imagined, mainly because I don't think the DAG operations are completely formalized or factored quite right (I'm specifically thinking about bucketing here), and that's really a prerequisite to developing general-purpose pushdown / pullup machinery.