verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0

Which verdictdbblock values are used? #371

Closed: solangepaz closed this issue 5 years ago

solangepaz commented 5 years ago

Is it possible to know which verdictdbblock values VerdictDB uses to execute a query?

pyongjoo commented 5 years ago

Sure. The value of verdictdbblock is used to process only a subset of a scramble (say, 1% of it). The subset itself amounts to a random sample of the original table, so we can compute unbiased estimators using it.

As VerdictDB processes more subsets (associated with different verdictdbblock values), the accuracy of unbiased estimators increases in expectation.
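
As a rough illustration of the idea in plain Python (a toy table and a toy scramble, not VerdictDB's actual scramble-creation code): assigning each row a random verdictdbblock makes any single block a uniform random sample, and scaling its sum by the inverse sampling fraction gives an unbiased estimate of the true sum.

    import random

    random.seed(0)

    # Toy "original table": a single numeric column we want to sum.
    original = [random.uniform(0, 100) for _ in range(100_000)]

    # Toy scramble: assign each row a random block id in 0..99,
    # so every block holds roughly 1% of the rows.
    num_blocks = 100
    scramble = [(value, random.randrange(num_blocks)) for value in original]

    # The rows with verdictdbblock = 0 form a ~1% uniform random sample.
    block0 = [value for value, block in scramble if block == 0]

    # Scaling the block sum by the inverse sampling fraction gives an
    # unbiased estimate of the full-table sum.
    estimate = sum(block0) * num_blocks
    print(f"estimate = {estimate:,.0f}   true sum = {sum(original):,.0f}")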

solangepaz commented 5 years ago

Thanks. And is it possible to know which verdictdbblock values are used while executing a query?

pyongjoo commented 5 years ago

Yes. You can see them both inside the VerdictDB modules and by observing the queries issued to the underlying DB.

In general, they are sequential, i.e., the first query is associated with verdictdbblock=0, the second query with verdictdbblock=1, and so on. These queries may be issued in parallel to reduce the processing time (e.g., for Presto).
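
For intuition only, the per-block queries could look roughly like the following (the SQL VerdictDB actually generates is more involved; the table name and block count here are placeholders):

    # Hypothetical per-block queries, one per verdictdbblock value.
    scramble_table = "lineitem_scramble"
    num_blocks = 3

    queries = [
        f"SELECT SUM(l_extendedprice) FROM {scramble_table} WHERE verdictdbblock = {block}"
        for block in range(num_blocks)
    ]

    for q in queries:
        print(q)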

solangepaz commented 5 years ago

Do all blocks run until convergence is reached?

pyongjoo commented 5 years ago

No. At most 10 blocks are processed at a time. As soon as one finishes, the next one runs, and this continues until convergence. Of course, all blocks may run if your query selects only a few tuples (so convergence is slow).
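
Here is a loose sketch of that policy in plain Python (batches of up to 10 for simplicity rather than a true sliding window, made-up per-block sums, and a toy convergence test; it only illustrates the behavior described above):

    import random

    random.seed(0)

    NUM_BLOCKS = 100
    BATCH = 10  # at most 10 blocks are processed at a time

    # Made-up per-block sums; in reality each comes from one rewritten query
    # against the scramble (WHERE verdictdbblock = ...).
    block_sums = [random.uniform(900, 1100) for _ in range(NUM_BLOCKS)]

    partial_sum = 0.0
    previous_estimate = None
    blocks_done = 0

    while blocks_done < NUM_BLOCKS:
        # Process the next (at most) 10 blocks.
        batch = block_sums[blocks_done:blocks_done + BATCH]
        partial_sum += sum(batch)
        blocks_done += len(batch)

        # Running unbiased estimate, assuming equal-sized blocks.
        estimate = partial_sum / (blocks_done / NUM_BLOCKS)

        # Stop early once the estimate stops moving; a highly selective query
        # may never trigger this, so all blocks would be processed.
        if previous_estimate is not None and \
                abs(estimate - previous_estimate) <= 0.001 * abs(previous_estimate):
            break
        previous_estimate = estimate

    print(f"processed {blocks_done} of {NUM_BLOCKS} blocks, estimate = {estimate:,.2f}")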

solangepaz commented 5 years ago

Can you tell me where in the code this is done? I need to know which blocks were executed before convergence was reached.

pyongjoo commented 5 years ago

If that is your sole purpose, I would simply turn on the DEBUG option and see what queries are being issued.

For Java: see the properties reference at https://docs.verdictdb.org/reference/properties/

For pyverdict: use verdict.set_loglevel('debug'). See https://github.com/mozafari/verdictdb/blob/master/pyverdict/pyverdict/verdictcontext.py#L222
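
For example, with pyverdict it might look like the following (the connection settings and query are placeholders for your environment; only set_loglevel('debug') comes from this thread):

    import pyverdict

    # Placeholder MySQL connection settings; adjust for your environment.
    verdict = pyverdict.mysql(host='localhost', user='root', password='', port=3306)

    # Print the rewritten per-block queries as they are issued.
    verdict.set_loglevel('debug')

    df = verdict.sql('SELECT SUM(l_extendedprice) FROM tpch.lineitem')
    print(df)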

solangepaz commented 5 years ago

Another question: my idea was to copy the blocks that were used into another database and then compute the result there. For example, for the query

select sum(l_extendedprice) from lineitem

blocks 0, 1, and 2 are used and the result through VerdictDB is 2.294500756989E11. But if I run

select sum(l_extendedprice) from lineitem_scramble where verdictdbblock = 0 or verdictdbblock = 1 or verdictdbblock = 2;

the result with MySQL (for example) is 98360078016.67. How can I get the 2.294500756989E11 value from the 98360078016.67 value?

solangepaz commented 5 years ago

Is it possible to do this through cumulative distributions?

pyongjoo commented 5 years ago

Let me answer this in 10 mins...

pyongjoo commented 5 years ago

So, the scaling factors are obtained from the cumulative distribution stored in the metadata table.

Example: cumul = [0.2, 0.5, 1.0] means 20% of the data is in block0, 30% is in block1, and 50% is in block2.

In this case, scaling is performed as follows.

  1. "sum(col) from block0" is scaled by (1/0.2) to produce an unbiased estimator
  2. "sum(col) from block0 and block1" is scaled by (1/0.5) to produce an unbiased estimator
  3. "sum(col) from block0 and block1 and block2" is scaled by (1/1) to produce an unbiased estimator

I believe the actual logic is slightly more sophisticated, but the basics are still the same.
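
A minimal sketch of that scaling rule in plain Python, using the cumulative distribution from the example above (the raw sums are made up; VerdictDB's real logic is more involved, as noted):

    # Cumulative probability distribution of the scramble blocks:
    # block0 holds 20% of the data, block1 30%, and block2 50%.
    cumul = [0.2, 0.5, 1.0]

    # Raw sums of the aggregate column computed over block0, block0+block1,
    # and block0+block1+block2 (made-up numbers for illustration).
    raw_sums = [210.0, 480.0, 1010.0]

    for i, (prob, raw) in enumerate(zip(cumul, raw_sums)):
        scale = 1.0 / prob      # scaling factor for blocks 0..i
        estimate = raw * scale  # unbiased estimate of the full-table sum
        print(f"blocks 0..{i}: raw sum = {raw}, scale = {scale:.2f}, estimate = {estimate:.1f}")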

solangepaz commented 5 years ago

OK, thank you. So if I scale the per-block query result by the scale factor, should I get a value close to the real one? And how does this work for joins?

pyongjoo commented 5 years ago

Yes to the first question.

For joins,

  1. We first compute the probability for each pair of joined blocks, e.g., (block0 from left, block0 from right), (block0 from left, block1 from right), and so on. The probability of each joined pair is the product of the two individual probabilities. For example, if prob(block0 from left) = 0.2 and prob(block0 from right) = 0.1, then prob(block0 from left, block0 from right) = 0.2 * 0.1 = 0.02.
  2. Then, scale factor = 1 / (cumulative probability of the processed pairs), which is the same as in the single-table case.
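
A small sketch of that join scaling in plain Python (hypothetical per-block probabilities and a hypothetical set of processed block pairs; VerdictDB's real logic is more involved):

    # Per-block probabilities (not cumulative) for the left and right scrambles
    # of a join; hypothetical values.
    left_probs = [0.2, 0.3, 0.5]
    right_probs = [0.1, 0.9]

    # Probability of each joined block pair = product of the individual probabilities.
    pair_prob = {
        (l, r): left_probs[l] * right_probs[r]
        for l in range(len(left_probs))
        for r in range(len(right_probs))
    }

    # Suppose these joined pairs have been processed so far.
    processed = [(0, 0), (0, 1), (1, 0)]

    # Cumulative probability of the processed pairs, and the resulting scale factor.
    cumul = sum(pair_prob[pair] for pair in processed)
    scale = 1.0 / cumul

    raw_join_sum = 123.4  # made-up aggregate computed over the processed pairs
    print(f"cumulative probability = {cumul:.3f}, scale factor = {scale:.2f}")
    print(f"unbiased estimate = {raw_join_sum * scale:.1f}")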

solangepaz commented 5 years ago

Thank you. And another question: how does VerdictDB know which blocks to use?

pyongjoo commented 5 years ago

VerdictDB uses block0, block1, block2, and so on.

As it combines the results from different blocks, VerdictDB performs error estimation, and if the error is small enough (or the entire data has been processed), VerdictDB returns a result.

solangepaz commented 5 years ago

How is this error estimation calculated?

solangepaz commented 5 years ago

For example, for a query with a GROUP BY, how do I get a value close to the real one? If I multiply by the individual probabilities, I do not get that value.

pyongjoo commented 5 years ago

For error estimation, VerdictDB checks whether processing new blocks of data changes the estimates much. If the difference is small, VerdictDB considers the estimates accurate.
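
A simplified stand-in for that check (a relative-change test on successive running estimates, with made-up numbers; the exact criterion inside VerdictDB may differ):

    def estimates_converged(previous, current, rel_threshold=0.01):
        """Toy check: did adding new blocks move the estimate by less than 1%?"""
        if previous is None:
            return False
        return abs(current - previous) <= rel_threshold * abs(previous)

    # Running estimates after processing 1, 2, 3, ... blocks (made-up numbers).
    running_estimates = [7100.0, 7420.0, 7485.0, 7490.0]

    previous = None
    for blocks, estimate in enumerate(running_estimates, start=1):
        if estimates_converged(previous, estimate):
            print(f"converged after {blocks} blocks at estimate {estimate}")
            break
        previous = estimate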

For scaling the aggregate values for group-by queries, there is no difference. If you can provide an example, we can have a more concrete discussion.

solangepaz commented 5 years ago

For example, in this more "complex" query:

SELECT
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
FROM
    customer,
    orders,
    lineitem
WHERE
    c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < date '1995-03-15'
    AND l_shipdate > date '1995-03-15'
GROUP BY
    l_orderkey,
    o_orderdate,
    o_shippriority
ORDER BY
    revenue desc,
    o_orderdate
LIMIT 20;

How are calculations done to get a value close to the real?

pyongjoo commented 5 years ago

Let's think about how to scale the revenue value for l_orderkey = 1 and o_orderdate = '1998-01-01' and o_shippriority = HIGH.

Suppose:

  1. we have processed 40% of the original data (e.g., the sampling ratio is 100%, verdictdbblock ranges from 0 to 99, and we have processed verdictdbblock values 0 through 39).
  2. sum(l_extendedprice * (1 - l_discount)) is $3000 (for the group) after processing that 40% of the data

Then, an unbiased estimator for the total sum (for the group) is $3000 / 0.4 = $7500, according to the Horvitz-Thompson Estimator.
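
In code form, the same scaling applied per group might look like this (made-up group keys and partial sums; 0.4 is the processed fraction from the example above):

    # Partial per-group sums of l_extendedprice * (1 - l_discount),
    # computed over the processed 40% of the data (made-up numbers).
    partial_group_sums = {
        (1, '1998-01-01', 'HIGH'): 3000.0,
        (2, '1998-01-02', 'LOW'): 1200.0,
    }

    fraction_processed = 0.4  # cumulative probability of the processed blocks

    # Horvitz-Thompson style scaling: divide each group's partial sum by the
    # processed fraction to get an unbiased estimate of its total.
    estimated_group_sums = {
        group: partial / fraction_processed
        for group, partial in partial_group_sums.items()
    }

    print(estimated_group_sums[(1, '1998-01-01', 'HIGH')])  # 7500.0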

solangepaz commented 5 years ago

This is not working for the blocks I have, for example:

lineitem_scramble "storedProbDist" ~= [0.15, 0.29, 0.43, 0.57, 0.71, 0.86, 1]
orders_scramble "storedProbDist" = [0.5, 1]
customer_scramble "storedProbDist" = [1]

If I run the above query for the blocks lineitem_scramble = 0, orders_scramble = 0, customer_scramble = 0, I have this result:

+------------+-------------+-------------+----------------+
| l_orderkey | revenue     | o_orderdate | o_shippriority |
+------------+-------------+-------------+----------------+
| 2643618    | 213289.7323 | 1995-03-10  | 0              |
| 4389088    | 170021.9012 | 1995-02-28  | 0              |
| 5206948    | 165405.0148 | 1995-03-14  | 0              |
+------------+-------------+-------------+----------------+

Calculations for estimating revenue would be, for example for the first group: 213289.7323 / (0.14 * 0.5 * 1) = 2986056.252

If I run the above query for the blocks lineitem_scramble = 1, orders_scramble = 1, orders_scramble = 0, I have this result:

+------------+-------------+-------------+----------------+
| l_orderkey | revenue     | o_orderdate | o_shippriority |
+------------+-------------+-------------+----------------+
| 5664482    | 204080.1089 | 1995-03-14  | 0              |
| 4460578    | 204073.5825 | 1995-03-14  | 0              |
| 5844161    | 203799.2870 | 1995-03-05  | 0              |
+------------+-------------+-------------+----------------+

Calculations for estimating revenue would be, for example for the first group: 204080.1089 / (0.29 * 1 * 1) = 714280.381

Am I correct? The values for the original tables give:

+------------+-------------+-------------+----------------+
| l_orderkey | revenue     | o_orderdate | o_shippriority |
+------------+-------------+-------------+----------------+
| 2456423    | 406181.0111 | 1995-03-05  | 0              |
| 3459808    | 405838.6989 | 1995-03-04  | 0              |
| 492164     | 390324.0610 | 1995-02-19  | 0              |
+------------+-------------+-------------+----------------+

pyongjoo commented 5 years ago

This is an expected result. Since there are only a few tuples that include the same l_orderkey, the aggregates based on a small sample (say 1%) cannot produce accurate-enough results (as you have observed). In this situation, VerdictDB chooses to process almost all data to guarantee high accuracy.