verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0
248 stars 66 forks source link

Scaling error for SUM and COUNT #361

Closed hychen20 closed 5 years ago

hychen20 commented 5 years ago

I created a scramble table with uniform sampling and fraction 0.1 for tpch1g. However, for tpch Q1, the results are as follows:

image

For SUM and COUNT, the answers are affected by the fraction I chose. Do I have to scale the verdict answer by 10 to get the real AQP answer?

pyongjoo commented 5 years ago

No, this is not supposed to happen.

Can you kindly post what queries you have issued for creating scramble tables and issuing queries?

hychen20 commented 5 years ago

Create lineitem_scramble

CREATE SCRAMBLE tpch1g.lineitem_scramble FROM tpch1g.lineitem method uniform ratio 0.1 blocksize 6002;

and Q1

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-12-01' - interval '66' day
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus

and Q1 for VerdictDB

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem_scramble
where
    l_shipdate <= date '1998-12-01' - interval '66' day
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus
hychen20 commented 5 years ago

I'm using Spark and have a spark job for testing it.

pyongjoo commented 5 years ago

Can you also tell me the verdictdb version you included?

I think now the verdict query must include the scramble table itself, not the original table name. It’s unexpected that the table replacement is still performed.

We will check the details tomorrow btw since it’s midnight here.

hychen20 commented 5 years ago

It's verdict core 0.5.8. Sorry I forgot to include the scramble table so I re-edit it above.

I also notice for some other TPCH queries (such Q4), the answers need to be scale by 100. It might because there are joins over two sum/count columns. I'm going to have a look.

Tomorrow is fine. Have a good night.

hychen20 commented 5 years ago

I see the reason. I accidentally dropped the VerdictDB metadata ...