verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0
248 stars 66 forks source link

Scrambling: incorrect statistics given extra predicates #367

Closed pyongjoo closed 5 years ago

pyongjoo commented 5 years ago

Some conversation from Gitter:

Michael @commercial-hippie May 14 01:22
When you create a scramble with a condition verdict still grabs the count on the entire table.. ie.

select count(*) as "verdictdbtotalcount" from "default"."test_table_orc"
Shouldn't the count be only for the conditions specified in the create scramble statement?

Sanjay Kumar @sanjay-94 May 14 01:27
adding on to @commercial-hippie, are the counts necessary to be calculated for the entire table even if i request for 1 partition's scramble to be created?
Yongjoo Park @pyongjoo May 14 14:50
@commercial-hippie What you mentioned--the count should only be for the subset--sounds very reasonable. I should check the code.
Dan Voyce @voycey 00:09
Hey guys - sorry only just made my way into this! As the guys above said this is pretty crucial to us at the moment - our data is reaching 300B rows soon and running a count across all of this is painful - we have been investigating ways around some of the bottlenecks but removing unnecessary full table scans is definitely a must have for us now :)
Yongjoo Park @pyongjoo 12:47
@voycey Thanks for the info. Let us look into it as well.
pyongjoo commented 5 years ago

@dongyoungy Can you take a look at this line?

Its calculation is based on the effective number of rows without considering a predicate, which I don't think is the correct approach. Let me know what you think. Thanks!

dongyoungy commented 5 years ago

As it also says // note that this tableSize does not account for filter condition (WHERE clause) in the source, predicates are not currently considered when creating scrambles.

The fix should be as simple as assigning tableSize with the number of rows in the table after filtering with the predicate while what only matters is how to pass that information effectively.

I will take a look and let you know.

pyongjoo commented 5 years ago

You are right. I saw the comment. I wanted to double-check with you if the fix you suggested would be the correct approach. Let me know if you think including the predicate for tableSize makes sense. Then, I will work on the fix.

dongyoungy commented 5 years ago

I think the suggested approach should work. IIRC, I think it should have taken predicates into account already when we added 'APPEND SCRAMBLE' syntax with conditions, but it was neglected at the time (because it should work when the table is large anyway if we do not consider performance) and I think I wrote the comment because of that.

voycey commented 5 years ago

because it should work when the table is large anyway if we do not consider performance

It definitely does, on Hive this isn't a problem because it can use the generated table statistics, however Presto doesn't use these things and has to read through the table to get the count - currently this takes 30 mins+

Does this count need to be 100% accurate? Predicates will help for sure but I can also foresee a situation where even a partition count will eventually be fairly slow!

pyongjoo commented 5 years ago

The count doesn't have to be 100% accurate, but I'm not sure if there is a general approach to getting an approximate count.

voycey commented 5 years ago

There is different syntax depending on which db is in use and whether they have that ability (Presto and BigQuery do!)

On Sat, 18 May 2019, 04:17 Yongjoo Park, notifications@github.com wrote:

The count doesn't have to be 100% accurate, but I'm not sure if there is a general approach to getting an approximate count.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mozafari/verdictdb/issues/367?email_source=notifications&email_token=AAIEBCS65Y6OD2BKIJ4PARLPV3ZENA5CNFSM4HNOUVI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVVPURI#issuecomment-493550149, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIEBCV5SB4HI7S45GCDEELPV3ZENANCNFSM4HNOUVIQ .

pyongjoo commented 5 years ago

The fix has been merged to master, so I close this issue.