verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0

Scramble Appending #302

Closed commercial-hippie closed 5 years ago

commercial-hippie commented 5 years ago

Will there be a way to append to a scramble in the future?

i.e., if we have a table that has data added on a daily basis, do we need to drop the scramble and re-create it?

I was thinking we might be able to just manually copy the query used to create the scramble (select from) and do an INSERT (SELECT FROM WHERE 'new data conditions').

Could that work or would it mess with the calculations Verdict does?

Thanks! Mike

pyongjoo commented 5 years ago

Supporting it will be straightforward, although we don't have it right now.

This may require inserting some extra metadata (into a verdict-managed table) to record the sizes of the original tables. To see why, consider two tables, A and B. Suppose we chose 100 of A's 1000 tuples and 100 of B's 2000 tuples. Then A is sampled with a higher probability (100/1000) than B (100/2000), so we need to correct for this bias.
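To make the bias concrete, here is a minimal sketch of the arithmetic in Python. It is not Verdict's code: the function name is made up for illustration, and it assumes the usual correction of scaling each sampled tuple by the inverse of its table's sampling probability, which is why the original table sizes have to be recorded.

```python
def scale_factor(sample_size: int, original_size: int) -> float:
    """Inverse sampling probability: how many original tuples one sampled tuple stands for."""
    return original_size / sample_size


# Numbers from the example: 100 of A's 1000 tuples, 100 of B's 2000 tuples.
factor_a = scale_factor(100, 1000)
factor_b = scale_factor(100, 2000)

# Counting sampled tuples directly treats A and B as if they were the same size.
naive_count = 100 + 100

# Weighting each sample by its own factor recovers the true combined size.
corrected_count = 100 * factor_a + 100 * factor_b

print(factor_a, factor_b, naive_count, corrected_count)
```

Without the recorded sizes (1000 and 2000), the two factors cannot be computed, and tuples from A would be over-represented relative to tuples from B.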

Please let us know if you have decided to use daily insertions (e.g., as a new partition or so). I tentatively label this issue as 'feature request'.

commercial-hippie commented 5 years ago

@pyongjoo we will definitely use daily insertions when it becomes an available feature. I might look into doing this manually in about 2 weeks.

I was thinking of just cloning the original scramble insert query (create table as select from) and doing something like:

  1. Insert the data - INSERT INTO verdictdb_scrambles.table_name SELECT FROM (copied from original insert query) WHERE data > date_since_last_update.
  2. Update the table data in the verdictdbmeta table.
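The two steps above could look roughly like the sketch below. It runs against an in-memory SQLite database purely for illustration; the table names, the `block` column, the `scramble_meta` table, and `SCRAMBLE_SELECT` are hypothetical stand-ins, not Verdict's actual scramble layout or metadata schema.

```python
import sqlite3

# Stand-in for the SELECT copied from the original scramble-creation query.
# All table and column names here are hypothetical.
SCRAMBLE_SELECT = (
    "SELECT id, amount, sale_date, (random() % 10 + 10) % 10 AS block FROM sales"
)

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER, amount REAL, sale_date TEXT)")
cur.execute(
    "CREATE TABLE sales_scramble (id INTEGER, amount REAL, sale_date TEXT, block INTEGER)"
)
cur.execute(
    "CREATE TABLE scramble_meta "
    "(scramble_name TEXT PRIMARY KEY, original_rows INTEGER, last_update TEXT)"
)


def refresh_meta(cur):
    # Step 2: record the current size of the original table and the newest date seen.
    cur.execute(
        "INSERT OR REPLACE INTO scramble_meta "
        "SELECT 'sales_scramble', COUNT(*), MAX(sale_date) FROM sales"
    )


def append_to_scramble(cur, since):
    # Step 1: reuse the scramble query, restricted to rows newer than the last update.
    cur.execute(
        f"INSERT INTO sales_scramble {SCRAMBLE_SELECT} WHERE sale_date > ?", (since,)
    )
    refresh_meta(cur)


# Initial load and initial scramble.
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 10.0, "2019-01-01"), (2, 20.0, "2019-01-01"), (3, 30.0, "2019-01-02")],
)
cur.execute(f"INSERT INTO sales_scramble {SCRAMBLE_SELECT}")
refresh_meta(cur)

# A new day of data arrives; append only the new rows.
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(4, 40.0, "2019-01-03"), (5, 50.0, "2019-01-03")],
)
last_update = cur.execute("SELECT last_update FROM scramble_meta").fetchone()[0]
append_to_scramble(cur, last_update)
conn.commit()
```

This assumes the scramble settings stay the same between runs. Note that the strict `>` comparison would miss rows that arrive late with a date equal to the recorded last update.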

Would that work?

Our schemas or scramble sizes won't be changing, so doing it manually for now is not a problem for me. :smile:

pyongjoo commented 5 years ago

That will certainly work, but I don't think implementing the same logic inside Verdict will be difficult either.

Let me have some discussions with @dongyoungy about its implementation plan.