verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0

Add support for 'scramble appending' (issue #302) #327

Closed dongyoungy closed 5 years ago

dongyoungy commented 5 years ago

(Resolves #302. This PR is based on PR #326.) Added the 'INSERT SCRAMBLE ...' syntax to support scramble appending.

The current implementation supports statements of the following form:

INSERT SCRAMBLE <existing_scramble_table_schema>.<existing_scramble_table_name> WHERE <condition>

This command appends data to an existing scramble. It applies the same scrambling method as the existing scramble to the source rows that satisfy the given condition.

As a bonus feature, I also added conditional scramble creation like the following:

CREATE SCRAMBLE <scramble_table> FROM <original_table> [WHERE <condition>]
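
To make the two statement forms above concrete, here is a minimal sketch that assembles them as strings. The helper class, the method names, and the schema, table, and column names are all hypothetical and only for illustration; they are not part of VerdictDB. The resulting strings would be sent through whatever VerdictDB connection you already use.

```java
// Hypothetical helper for illustration only; not part of VerdictDB.
final class ScrambleStatements {

    // Builds: INSERT SCRAMBLE <schema>.<table> WHERE <condition>
    static String insertScramble(String schema, String table, String condition) {
        return "INSERT SCRAMBLE " + schema + "." + table + " WHERE " + condition;
    }

    // Builds: CREATE SCRAMBLE <scramble_table> FROM <original_table> [WHERE <condition>]
    // Pass null as the condition to omit the optional WHERE clause.
    static String createScramble(String scrambleTable, String originalTable, String condition) {
        String sql = "CREATE SCRAMBLE " + scrambleTable + " FROM " + originalTable;
        return condition == null ? sql : sql + " WHERE " + condition;
    }

    public static void main(String[] args) {
        // Assumed table and column names: build a scramble over the first day,
        // then append the next day's rows from the same source table.
        System.out.println(createScramble(
            "myschema.sales_scramble", "myschema.sales", "order_date = '2018-12-01'"));
        System.out.println(insertScramble(
            "myschema", "sales_scramble", "order_date = '2018-12-02'"));
    }
}
```

The two conditions are disjoint on purpose: nothing in the statement itself prevents the same source rows from being appended twice.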

The current limitations of scramble appending are as follows:

  1. New scramble data can be appended ONLY from the original source table, because the current VerdictDB metadata for a scramble maps it to a single source table. We may remove this mapping in the future, which would allow a scramble to be generated from multiple source tables.
  2. Appending will not work on scrambles built prior to this version, as their metadata lacks the information needed to create the same type of scramble on new data.
  3. It is the user's responsibility to avoid appending duplicates to an existing scramble.

The current implementation needs to store the classes that implement the ScramblingMethod interface (or extend the ScramblingMethodBase abstract class) as part of ScrambleMeta in JSON. I therefore made the changes needed to make them serializable (e.g., removing the use of 'Optional' in FastConvergeScramblingMethod).
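
As a rough illustration of the kind of change described above, the sketch below swaps an Optional field for a plain nullable field. Note that java.util.Optional does not implement Serializable. The class name, field names, and the hand-written toJson method are all assumptions made for this example; they are not the actual FastConvergeScramblingMethod code, and the hand-rolled JSON only stands in for whatever JSON library does the real work.

```java
// Hypothetical class for illustration; not the actual VerdictDB implementation.
final class ScramblingSettingsSketch {

    // Previously this might have been Optional<String>; a nullable field is
    // easier to serialize. null means "no primary column configured".
    private String primaryColumnName;
    private long blockSize;

    // No-arg constructor plus getters/setters: the usual bean shape expected
    // by JSON object mappers.
    ScramblingSettingsSketch() {}

    String getPrimaryColumnName() { return primaryColumnName; }
    void setPrimaryColumnName(String name) { this.primaryColumnName = name; }

    long getBlockSize() { return blockSize; }
    void setBlockSize(long blockSize) { this.blockSize = blockSize; }

    // Hand-rolled JSON rendering (no string escaping) to keep the sketch
    // dependency-free; a real implementation would use a JSON library.
    String toJson() {
        String column = primaryColumnName == null
            ? "null"
            : "\"" + primaryColumnName + "\"";
        return "{\"blockSize\":" + blockSize + ",\"primaryColumnName\":" + column + "}";
    }

    public static void main(String[] args) {
        ScramblingSettingsSketch settings = new ScramblingSettingsSketch();
        settings.setBlockSize(100);
        System.out.println(settings.toJson());
        settings.setPrimaryColumnName("city");
        System.out.println(settings.toJson());
    }
}
```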

codecov-io commented 5 years ago

Codecov Report

Merging #327 into master will increase coverage by 0.17%. The diff coverage is 78.97%.

@@            Coverage Diff             @@
##           master     #327      +/-   ##
==========================================
+ Coverage   70.87%   71.03%   +0.17%     
==========================================
  Files         167      168       +1     
  Lines       11233    11485     +252     
  Branches     1847     1873      +26     
==========================================
+ Hits         7960     8157     +197     
- Misses       2813     2853      +40     
- Partials      460      475      +15

Impacted Files                                         Coverage Δ
.../verdictdb/sqlreader/ScramblingQueryGenerator.java  0% <ø> (ø)
.../org/verdictdb/core/scrambling/ScramblingPlan.java  100% <100%> (ø)
...erdictdb/core/sqlobject/InsertIntoSelectQuery.java  100% <100%> (ø)
...java/org/verdictdb/core/sqlobject/SelectQuery.java  94.5% <100%> (+0.06%)
.../org/verdictdb/core/scrambling/ScramblingNode.java  94.19% <100%> (+0.07%)
...va/org/verdictdb/core/scrambling/ScrambleMeta.java  82.98% <100%> (+2.26%)
...ctdb/core/scrambling/CreateScrambledTableNode.java  73.59% <100%> (+0.51%)
...erdictdb/core/scrambling/HashScramblingMethod.java  60% <37.04%> (-17.55%)
...ava/org/verdictdb/metastore/ScrambleMetaStore.java  91.12% <55.56%> (-6.27%)
.../main/java/org/verdictdb/sqlwriter/QueryToSql.java  68% <66.67%> (+4.37%)
... and 18 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 58306b8...e39e3f6.

dongyoungy commented 5 years ago

I made changes so that:

  1. it now uses 'cumulativeDistributions' metadata (i.e., storedProbDist in ScrambleMethodBase) to figure out block sizes for existing scrambles from prior versions of VerdictDB.
  2. a user can use either "INSERT SCRAMBLE ..." or "APPEND SCRAMBLE ..."
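
The thread does not show how block sizes are derived from the 'cumulativeDistributions' metadata. One plausible reading, sketched below under that assumption, is that each entry is the cumulative probability of a row landing in that block or an earlier one, so differencing adjacent entries gives each block's share of the rows. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the actual VerdictDB logic.
final class BlockSizeSketch {

    // Converts a cumulative distribution (non-decreasing, ending at 1.0)
    // into per-block fractions by differencing adjacent entries.
    static List<Double> blockFractions(List<Double> cumulative) {
        List<Double> fractions = new ArrayList<>();
        double previous = 0.0;
        for (double c : cumulative) {
            fractions.add(c - previous);
            previous = c;
        }
        return fractions;
    }

    // Estimates the number of rows per block for a table of totalRows rows.
    static long[] estimatedBlockRows(List<Double> cumulative, long totalRows) {
        List<Double> fractions = blockFractions(cumulative);
        long[] rows = new long[fractions.size()];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = Math.round(fractions.get(i) * totalRows);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed example: three blocks holding 25%, 25%, and 50% of the rows.
        List<Double> cumulative = Arrays.asList(0.25, 0.5, 1.0);
        System.out.println(blockFractions(cumulative));
        System.out.println(Arrays.toString(estimatedBlockRows(cumulative, 1000)));
    }
}
```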

dongyoungy commented 5 years ago

Removed (WIP) from the title since I think this PR is ready to be reviewed.