Closed Beastjoe closed 5 years ago
Can you tell me what are the differences from the existing FastConverge method?
The main difference is that FastConverge divides groups using the values of columns, while the Stratified method uses the size of each group. For instance, if a column is numeric, FastConverge will calculate the avg and stddev of that column and define the outlier group as any value lower or higher than the boundary (calculated using avg(column) and std(column)).
Another difference is that FastConverge applies to every numeric column, while Stratified only applies to the columns the user specifies.
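To make the avg/stddev boundary idea concrete, here is a hedged sketch (plain Python, not VerdictDB's actual implementation; the multiplier `k` is an assumption for illustration) of how an outlier boundary for a numeric column can be derived:

```python
# Illustrative sketch, NOT VerdictDB code: derive an outlier boundary for a
# numeric column from its mean and standard deviation, as FastConverge is
# described to do. The multiplier k is a hypothetical choice.
import statistics

def outlier_boundary(values, k=2.0):
    """Return (low, high); rows outside these bounds form the outlier group."""
    avg = statistics.mean(values)
    std = statistics.stdev(values)
    return avg - k * std, avg + k * std

values = [10, 12, 11, 9, 10, 13, 11, 200]  # 200 is an obvious outlier
low, high = outlier_boundary(values)
outliers = [v for v in values if v < low or v > high]
# outliers == [200]
```

Rows falling outside `(low, high)` would then be routed to the outlier group.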
I see. Thanks for the clarification.
Merging #337 into master will increase coverage by 0.28%. The diff coverage is 77.65%.
@@ Coverage Diff @@
## master #337 +/- ##
==========================================
+ Coverage 70.8% 71.08% +0.28%
==========================================
Files 168 169 +1
Lines 11492 11708 +216
Branches 1874 1920 +46
==========================================
+ Hits 8136 8321 +185
- Misses 2875 2896 +21
- Partials 481 491 +10
Impacted Files | Coverage Δ |
---|---|
...c/main/java/org/verdictdb/sqlsyntax/SqlSyntax.java | 75% <ø> (ø) :arrow_up: |
...erdictdb/core/scrambling/ScramblingMethodBase.java | 89.48% <ø> (ø) :arrow_up: |
.../verdictdb/core/execplan/ExecutableNodeRunner.java | 87.13% <ø> (+7.3%) :arrow_up: |
.../main/java/org/verdictdb/sqlsyntax/HiveSyntax.java | 38.64% <0%> (-2.82%) :arrow_down: |
...ain/java/org/verdictdb/sqlsyntax/SqliteSyntax.java | 12% <0%> (-0.5%) :arrow_down: |
...ain/java/org/verdictdb/sqlsyntax/ImpalaSyntax.java | 67.5% <0%> (-5.47%) :arrow_down: |
.../java/org/verdictdb/core/querying/ola/AggMeta.java | 90.75% <0%> (-0.92%) :arrow_down: |
...in/java/org/verdictdb/core/sqlobject/ColumnOp.java | 62.21% <0%> (-0.99%) :arrow_down: |
...main/java/org/verdictdb/sqlsyntax/SparkSyntax.java | 65% <0%> (-1.66%) :arrow_down: |
...java/org/verdictdb/sqlsyntax/PostgresqlSyntax.java | 65.63% <0%> (-6.78%) :arrow_down: |
... and 43 more |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c4c2b1b...383dfb7.
@pyongjoo @dongyoungy
I have added the implementation of stratified sampling. So far, it will put all rare groups into block 0 and do uniform sampling in the remaining blocks. The syntax is like
create scramble [scramble table] from [original table] METHOD stratified STRATIFIED BY [columns] RATIO [uniform sampling ratio] LEASTSAMPLINGSIZE [minimum sampling size]
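The block-assignment behavior described above can be sketched as follows. This is a hedged illustration with hypothetical names, not VerdictDB internals; "rare" is approximated here by a simple size threshold, which the PR later refines using the mean and stddev of group sizes:

```python
# Hypothetical sketch of the described behavior: rows of rare groups go to
# block 0; all other rows are spread uniformly over the remaining blocks.
import random

def assign_block(group_size, rare_threshold, num_blocks, rng):
    # Approximation for illustration: a group is "rare" if its size is
    # below rare_threshold (the PR later derives this threshold statistically).
    if group_size < rare_threshold:
        return 0
    return rng.randrange(1, num_blocks)  # uniform over blocks 1..num_blocks-1

rng = random.Random(0)
rare_blocks = [assign_block(5, 100, 10, rng) for _ in range(5)]      # all 0
other_blocks = [assign_block(5000, 100, 10, rng) for _ in range(5)]  # in 1..9
```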
@dongyoungy Shouldn't block 0 only contain rows from rare groups? I guess you mean combining block 0 and block 1?
I thought the idea later proposed by Young was something like:
and all rare groups go to tier 0 while others continue with tier 1?
I see. I will change the logic.
I have added the tests. For groups that are not rare (#rows > k/a), I think we can only guarantee statistically that at least k rows are selected in each block.
I think we can only guarantee statistically that at least k rows are selected in each block.
Is this because of `WHERE rand() < prob`?
Yes
Then I think we need to use different query to enforce at least K rows for such case.
I guess it could be something like (group-by omitted):
SELECT *
FROM (SELECT *, row_number() OVER (ORDER BY rand()) AS rn FROM table) tmp
WHERE rn <= table_size * prob
Basically, randomly shuffle the original table for each group and pick the top table_size * prob rows (which should be greater than K for non-rare groups). This should ensure that every group has at least K rows, I think?
pick top table_size * prob rows

To correct, I guess it should be group_size * prob rows, as we are dealing with individual groups in a table.
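The corrected per-group idea can be demonstrated end to end with SQLite window functions. This is a hedged illustration, not the actual VerdictDB query; table and column names are made up:

```python
# Illustration (SQLite, not VerdictDB's real query): shuffle rows within each
# group via a window function, then keep the top group_size * prob rows per
# group, so every group deterministically retains that many rows.
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", i) for i in range(100)] + [("b", i) for i in range(50)])

prob = 0.2
rows = conn.execute("""
    SELECT grp, val
    FROM (SELECT grp, val,
                 ROW_NUMBER() OVER (PARTITION BY grp ORDER BY random()) AS rn,
                 COUNT(*)     OVER (PARTITION BY grp)                   AS group_size
          FROM t) tmp
    WHERE rn <= group_size * ?
""", (prob,)).fetchall()

counts = Counter(g for g, _ in rows)
# counts == Counter({'a': 20, 'b': 10}): exactly group_size * prob rows per group
```

Unlike `WHERE rand() < prob`, which only hits the target count in expectation, this version picks an exact per-group row count.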
I just realized that if we can use row_number() to enforce the row count of tier 1 groups, we can also enforce the row count of tier 0 groups, like
SELECT *
FROM (SELECT *, row_number() OVER (ORDER BY rand()) AS rn FROM table) tmp
WHERE rn <= K
where K is the minimum sampling size. Then, we can avoid putting too many rows in block 0. What do you think?
I am not sure, because then the sampling probability for each group can differ across subsequent blocks in tier 0? I guess we can discuss this after the meeting tomorrow.
I have fixed all the unit tests. Please take a look. @pyongjoo @dongyoungy
@pyongjoo I think this PR is almost ready to be merged, except it now requires MySQL 8.0 instead of 5.5 due to its use of row_number() func.
Please take a look briefly and make comments on this PR if necessary.
@dongyoungy @pyongjoo I have changed the logic of stratified sampling based on Yongjoo's comment. Now, for groups that are not rare, we divide them into multiple tiers (10 tiers by default) and do uniform sampling. For instance, suppose a user specifies at least 100 rows for each group in each block, the sampling ratio is 0.1, and the block size is 100; then
Yongjoo said it is okay if we can guarantee that the expected number of rows of each group in each block is at least k. So, row_number() is not used.
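The expectation argument behind this relaxation can be spelled out in a few lines (a sketch of the reasoning, not project code):

```python
# Sketch of the expectation guarantee: under WHERE rand() < prob, each row is
# kept independently with probability prob, so the expected number of kept
# rows in a group of size n is n * prob. Guaranteeing E[rows] >= k therefore
# only requires prob >= k / n for every non-rare group.
def expected_rows(group_size, prob):
    return group_size * prob

def min_prob_for_expectation(k, group_size):
    return k / group_size

k = 100
group_size = 5000                                # a non-rare group
prob = min_prob_for_expectation(k, group_size)   # 0.02
assert expected_rows(group_size, prob) >= k      # expectation guarantee holds
```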
Here is the group size distribution on l_partkey and l_suppkey in tpch1g. The y-axis is the number of groups of the same size, and the x-axis is the group size. I think it can be approximated as a normal distribution.
The scrambling method will first create a temp table to count the group sizes, such as
CREATE TABLE [temp table] AS SELECT [stratified column], COUNT(*) FROM [original table] GROUP BY [stratified column]
Then some statistical nodes calculate the average and standard deviation of the group sizes and count the number of rare groups. After that, it creates the scramble table using logic similar to that of the fast converge method. Currently, a rare group (tier 0) is defined as one whose size is below [average group size] - 1.0 * [standard deviation]. It supports one stratified column.
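The pipeline above can be sketched end to end. This is a hedged illustration with SQLite standing in for the real engine and made-up table/column names, not VerdictDB's actual nodes:

```python
# Illustrative sketch of the described steps: (1) count group sizes into a
# temp table, (2) compute mean and stddev of the sizes, (3) mark groups below
# mean - 1.0 * stddev as rare (tier 0). Names are hypothetical.
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orig (l_partkey INTEGER, payload TEXT)")
# Ten groups of 20 rows each, plus one tiny (rare) group of 2 rows.
data = [(g, "x") for g in range(10) for _ in range(20)] + [(99, "x")] * 2
conn.executemany("INSERT INTO orig VALUES (?, ?)", data)

# Step 1: temp table of group sizes, as in the CREATE ... AS SELECT above.
conn.execute("""CREATE TEMP TABLE grp_size AS
                SELECT l_partkey, COUNT(*) AS cnt
                FROM orig GROUP BY l_partkey""")

# Step 2: average and standard deviation of the group sizes.
sizes = [c for (c,) in conn.execute("SELECT cnt FROM grp_size")]
avg, std = statistics.mean(sizes), statistics.stdev(sizes)

# Step 3: groups below the threshold become tier 0 (rare).
threshold = avg - 1.0 * std
rare = [g for (g, c) in conn.execute("SELECT l_partkey, cnt FROM grp_size")
        if c < threshold]
# rare == [99]: only the 2-row group falls below the threshold
```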