Beastjoe commented 5 years ago

Here is the group size distribution on l_partkey and l_suppkey in tpch1g. The y axis is the number of groups of same size and the x-axis is the group size. I think it can be assumed as normal distribution.

The scrambling method will first create a temp table to count the group size, such as CREATE [temp table] AS SELECT [stratified column], count(*) from [original table] group by [stratified column] Then some statistical nodes will calculate the average and standard deviation of group size and count the size of rare group. After that, it will create scramble table using some similar logics used in fast converge method. Currently, the rare group(tier 0) is defined to be [average group size] - 1.0*[standard deviation]. It supports one stratified column.

pyongjoo commented 5 years ago

Can you tell me what are the differences from the existing FastConverge method?

Beastjoe commented 5 years ago

The main difference is FastConverge divides group using the value of columns, while Stratified method uses the size of group. For instance, if a column is numeric type, FastConverge will calculate the avg and stddev of this column and set the outlier group to be any value lower or higher than the boundary (calculated using avg(column) and std(column)).

Another difference is FastConverge will apply to every columns that are numeric, while Stratified only applies to the column users specify.

pyongjoo commented 5 years ago

I see. Thanks for the clarification.

codecov-io commented 5 years ago

Codecov Report

Merging #337 into master will increase coverage by 0.28%. The diff coverage is 77.65%.

@@            Coverage Diff             @@
##           master     #337      +/-   ##
==========================================
+ Coverage    70.8%   71.08%   +0.28%     
==========================================
  Files         168      169       +1     
  Lines       11492    11708     +216     
  Branches     1874     1920      +46     
==========================================
+ Hits         8136     8321     +185     
- Misses       2875     2896      +21     
- Partials      481      491      +10

Impacted Files	Coverage Δ
...c/main/java/org/verdictdb/sqlsyntax/SqlSyntax.java	`75% <ø> (ø)`	:arrow_up:
...erdictdb/core/scrambling/ScramblingMethodBase.java	`89.48% <ø> (ø)`	:arrow_up:
.../verdictdb/core/execplan/ExecutableNodeRunner.java	`87.13% <ø> (+7.3%)`	:arrow_up:
.../main/java/org/verdictdb/sqlsyntax/HiveSyntax.java	`38.64% <0%> (-2.82%)`	:arrow_down:
...ain/java/org/verdictdb/sqlsyntax/SqliteSyntax.java	`12% <0%> (-0.5%)`	:arrow_down:
...ain/java/org/verdictdb/sqlsyntax/ImpalaSyntax.java	`67.5% <0%> (-5.47%)`	:arrow_down:
.../java/org/verdictdb/core/querying/ola/AggMeta.java	`90.75% <0%> (-0.92%)`	:arrow_down:
...in/java/org/verdictdb/core/sqlobject/ColumnOp.java	`62.21% <0%> (-0.99%)`	:arrow_down:
...main/java/org/verdictdb/sqlsyntax/SparkSyntax.java	`65% <0%> (-1.66%)`	:arrow_down:
...java/org/verdictdb/sqlsyntax/PostgresqlSyntax.java	`65.63% <0%> (-6.78%)`	:arrow_down:
... and 43 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c4c2b1b...383dfb7. Read the comment docs.

Beastjoe commented 5 years ago

@pyongjoo @dongyoungy I have added the implementation of stratified sampling. So far, it will put all rare groups into block 0 and do uniform sampling in the remaining blocks. The syntax is like create scramble [scramble table] from [original table] METHOD stratified STRATIFIED BY [columns] RATIO [uniform sampling ratio] LEASTSAMPLINGSIZE [minimum sampling size]

Beastjoe commented 5 years ago

@dongyoungy Shouldn't block 0 only contain rows from rare groups? I guess you mean combing block 0 and block 1?

block0 has at least k*a rows for each group

dongyoungy commented 5 years ago

I thought the idea later proposed by Young was something like:

<tier0, block0> : all rows from rare groups
<tier1, block0> : rows from other non-rare groups

and all rare groups go tier0 and others continue with tier1?

Beastjoe commented 5 years ago

I see. I will change the logic.

Beastjoe commented 5 years ago

I have added the tests. For groups that are not rare(#row> k/a), I think we can only guarantee statistically that at least k rows are selected in each block.

dongyoungy commented 5 years ago

I think we can only guarantee statistically that at least k rows are selected in each block.

Is this because of `WHERE rand() < prob'?

Beastjoe commented 5 years ago

Yes

dongyoungy commented 5 years ago

Then I think we need to use different query to enforce at least K rows for such case.

I guess It could be something like (group-by omitted):

SELECT * 
FROM (SELECT *, row_number() as rn FROM table ORDER BY rand()) tmp
WHERE rn <= table_size * prob

Basically, randomly shuffle the original table for each group and pick top table_size * prob rows (which should be greater than K for non-rare groups). This should ensure that every group has at least K rows I think?

dongyoungy commented 5 years ago

pick top table_size * prob rows

To correct, I guess it should be group_size * prob rows as we are dealing with individual groups in a table.

Beastjoe commented 5 years ago

I just realized if we could use row_number() to enforce the row number of tier1 group, we can also enforce the row number of tier 0 group, like

SELECT * 
FROM (SELECT *, row_number() as rn FROM table ORDER BY rand()) tmp
WHERE rn <= K

where K is the minimum sampling size. Then, we can avoid putting too much rows in block 0. What do you think?

dongyoungy commented 5 years ago

I am not sure because then sampling probability for each group can be different for subsequent blocks in tier0? I guess we can discuss this after the meeting tomorrow.

Beastjoe commented 5 years ago

I have fixed all the unit tests. You can have a check. @pyongjoo @dongyoungy

dongyoungy commented 5 years ago

@pyongjoo I think this PR is almost ready to be merged, except it now requires MySQL 8.0 instead of 5.5 due to its use of row_number() func.

Please take a look briefly and make comments on this PR if necessary.

Beastjoe commented 5 years ago

@dongyoungy @pyongjoo I have changed the logic of stratified sampling based on Yongjoo's comment. So far, for groups that are not rare, we divides them into multiple tiers (10 tiers in default) and do uniform sampling. For instance, if user specify at least 100 rows for each group in each block, and sampling ratio is 0.1, block size is 100, then

For groups of which #row < 1000, we put them in first block
For groups of which #row < 10000, we also put them in first block
For groups of which 10000 < #row < 20000, we uniformly sample within first 10 blocks
For groups of which 20000 < #row < 30000, we uniformly sample within first 20 blocks ...
For groups of which 90000 < #row < 100000, we uniformly sample within first 90 blocks
For groups of which #row > 100000, we simply do uniform sampling

Yongjoo said it is okay if we can guarantee the expectation rows of each group in each block is at least k. So, row_number() is not used.

verdict-project / verdict

Joezhong stratified sampling #337

Codecov Report