scaling factor for join

verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing

http://verdictdb.org

Apache License 2.0

248 stars 66 forks source link

scaling factor for join #385

Closed qingzma closed 5 years ago

qingzma commented 5 years ago

I have a question about the scaling factor for join, especially for SUM and COUNT. To be specific, if I make two hash (or uniform) samples, I could get the sum and count for the joined sample, but not for the real join table. I need a scaling factor to scale up the value. If it is a simple query not involving join, I could use the sampling ratio (or the exact number of tuples ) to scale up the value. But for joins, I do not find any clues about this. (Sorry if I missed something on the paper.)

pyongjoo commented 5 years ago

Thanks for the question. The scaling factor depends on sample types.

A join of two uniform samples: The sampling ratio is the product of the sampling ratios of the two tables, i.e., p1 p2. Thus, the scaling factor to obtain an unbiased answer is 1 / (p1p2).
A join of two hash samples: The same sampling ratio is typically used for both tables. Say it is p. Thus, the scaling factor to obtain an unbiased answer is 1 / p.

qingzma commented 5 years ago

Thanks for the question. The scaling factor depends on sample types.

A join of two uniform samples: The sampling ratio is the product of the sampling ratios of the two tables, i.e., p1 p2. Thus, the scaling factor to obtain an unbiased answer is 1 / (p1p2).

A join of two hash samples: The same sampling ratio is typically used for both tables. Say it is p. Thus, the scaling factor to obtain an unbiased answer is 1 / p.

Thanks for the detailed reply!

I asked this question because when joining two samples using such scaling factors, sometimes I could not get satisfactory results. We doubt that 1 / (p1*p2) won't be accurate enough for sample joins.

So have you ever noticed any bad predictions for sample joins? Anyway, I will check the experiments again to make sure that no mistakes are made.

Thanks.

pyongjoo commented 5 years ago

Low accuracy when joining two or more uniform samples is a known problem. There is no magic solution other than increasing sampling ratios or using a different sampling strategy (e.g., hash sampling).

qingzma commented 5 years ago

I agree, these are explained in the SIGMOD99 and SIGMOD18 paper. Thanks for your help.