prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Why force `long` type for approx_percentile weights? #12794

Closed MichaelChirico closed 3 years ago

MichaelChirico commented 5 years ago

From approx_percentile(x, w, percent):

The weight must be an integer value of at least one. It is effectively a replication count for the value x in the percentile set.
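
To make that concrete, here is a minimal sketch of the replication-count semantics (the literal values are made up, and since the result is approximate the two calls will only agree approximately):

-- a weight of 3 behaves as if 2.0 appeared three times in the input
select approx_percentile(cast(x as double), w, 0.5)
from (values (1.0, 1), (2.0, 3), (3.0, 1)) as t(x, w);

-- ...which should be roughly the same as the unweighted call on the replicated data
select approx_percentile(cast(x as double), 0.5)
from (values (1.0), (2.0), (2.0), (2.0), (3.0)) as t(x);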

I don't understand the implementation of approx_percentile in the weighted case very well. IIUC it's driven by the q-digest data structure from this paper: http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf

But I don't see any mention of incorporating weights there. I poked around in the Presto source code, but as near as I can tell the work is delegated upstream to airlift, and I don't follow what's going on there.

It's very common to have double weights (e.g. I'd guess most users are accustomed to normalizing weights to sum to one; the case that brought me here was using an exponential weighting kernel in a geospatial setting, where individual weights are between 0 and 1 but can sum to anything). Is there anything preventing the implementation from using double weights?
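
To make the desired usage concrete, this is roughly what I'd like to be able to write (table and column names are hypothetical), which the current bigint-weight signature rejects:

-- hypothetical table 'observations' with double columns value and distance
select approx_percentile(value, exp(-distance / 1000.0), 0.5) as wtd_median
from observations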

It's of course possible to kludge this by doing cast(pow(10, k) * double_weight as bigint) for some k of your choosing (or pow(2, k)); this introduces some extra noise into the estimation, but the function is already called approx_percentile.
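
Spelled out in isolation, the kludge looks something like this (just a sketch: my_table, value, and weight are hypothetical, k = 6 is an arbitrary choice, and the greatest(..., 1) guard is there because a weight that rounds to zero would be rejected):

select approx_percentile(
         value,
         greatest(cast(pow(10, 6) * weight as bigint), 1),
         array[0.1, 0.4, 0.6, 0.9]
       ) as wtd_quantiles
from my_table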

I did the following to explore this approach over a variety of data sizes, in terms of (1) accuracy as a function of k and (2) the impact of k on timing. Basically I found that increasing k can increase the run time a fair amount, but the impact on accuracy is hard to notice in this extremely toy example:

## QUERYING PRESTO FROM R
library(data.table)
library(microbenchmark)

presto_driver = RJDBC::JDBC(
  "com.facebook.presto.jdbc.PrestoDriver", 
  "presto-jdbc-0.206.jar"
)

presto_conn = DBI::dbConnect(presto_driver, URL, USER, PASSWORD)

presto = function(query, param_list) {
  DBI::dbGetQuery(presto_conn, glue::glue_data(param_list, query))
}

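# parameter grid: data-size factors (n_rows1 * n_rows2 total rows) and rounding factors (n_round = 10^k)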
iterations = CJ(n_rows1 = 10^(2:4),
                n_rows2 = 10^(0:3),
                n_round = 10^(2:6))
iterations = iterations[n_rows2 == 1 | n_rows1 == 1e4]

iterations[ , n_rows := n_rows1*n_rows2]
iterations[ , n_rows_l10 := log10(n_rows)]
iterations[ , n_round_l10 := log10(n_round)]

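# format the counts as fixed-point strings so glue doesn't splice scientific notation (e.g. 1e+05) into the SQL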
iterations[ , n_rows1 := trimws(sprintf('%20.0f', n_rows1))]
iterations[ , n_rows2 := trimws(sprintf('%20.0f', n_rows2))]
iterations[ , n_round := trimws(sprintf('%20.0f', n_round))]

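# replicate the grid 10 times so each parameter combination is timed repeatedly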
iterations = rbindlist(
  replicate(10L, iterations, simplify = FALSE), 
  idcol = 'replication'
)

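# run each parameter combination against Presto, timing each query with microbenchmark::get_nanotime()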
out_list = vector('list', nrow(iterations))
for (ii in seq_len(nrow(iterations))) {
  print(ii)
  t0 = get_nanotime()
  qrun = presto("
  select id, 
         approx_percentile(
           value, 
           if(cast({n_round}*weight as bigint) = 0, 1, 
              cast({n_round}*weight as bigint)), 
           array[.1, .4, .6, .9]
         ) as wtd_quantiles
  from (
    select random(10) id, random() value, random() weight
    from (select 1) a 
      cross join unnest (sequence(1, {n_rows1}))
      cross join unnest (sequence(1, {n_rows2}))
  )
  group by id order by id
  ", iterations[ii])
  qrun[ , iter_number := ii]
  qrun[ , time_taken := (get_nanotime() - t0)/1e9]
  out_list[[ii]] = qrun
}
output = rbindlist(out_list)

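# the quantile array comes back through JDBC as a bracketed string; split it into one numeric column per percentile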
output[ , paste0('q_', c(10, 40, 60, 90)) := 
          tstrsplit(gsub('[][]', '', wtd_quantiles), ', ', 
                    fixed = TRUE, type.convert = TRUE)]
output[ , wtd_quantiles := NULL]

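# attach the iteration parameters back onto each result row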
output[ , names(iterations) := iterations[iter_number]]
output[ , c('n_rows1', 'n_rows2') := NULL]
output[ , n_round_l10 := log10(as.numeric(n_round))]  # n_round is character after the sprintf() formatting above

# n_round matters for timing
output[n_rows > 10000, {
  boxplot(I(time_taken/60) ~ n_round_l10 + n_rows_l10, 
          sep = '_', log = 'y', las = 2L,
          ylab = 'Evaluation Time (m)',
          xlab = 'log10(n_round)_log10(n_rows)',
          main = 'Impact of Rounding Factor on Evaluation Time')
  abline(v = c(5.5, 10.5))
  NULL
}]

# n_round not so important for accuracy (caveat: very simple case)
output[ , {
  boxplot(q_40 ~ n_round_l10 + n_rows_l10, 
          sep = '_', las = 2L, notch = TRUE,
          ylab = 'Estimated 40th Percentile',
          xlab = 'log10(n_round)_log10(n_rows)',
          main = 'Impact of Rounding Factor on Estimate Accuracy')
  abline(v = 5*(1:5) + .5)
  NULL
}]
[Screenshots: boxplots of "Impact of Rounding Factor on Evaluation Time" and "Impact of Rounding Factor on Estimate Accuracy"]

This took quite a while to run, so I'm saving the timings here (CSV attached as TXT) for future reference:

approx_percentiles_timings.txt

I suppose it'd be appropriate to do a more formal statistical analysis of the nature of the error introduced by this rounding approach before building it into approx_percentile...

Long story short: (1) is it possible to adjust the current weighted-percentile algorithm to accept double weights? (2) If not, could the function be kludged into doing so, either by applying some rounding internally or by exposing a new parameter that controls the degree of rounding?


(PS: I recognize that the way I'm generating dummy tables of a given size with cross join () cross join () is a bit clunky... any suggestions for improving this?)
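
The best I've come up with so far (untested sketch; the two cross-joined sequences seem to still be needed because, IIRC, sequence() is capped at 10,000 elements per call) is to drop the (select 1) shim and unnest directly in the FROM clause, but maybe there's something better:

-- generate 10000 * 100 = 1e6 dummy rows of random data
select random(10) as id, random() as value, random() as weight
from unnest(sequence(1, 10000)) as t1(i)
  cross join unnest(sequence(1, 100)) as t2(j)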

tdcmeehan commented 5 years ago

I think it's that way for historical reasons, and the conversation in the linked issue above confirms it. We'll be happy to make the change here once it's available in airlift.

MichaelChirico commented 5 years ago

The airlift change is now done.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.