vega / vegafusion

Serverside scaling for Vega and Altair visualizations
https://vegafusion.io
BSD 3-Clause "New" or "Revised" License
317 stars 17 forks source link

Consider computing bin edge calculation #468

Open jonmmease opened 7 months ago

jonmmease commented 7 months ago

I started playing with implementing the Vega/D3 automatic bin selection algorithm in SQL. I got pretty close with this (in duckdb)

-- [duckdb]
WITH
  -- Input table
  "tbl_a" as (
    SELECT * FROM (VALUES (-2.9), (0), (23.5)) as tbl(col_value)
  ),
  -- Compute raw min and max
  "tbl_b" as (
    SELECT MIN(col_value) as "min_", MAX(col_value) as "max_"
    FROM "tbl_a"
  ),
  -- Config and scalar calculations
  "tbl_0" AS (
    SELECT 
      -- config
      10 as base
      , 20 as maxbins
      , 0 as minstep
      -- pass through
      , min_
      , max_
      -- calculation
      , ln(base) as logb
      , max_ - min_ as span
      , ceil(ln(maxbins) / logb) as level
      , greatest(minstep, pow(base, round(ln(span) / logb) - level)) as step
      FROM "tbl_b"
  ),
  -- Increase step size if too many bins
  "tbl_1" as (
    SELECT logb, maxbins, minstep, base, min_, max_, span
      , ceil(span / (step * pow(base, base_factor))) as num_bins
      , step * pow(base, base_factor) as step
    FROM "tbl_0"
    CROSS JOIN (SELECT * FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as tbl(base_factor))
    WHERE num_bins <= maxbins
    ORDER BY num_bins DESC
    LIMIT 1
  ),
  -- Decrease step size (based on divide of [5, 2]) if allowed
  "tbl_2" as (
    SELECT base, num_bins, min_, max_, logb, step / div as step
    FROM "tbl_1"
    CROSS JOIN (SELECT * FROM (VALUES (5), (2), (1)) as tbl(div))
    WHERE step >= minstep AND span / step <= maxbins
    ORDER BY step
    LIMIT 1
  ),
  -- Update precision of min_ and max_
  "tbl_3" as (
    SELECT base, step, num_bins, min_, max_
      , ln(step) as v, CASE WHEN v >= 0.0 THEN 0.0 ELSE floor(-v / logb) + 1.0 END as precision
      , pow(base, -precision - 1.0) as eps
      , floor(min_ / step + eps) * step as v1
      , CASE WHEN min_ < v1 THEN v1 - step ELSE v1 END as final_min
      , ceil(max_ / step) * step as final_max2
      , final_min as start
      , CASE WHEN final_max2 <> final_min THEN final_max2 ELSE final_min + step END as stop

    FROM "tbl_2"
  ),
  -- Add new binned columns to input table
  "tbl_4" as (
    SELECT start, stop, step, col_value
      , start + floor((col_value - start) / (step)) * step as bin_start
      , bin_start + step as bin_end
    from "tbl_a"
    CROSS JOIN "tbl_3"
  )
SELECT * FROM "tbl_4"
|    |   start |   stop |   step |   col_value |   bin_start |   bin_end |
|---:|--------:|-------:|-------:|------------:|------------:|----------:|
|  0 |      -4 |     24 |      2 |        -2.9 |          -4 |        -2 |
|  1 |      -4 |     24 |      2 |         0   |           0 |         2 |
|  2 |      -4 |     24 |      2 |        23.5 |          22 |        24 |

In place of the while loop in the "Increase step size if too many bins" section of the algorithm, I added a fixed number of iterations and selected the first one that satisfies the criteria. I'm not sure how best to select this threshold, but I don't think in needs to be very high (and it shouldn't be very expensive).

In the case where the bin transform is required to output a signal, we would need to perform this calculation in Rust anyway, but for the case where it doesn't need to output a signal, it might be benefitial to avoid performing a separate query to compute the bin edges.