uwdata / mosaic

An extensible framework for linking databases and interactive views.
https://idl.uw.edu/mosaic
Other
755 stars 46 forks source link

Aggregation error when dynamically assigning columns #487

Closed mhkeller closed 2 weeks ago

mhkeller commented 3 weeks ago

I'm trying to make a regression chart where the x and y columns are assigned dynamically via a menu. I had this working circa version 0.7.1 but after updating to 0.10.0 I am getting an error about invalid aggregations. I tried downgrading but, mysteriously, the error remains.

I made a reproduction repo here using the athletes example: https://github.com/mhkeller/mosaic-aggr-error

Here's the error, which occurs when you hover or brush on the chart area:

> EXEC CREATE TEMP TABLE IF NOT EXISTS cube_index_ea68316b AS SELECT CAST(MIN("x") FILTER (WHERE ("y" IS NOT NULL)) AS DOUBLE) AS "x0", CAST(MAX("x") FILTER (WHERE ("y" IS NOT NULL)) AS DOUBLE) AS "x1", "sex", FLOOR(580::DOUBLE * ("height" - 1.21::DOUBLE))::INTEGER AS "active0", FLOOR(3.4769349124067936::DOUBLE * ("weight" - 20.767793662748627::DOUBLE))::INTEGER AS "active1", REGR_COUNT("y", "x") AS "__count_y_x__", REGR_AVGX("y", "x") AS "__avg_x_y__", REGR_AVGY("y", "x") AS "__avg_y_x__", SUM(("x" - (SELECT AVG("x") FROM "athletes")) * ("y" - (SELECT AVG("y") FROM "athletes"))) AS "__sxy_y_x__", SUM("x" - (SELECT AVG("x") FROM "athletes")) FILTER ("y" IS NOT NULL) AS "__rs_x__", SUM("y" - (SELECT AVG("y") FROM "athletes")) FILTER ("x" IS NOT NULL) AS "__rs_y__", SUM(("x" - (SELECT AVG("x") FROM "athletes")) ** 2) FILTER ("y" IS NOT NULL) AS "__rss_x__", SUM(("y" - (SELECT AVG("y") FROM "athletes")) ** 2) FILTER ("x" IS NOT NULL) AS "__rss_y__" FROM (SELECT "height" AS "x", "weight" AS "y", "sex", "height", "weight" FROM "athletes" AS "source") GROUP BY "sex", "active0", "active1"
[Error: Binder Error: aggregate function calls cannot be nested
LINE 1: ...) AS "__avg_y_x__", SUM(("x" - (SELECT AVG("x") FROM "athletes")) * ("y" - (SE...
                                                  ^] {
  errno: -1,
  code: 'DUCKDB_NODEJS_ERROR',
  errorType: 'Binder'
}
REQUEST 3.5

If I hard-code the x and y column names, everything works fine when I brush.

Here is the query that is executed when the column names are hard-coded:

EXEC CREATE TEMP TABLE IF NOT EXISTS cube_index_5f5f1614 AS SELECT CAST(MIN("height") FILTER (WHERE ("weight" IS NOT NULL)) AS DOUBLE) AS "x0", CAST(MAX("height") FILTER (WHERE ("weight" IS NOT NULL)) AS DOUBLE) AS "x1", "sex", FLOOR(580::DOUBLE * ("height" - 1.21::DOUBLE))::INTEGER AS "active0", FLOOR(3.4769349124067936::DOUBLE * ("weight" - 20.767793662748627::DOUBLE))::INTEGER AS "active1", REGR_COUNT("weight", "height") AS "__count_weight_height__", REGR_AVGX("weight", "height") AS "__avg_height_weight__", REGR_AVGY("weight", "height") AS "__avg_weight_height__", SUM(("height" - (SELECT AVG("height") FROM "athletes")) * ("weight" - (SELECT AVG("weight") FROM "athletes"))) AS "__sxy_weight_height__", SUM("height" - (SELECT AVG("height") FROM "athletes")) FILTER ("weight" IS NOT NULL) AS "__rs_height__", SUM("weight" - (SELECT AVG("weight") FROM "athletes")) FILTER ("height" IS NOT NULL) AS "__rs_weight__", SUM(("height" - (SELECT AVG("height") FROM "athletes")) ** 2) FILTER ("weight" IS NOT NULL) AS "__rss_height__", SUM(("weight" - (SELECT AVG("weight") FROM "athletes")) ** 2) FILTER ("height" IS NOT NULL) AS "__rss_weight__" FROM (SELECT "sex", "height", "weight" FROM "athletes" AS "source") GROUP BY "sex", "active0", "active1"

Here's a diff of the two queries: https://www.diffchecker.com/nF0AQG1R/

It seems the only thing that's different is that the query from the dynamically generated scenario renames the columns to "x" and "y". The part of the query it's failing on is: SELECT AVG("x") FROM "athletes") which makes sense to me since the alias "x" doesn't exist on the athletes table, only in the FROM clause at the end (SELECT "height" AS "x", "weight" AS "y", "sex", "height", "weight" FROM "athletes" AS "source").

(edit: After thinking about this and the specific error message, the error may be happening before the select portion of the query is evaluated. If it was complaining about "x" not existing, it would be a different error message).

I did a small tweak where I do this at the beginning

with t as (
    SELECT 
      "height" AS x, 
      "weight" AS y, 
      "sex"
    FROM 
      "athletes" AS "source"
  )

and then change the references in the subqueries to be FROM t. That fixes the error although the numbers are slightly different from the query with hard-coded column names. That's as far as I got. I'm not sure why the numbers would be slightly off but maybe I messed up something in the syntax. It seems that avoiding references to the original table, though, when column names are being aliased could be a good fix, but my knowledge of Mosaic internals is low.

jheer commented 3 weeks ago

Thanks for the thorough reproduction! This allowed me to quickly track down the problem. I have a fix in #491, along with additional changes that should make your example work as it should.

mhkeller commented 3 weeks ago

Thanks a bunch for writing a fix so quickly! I'm glad the repro was helpful – it's been on my list for a while to make an issue for this and I finally had some time last night to get it all documented.