jconway / plr

PL/R - R Procedural Language for PostgreSQL
http://www.joeconway.com
90 stars 41 forks source link

Window function performance seems lacking. #28

Open iangow opened 7 years ago

iangow commented 7 years ago

The performance of PL/R window functions is fairly poor. By building the "windows" by hand and pass them to R as vectors, I can get much better performance. Would it be possible for PL/R to use a similar approach to constructing windows?

The two queries below give the same results, but the first takes 37 seconds and the second 6 seconds. The performance of the second is even better (about 3 seconds) if I eliminate the extra array_agg/unnest steps I'm taking to be able to keep track of the data.

Query 1:

Here is a window function

CREATE OR REPLACE FUNCTION public.winsorize(
    double precision,
    double precision)
  RETURNS double precision AS
$BODY$
    library(psych)
    return(winsor(as.vector(farg1), arg2)[prownum])
$BODY$ LANGUAGE plr WINDOW ;

Here is a query using this window function:

WITH raw_data AS (
    SELECT gvkey, datadate, date_part('year', datadate) AS year,
        CASE WHEN lag(ceq) OVER w > 0 THEN ni/lag(ceq) OVER w END AS roe
    FROM comp.funda
    WINDOW w AS (PARTITION BY gvkey ORDER BY datadate)
    ORDER BY gvkey, datadate
    LIMIT 100000)

SELECT year, gvkey, datadate, roe,
    winsorize(roe, 0.05) OVER (PARTITION BY year) AS roe_w
FROM raw_data
WHERE roe IS NOT NULL
ORDER BY year, roe;

Query 2: Here is an alternative version of the query above in which the "windows" are constructed by hand.

WITH raw_data AS (
    SELECT gvkey, datadate, date_part('year', datadate) AS year,
        CASE WHEN lag(ceq) OVER w > 0 THEN ni/lag(ceq) OVER w END AS roe
    FROM comp.funda
    WINDOW w AS (PARTITION BY gvkey ORDER BY datadate)
    ORDER BY gvkey, datadate
    LIMIT 100000),

intermediate AS (
    SELECT year, 
    array_agg(gvkey ORDER BY roe) AS gvkeys,
    array_agg(datadate ORDER BY roe) AS datadates,
    array_agg(roe ORDER BY roe) AS roes,
    winsorize_vec(array_agg(roe ORDER BY roe), 0.05) AS roe_ws
    FROM raw_data
    WHERE roe IS NOT NULL
    GROUP BY year)

SELECT year, unnest(gvkeys) AS gvkey,
    unnest(datadates) AS datadate,
    unnest(roes) AS roe,
    unnest(roe_ws) AS roe_w
FROM intermediate
ORDER BY year, roe;

The query is using this function:

CREATE OR REPLACE FUNCTION public.winsorize_vec(
    float8[],
    double precision)
  RETURNS float8[] AS
$BODY$
    library(psych)
    return(winsor(as.vector(arg1), arg2))
$BODY$ LANGUAGE plr; 

One idea I had was to create an SQL function like this:

CREATE OR REPLACE FUNCTION public.winsorize_sql(
    double precision,
    double precision)
  RETURNS double precision AS
$BODY$
     SELECT unnest(winsorize_vec(array_agg($1), $2))
$BODY$ LANGUAGE sql WINDOW 

This compiles and I can use it in a query, but I get NULL as the return value every time.