The performance of PL/R window functions is fairly poor. By building the "windows" by hand and pass them to R as vectors, I can get much better performance. Would it be possible for PL/R to use a similar approach to constructing windows?
The two queries below give the same results, but the first takes 37 seconds and the second 6 seconds. The performance of the second is even better (about 3 seconds) if I eliminate the extra array_agg/unnest steps I'm taking to be able to keep track of the data.
Query 1:
Here is a window function
CREATE OR REPLACE FUNCTION public.winsorize(
double precision,
double precision)
RETURNS double precision AS
$BODY$
library(psych)
return(winsor(as.vector(farg1), arg2)[prownum])
$BODY$ LANGUAGE plr WINDOW ;
Here is a query using this window function:
WITH raw_data AS (
SELECT gvkey, datadate, date_part('year', datadate) AS year,
CASE WHEN lag(ceq) OVER w > 0 THEN ni/lag(ceq) OVER w END AS roe
FROM comp.funda
WINDOW w AS (PARTITION BY gvkey ORDER BY datadate)
ORDER BY gvkey, datadate
LIMIT 100000)
SELECT year, gvkey, datadate, roe,
winsorize(roe, 0.05) OVER (PARTITION BY year) AS roe_w
FROM raw_data
WHERE roe IS NOT NULL
ORDER BY year, roe;
Query 2:
Here is an alternative version of the query above in which the "windows" are constructed by hand.
WITH raw_data AS (
SELECT gvkey, datadate, date_part('year', datadate) AS year,
CASE WHEN lag(ceq) OVER w > 0 THEN ni/lag(ceq) OVER w END AS roe
FROM comp.funda
WINDOW w AS (PARTITION BY gvkey ORDER BY datadate)
ORDER BY gvkey, datadate
LIMIT 100000),
intermediate AS (
SELECT year,
array_agg(gvkey ORDER BY roe) AS gvkeys,
array_agg(datadate ORDER BY roe) AS datadates,
array_agg(roe ORDER BY roe) AS roes,
winsorize_vec(array_agg(roe ORDER BY roe), 0.05) AS roe_ws
FROM raw_data
WHERE roe IS NOT NULL
GROUP BY year)
SELECT year, unnest(gvkeys) AS gvkey,
unnest(datadates) AS datadate,
unnest(roes) AS roe,
unnest(roe_ws) AS roe_w
FROM intermediate
ORDER BY year, roe;
The query is using this function:
CREATE OR REPLACE FUNCTION public.winsorize_vec(
float8[],
double precision)
RETURNS float8[] AS
$BODY$
library(psych)
return(winsor(as.vector(arg1), arg2))
$BODY$ LANGUAGE plr;
One idea I had was to create an SQL function like this:
CREATE OR REPLACE FUNCTION public.winsorize_sql(
double precision,
double precision)
RETURNS double precision AS
$BODY$
SELECT unnest(winsorize_vec(array_agg($1), $2))
$BODY$ LANGUAGE sql WINDOW
This compiles and I can use it in a query, but I get NULL as the return value every time.
The performance of PL/R window functions is fairly poor. By building the "windows" by hand and pass them to R as vectors, I can get much better performance. Would it be possible for PL/R to use a similar approach to constructing windows?
The two queries below give the same results, but the first takes 37 seconds and the second 6 seconds. The performance of the second is even better (about 3 seconds) if I eliminate the extra
array_agg
/unnest
steps I'm taking to be able to keep track of the data.Query 1:
Here is a window function
Here is a query using this window function:
Query 2: Here is an alternative version of the query above in which the "windows" are constructed by hand.
The query is using this function:
One idea I had was to create an SQL function like this:
This compiles and I can use it in a query, but I get
NULL
as the return value every time.