Tweaks to get code to work with database:

- complete_cusip into cusips
- copy_to() to get local data frame to database
- length(unique(cusip_id)) to n_distinct(cusip_id)
- bind_rows() to union_all()
- drop_na() (could use a filter() instead)
- na.rm = TRUE to suppress warnings (not necessary)

library(tidyverse)
library(DBI)
tidy_finance <- dbConnect(
duckdb::duckdb(),
"data/tidy_finance.duckdb",
read_only = TRUE)
mergent <- tbl(tidy_finance, "mergent")
cusips <- mergent |> pull(complete_cusip)
bonds_outstanding <-
expand_grid(date = seq(ymd("2014-01-01"),
ymd("2016-11-30"),
by = "quarter"),
complete_cusip = cusips) |>
copy_to(tidy_finance, df = _, name = "cusips") |>
left_join(mergent |> select(complete_cusip,
offering_date,
maturity),
by = "complete_cusip") |>
mutate(offering_date = floor_date(offering_date),
maturity = floor_date(maturity)) |>
filter(date >= offering_date & date <= maturity) |>
count(date) |>
mutate(type = "Outstanding")
trace_enhanced <- tbl(tidy_finance, "trace_enhanced")
bonds_traded <-
trace_enhanced |>
mutate(date = floor_date(trd_exctn_dt, "quarters")) |>
group_by(date) |>
summarize(n = n_distinct(cusip_id),
type = "Traded",
.groups = "drop")
bonds_outstanding |>
union_all(bonds_traded) |>
ggplot(aes(
x = date,
y = n,
color = type,
linetype = type
)) +
geom_line() +
labs(
x = NULL, y = NULL, color = NULL, linetype = NULL,
title = "Number of bonds outstanding and traded each quarter"
)
mergent |>
mutate(maturity = as.numeric(maturity - offering_date) / 365,
offering_amt = offering_amt / 10^3) |>
pivot_longer(cols = c(maturity, coupon, offering_amt),
names_to = "measure") |>
group_by(measure) |>
summarize(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
min = min(value, na.rm = TRUE),
q05 = quantile(value, 0.05, na.rm = TRUE),
q50 = quantile(value, 0.50, na.rm = TRUE),
q95 = quantile(value, 0.95, na.rm = TRUE),
max = max(value, na.rm = TRUE)
)
#> # Source: SQL [3 x 8]
#> # Database: DuckDB 0.7.1 [root@Darwin 22.5.0:R 4.3.0/data/tidy_finance.duckdb]
#> measure mean sd min q05 q50 q95 max
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 coupon 4.15 3.72 0 0 4.88 9.62 39
#> 2 offering_amt 190. 419. 0 0.349 12.2 1000 15000
#> 3 maturity 7.29 8.20 0.00822 1.03 5.01 30.0 100.
trace_enhanced |>
group_by(trd_exctn_dt) |>
summarize(trade_size = sum(entrd_vol_qt * rptd_pr / 100, na.rm = TRUE) / 10^6,
trade_number = n(),
.groups = "drop") |>
pivot_longer(cols = c(trade_size, trade_number),
names_to = "measure") |>
group_by(measure) |>
summarize(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
min = min(value, na.rm = TRUE),
q05 = quantile(value, 0.05, na.rm = TRUE),
q50 = quantile(value, 0.50, na.rm = TRUE),
q95 = quantile(value, 0.95, na.rm = TRUE),
max = max(value, na.rm = TRUE)
)
#> # Source: SQL [2 x 8]
#> # Database: DuckDB 0.7.1 [root@Darwin 22.5.0:R 4.3.0/data/tidy_finance.duckdb]
#> measure mean sd min q05 q50 q95 max
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 trade_number 25921. 5460. 438 17851. 26025 34458. 40889
#> 2 trade_size 12968. 3574. 17.2 6138. 13408. 17851. 20905.
dbDisconnect(tidy_finance, shutdown = TRUE)
Created on 2023-05-10 with reprex v2.0.2
Thank you for the extensive proposal and sorry for the delayed response. We decided to keep the code as it is, namely downloading the raw TRACE data and cleaning it in memory. We did not find a consistent performance improvement by moving the pipelines to the WRDS server when we coded the cleaning procedure. On the contrary, the current in-memory solution was actually faster. And optimizing the electricity bill of users is currently not on our agenda ;)
We also decided against using DuckDB for aggregation because (i) not everybody may have access to DuckDB as opposed to SQLite (e.g., at companies) and (ii) one big advantage of DuckDB (the Postgres scanner extension) is not consistently available for Windows users.
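For readers in that situation, the pipelines above are backend-agnostic; only the connection line changes. A minimal sketch of the SQLite variant, assuming a hypothetical data/tidy_finance.sqlite file holding the same tables (note that some lubridate translations, such as floor_date(), are weaker in SQLite):

library(DBI)
library(dplyr)

# Same workflow against an SQLite file; everything downstream of tbl()
# is unchanged.
tidy_finance <- dbConnect(
  RSQLite::SQLite(),
  "data/tidy_finance.sqlite"
)
mergent <- tbl(tidy_finance, "mergent")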
Makes sense. A couple of notes:
- Not being able to use compute() (in SQL, CREATE TEMPORARY TABLE) makes it very hard to have a good WRDS server-side solution: the query will get very big and unwieldy fast (see the sketch after this list).
- Regarding DuckDB, I wonder how restrictive this is in practice. Both SQLite (install.packages("RSQLite")) and DuckDB (install.packages("duckdb")) are user-installed R packages (i.e., not part of base R). Perhaps more IT departments have white-listed SQLite for various reasons. I do note that DuckDB is 81 MB (very big by CRAN standards) because of the way they put everything in the one package.
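To make the first note concrete, a minimal sketch of what compute() does where it is allowed (table and column names are from the reprex above; the table name quarterly_trades is hypothetical):

# compute() runs the query built so far and materializes the result as a
# temporary table (in SQL, CREATE TEMPORARY TABLE ... AS SELECT ...).
# Later steps query that table instead of re-nesting the whole upstream
# SQL, which is what otherwise makes queries big and unwieldy.
quarterly_trades <- trace_enhanced |>
  mutate(date = floor_date(trd_exctn_dt, "quarters")) |>
  compute(name = "quarterly_trades")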
I think a more robust approach than what I proposed earlier might put the data in separate parquet files. However, this adds a little overhead for users in terms of understanding how to load them, and it seems not right for the mass of readers of Tidy Finance.
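To make that overhead concrete, a minimal sketch using the arrow package (file paths are hypothetical):

library(arrow)
library(dplyr)

# One-off: export a table to its own parquet file.
write_parquet(collect(mergent), "data/mergent.parquet")

# Later sessions: open the file lazily and use ordinary dplyr verbs,
# collecting only the final result.
mergent <- open_dataset("data/mergent.parquet")
mergent |>
  filter(offering_amt > 0) |>
  collect()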
For my own book, I made a different design choice. To keep each chapter independent, I do not have any local data storage and use the WRDS server a lot more (i.e., collect() later). This has its own downsides, so I offer two alternative paths forward for users looking to store data. Both of these are in appendices.
The latter option yields a data repository of about 9 GB that could be stuck in Dropbox and shared with co-authors. I suspect that anything using an SQLite (or DuckDB) database would make it difficult to share data with co-authors if there's any chance of writing and reading at the same time.
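On that last point, the reprex above already opens the file in the read-only mode that DuckDB requires for concurrent readers; a sketch of the constraint:

library(DBI)

# Several processes can read the same DuckDB file, but only if all of
# them connect read-only; a read-write connection locks out everyone else.
con <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = TRUE
)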
Why not have the WRDS PostgreSQL server do all the work?
Steps:
- No collect() until the end.
- An nrow() function to handle remote data frames (see the sketches after this list).
- window_order() in place of arrange() when used for window functions.
- trd_exctn_tm needs handling (PostgreSQL wants a time zone … this could be sorted out later).

Based on output in comment below, code seems to work fine. (Not sure if it delivers much performance benefit, but it saves on readers' electricity bills.)
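Minimal sketches of the nrow() and window_order() items, using the tables from the comment above (nrow_db is a hypothetical helper name):

library(dplyr)
library(dbplyr)

# nrow() returns NA for a remote table, so count rows in the database:
nrow_db <- function(df) {
  df |>
    count() |>
    pull(n)
}

# dbplyr uses window_order(), not arrange(), to set the ORDER BY of a
# window function such as lag():
trace_enhanced |>
  group_by(cusip_id) |>
  window_order(trd_exctn_dt) |>
  mutate(prev_pr = lag(rptd_pr)) |>
  ungroup()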
Note that the code in the comment below does data aggregation in the database and is about 10 times faster than code using collect().
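The difference can be sketched with daily trade counts: aggregating in the database transfers one row per day, while collecting first transfers every trade record.

# Aggregate in the database, then bring back the (small) result.
in_db <- trace_enhanced |>
  group_by(trd_exctn_dt) |>
  summarize(n = n()) |>
  collect()

# Bring back every trade record, then aggregate locally.
in_memory <- trace_enhanced |>
  collect() |>
  group_by(trd_exctn_dt) |>
  summarize(n = n())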