Closed iangow closed 5 years ago
Here is some untested(!) R code for one of the two subsidiary tables:
library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
pg <- dbConnect(RPostgres:Postgres())
dbExecute(pg, "SET work_mem TO '3GB'")
# Define PG data ----
# Project schema
rs <- dbExecute(pg, "SET search_path TO streetevents, public")
calls_raw <- tbl(pg, "calls_raw"))
rs <- dbExecute(pg, "DROP TABLE IF EXISTS calls")
calls <-
calls_raw %>%
semi_join(file_list, by = c("file_path", "file_name", "last_update")) %>%
select(-file_path, -sha1, -company_id, -cusip, -sedol, -isin) %>%
distinct() %>%
compute(name = "calls", temporary = FALSE)
# Should check that calls are uniquely identified by file_name and
# last_update
calls %>%
group_by(file_name, last_update) %>%
filter(n() > 1) %>%
ungroup() %>%
count()
db_comment <- paste0("CREATED USING create_calls.R from ",
"GitHub iangow/se_core ON ", Sys.time())
dbGetQuery(pg, paste0("COMMENT ON TABLE calls IS '",
db_comment, "';"))
dbGetQuery(pg, "ALTER TABLE calls OWNER TO streetevents")
dbGetQuery(pg, "GRANT SELECT ON calls TO streetevents_access")
dbDisconnect(pg)
I am not sure whether the semi_join(call_files)
does anything (or is necessary). Perhaps worth checking if it affects the result.
The second subsidiary table would be called something like company_ids
and would select file_name, last_update
(the primary key variables) and company_id, cusip, sedol, isin
but only for cases where at least one of these is not NULL
(SQL, which sort of equates to being not NA
in R or not None
in Python ... for this purpose, these are the same thing).
So you might have a filter something like filter(!is.na(company_id) | !is.na(cusip)
(adding in filters for isin
and sedol
. I am not sure whether unique
would be necessary. If not and the result gives a unique row for each value of the candidate primary key variables, then don't use.
My hope is that, once we get the basics under control, you will get to more interesting work. (We have discussed some ideas for Insights Series using these data that we have not yet got to.)
ALTER TABLE streetevents.calls RENAME TO streetevents.calls_raw;
The code is:
ALTER TABLE streetevents.calls RENAME TO calls_raw;
import_calls.R
(Just to keep record of it)
The file to be modified is
import_calls.R
Yes. The modifications should be very minor (and should include renaming of the file to import_calls_raw.R
).
@danielacarrasco any progress on this issue?
@danielacarrasco I had some tasks waiting on this one, so I went ahead and did it myself. It might help to look at what I did in the commits above. The idea is that calls_raw
should not need to be used by "end users"; instead, end users will use calls
and company_ids
for call-level metadata and company IDs (to link with other data sources), respectively.
I will create a task related to documenting this new structure.
Rename existing table to
calls_raw
. The_raw
is to indicate that the table is relatively unprocessed data and not intended for "end users". This involves two steps:create_calls.R
(this should probably be renamedcreate_calls_raw.R
) to use the new table name.Create subsidiary tables, one of which will be called
calls
(and effectively become the newcalls
table from an "end user" perspective).