iangow / se_core

Core code for StreetEvents data
7 stars 5 forks source link

Restructure calls table #4

Closed iangow closed 5 years ago

iangow commented 5 years ago
  1. Rename existing table to calls_raw. The _raw is to indicate that the table is relatively unprocessed data and not intended for "end users". This involves two steps:

    • Change the existing table using SQL. I think the following will work (Because this is a one-off, there's no need to put this statement anywhere in GitHub.)
      ALTER TABLE streetevents.calls RENAME TO streetevents.calls_raw;
    • Modify the code in create_calls.R (this should probably be renamed create_calls_raw.R) to use the new table name.
  2. Create subsidiary tables, one of which will be called calls (and effectively become the new calls table from an "end user" perspective).

    • I will detail this in comments below in a little while.
iangow commented 5 years ago

Here is some untested(!) R code for one of the two subsidiary tables:


library(DBI)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

pg <- dbConnect(RPostgres:Postgres())
dbExecute(pg, "SET work_mem TO '3GB'") 

# Define PG data ----

# Project schema
rs <- dbExecute(pg, "SET search_path TO streetevents, public")

calls_raw <- tbl(pg, "calls_raw"))

rs <- dbExecute(pg, "DROP TABLE IF EXISTS calls")

calls <-
    calls_raw %>%
    semi_join(file_list, by = c("file_path", "file_name", "last_update")) %>%
    select(-file_path, -sha1, -company_id, -cusip, -sedol, -isin) %>%
    distinct() %>%
    compute(name = "calls", temporary = FALSE)

# Should check that calls are uniquely identified by file_name and
# last_update
calls %>%
    group_by(file_name, last_update) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    count()

db_comment <- paste0("CREATED USING create_calls.R from ",
                     "GitHub iangow/se_core ON ", Sys.time())
dbGetQuery(pg, paste0("COMMENT ON TABLE calls IS '",
                      db_comment, "';"))

dbGetQuery(pg, "ALTER TABLE calls OWNER TO streetevents")
dbGetQuery(pg, "GRANT SELECT ON calls TO streetevents_access")

dbDisconnect(pg)

I am not sure whether the semi_join(call_files) does anything (or is necessary). Perhaps worth checking if it affects the result.

iangow commented 5 years ago

The second subsidiary table would be called something like company_ids and would select file_name, last_update (the primary key variables) and company_id, cusip, sedol, isin but only for cases where at least one of these is not NULL (SQL, which sort of equates to being not NA in R or not None in Python ... for this purpose, these are the same thing).

So you might have a filter something like filter(!is.na(company_id) | !is.na(cusip) (adding in filters for isin and sedol. I am not sure whether unique would be necessary. If not and the result gives a unique row for each value of the candidate primary key variables, then don't use.

iangow commented 5 years ago

My hope is that, once we get the basics under control, you will get to more interesting work. (We have discussed some ideas for Insights Series using these data that we have not yet got to.)

danielacarrasco commented 5 years ago
ALTER TABLE streetevents.calls RENAME TO streetevents.calls_raw;

The code is:

ALTER TABLE streetevents.calls RENAME TO calls_raw;

(Just to keep record of it)

iangow commented 5 years ago

The file to be modified is import_calls.R

Yes. The modifications should be very minor (and should include renaming of the file to import_calls_raw.R).

iangow commented 5 years ago

@danielacarrasco any progress on this issue?

iangow commented 5 years ago

@danielacarrasco I had some tasks waiting on this one, so I went ahead and did it myself. It might help to look at what I did in the commits above. The idea is that calls_raw should not need to be used by "end users"; instead, end users will use calls and company_ids for call-level metadata and company IDs (to link with other data sources), respectively.

I will create a task related to documenting this new structure.