From @azakolyukina on July 5, 2018 23:51
@jingyujzhang:
Note that both se_calls_link and se_exec_link have to be updated.
@jingyujzhang
@azakolyukina I'm not sure what "the logic in 3bb8473" is.
Below is some code that may help. It probably makes sense to extract the company name from event_title, manually map distinct company names (along with cusip or some other identifier) to permno, and then blow that mapping up into a mapping from file_name to permno and add it to se_call_link.
Please check that these aren't in the manual matches that feed into se_call_link. Not sure why these are still there. This is a legacy issue for calls before ~2013 and should not affect more recent calls.
Also, we should remove file_path and start_date from se_call_link.
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgreSQL::PostgreSQL())
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link"))
se_link <-
  se_call_link %>%
  select(-file_path, -start_date)

calls_mod <-
  calls %>%
  filter(event_type == 1L) %>%
  distinct(file_name, cusip, start_date, event_title) %>%
  # regexp_matches() returns a text array; dbplyr cannot index arrays
  # natively, so raw sql() pastes year and quarter into, e.g., "20023".
  mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
  mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]"))
dupes <-
  calls_mod %>%
  inner_join(se_link) %>%
  group_by(company_id, fisc_qtr_data) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  compute()
#> Joining, by = "file_name"
dupes
#> # Source: table<ywycxfsbfy> [?? x 8]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> file_name cusip start_date event_title fisc_qtr_data permno
#> <chr> <chr> <dttm> <chr> <chr> <int>
#> 1 666954_T 11162… 2002-10-23 19:00:00 Q3 2002 Foun… 20023 86881
#> 2 651583_T 11162… 2002-08-14 18:30:00 Q3 2002 Broc… 20023 86881
#> 3 696314_T 11162… 2003-01-29 21:00:00 Q4 2002 Foun… 20024 86881
#> 4 675859_T 11162… 2002-11-21 20:30:00 Q4 2002 Broc… 20024 86881
#> 5 707175_T 11162… 2003-02-12 20:30:00 Q1 2003 Broc… 20031 86881
#> 6 723669_T 11162… 2003-04-23 19:00:00 Q1 2003 Foun… 20031 86881
#> 7 763437_T 11162… 2003-07-23 19:00:00 Q2 2003 Foun… 20032 86881
#> 8 734682_T 11162… 2003-05-14 19:30:00 Q2 2003 Broc… 20032 86881
#> 9 796184_T 11162… 2003-10-22 19:00:00 Q3 2003 Foun… 20033 86881
#> 10 769964_T 11162… 2003-08-13 19:00:00 Q3 2003 Broc… 20033 86881
#> # ... with more rows, and 2 more variables: company_id <int>,
#> # fy_end <date>
dupes %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 6330
dupes %>%
  distinct(company_id) %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 495
dupes %>%
  group_by(company_id) %>%
  count() %>%
  arrange(desc(n))
#> # Source: lazy query [?? x 2]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> # Groups: company_id
#> # Ordered by: desc(n)
#> company_id n
#> <int> <dbl>
#> 1 3897 125
#> 2 2874 114
#> 3 4215 103
#> 4 3970 89
#> 5 4427 72
#> 6 4063 69
#> 7 4241 66
#> 8 4930 62
#> 9 3478 58
#> 10 4604 56
#> # ... with more rows
Created on 2018-07-06 by the reprex package (v0.2.0).
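To make the manual-mapping step concrete: once distinct company names have been mapped to permno by hand, blowing that mapping up to the file level is a simple join. A minimal sketch, reusing calls_mod from above; the CSV name and its columns (cusip, permno) are hypothetical:
library(dplyr, warn.conflicts = FALSE)
library(readr)

# Hypothetical hand-filled file: one row per distinct company name, with
# cusip carried along and the matching permno added manually.
name_permno_map <- read_csv("manual_name_permno.csv")

# Blow the name-level mapping up into a file_name-to-permno mapping that
# can be added to se_call_link.
file_permno_map <-
  calls_mod %>%
  collect() %>%
  inner_join(name_permno_map, by = "cusip") %>%
  distinct(file_name, permno)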
From @azakolyukina on July 6, 2018 0:30
@azakolyukina I'm not sure what "the logic in 3bb8473" is.
To resolve incorrect matches, I compare the company name in Equilar with the company name in event_title. This is how we discovered the mess in streetevents tickers! I do NOT use manual mapping in 3bb8473---just take a careful look.
You must be referring to a different commit; I don't see anything about company names in https://github.com/azakolyukina/nonans_product/commit/3bb8473b509256c8179e57c671c30535cd705568.
This is how we discovered the mess in StreetEvents tickers!
But then we did a huge manual-matching exercise and we should use the results of that (none of these cases should be new). I don't know why we still have cases like this.
From @azakolyukina on July 6, 2018 0:51
You must be referring to a different commit; I don't see anything about company names in azakolyukina/nonans_product@3bb8473.
This is very strange. You are right! I am trying to recover and understand how it even happened that I lost my original code.
From @azakolyukina on July 6, 2018 0:52
I don't know why we still have cases like this.
..probably because I have a tendency to find bugs
OK. Not clear why these bugs weren't found in 2014. Note that these issues need to be fixed in streetevents.crsp_link, not here. Maybe that's where you have the code.
From @azakolyukina on July 6, 2018 0:58
(@jingyujzhang In the above, I am referring to this "logic")
Here, I found it. nonans_product/create_data/create_se_link.R follows the matching with comparisons such as:
mutate(is_aka_match = (call_desc %~*% aka_company_name),
       is_preliminary = (call_desc %~*% 'preliminary')) %>%
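For reference, %~*% is not a native R operator here: dbplyr passes unknown infix operators through to the backend, where ~* is PostgreSQL's case-insensitive regex match. A minimal sketch with a recent dbplyr's simulated connection (output approximate):
library(dbplyr, warn.conflicts = FALSE)

# `x %~*% y` is rendered as the PostgreSQL infix operator `x ~* y`.
translate_sql(call_desc %~*% aka_company_name, con = simulate_postgres())
#> <SQL> "call_desc" ~* "aka_company_name"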
From @azakolyukina on July 6, 2018 1:03
Note that these issues need to be fixed in streetevents.crsp_link, not here.
@jingyujzhang: please follow @iangow's instructions and fix it in crsp_link. I think my approach above will still apply.
You don't need Equilar to see these issues:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgreSQL::PostgreSQL())
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
crsp_link <- tbl(pg, sql("SELECT * FROM streetevents.crsp_link"))
se_link <-
  crsp_link %>%
  filter(!is.na(permno)) %>%
  distinct(file_name, permno)

calls_mod <-
  calls %>%
  filter(event_type == 1L) %>%
  distinct(file_name, cusip, start_date, event_title) %>%
  mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
  mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]"))
dupes <-
  calls_mod %>%
  inner_join(se_link) %>%
  group_by(permno, fisc_qtr_data) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  compute()
#> Joining, by = "file_name"
dupes
#> # Source: table<vlwustzbke> [?? x 6]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> file_name cusip start_date event_title fisc_qtr_data permno
#> <chr> <chr> <dttm> <chr> <chr> <int>
#> 1 562913_T 68389… 2001-12-13 21:30:00 Q2 2002 Orac… 20022 10104
#> 2 646764_T 68389… 2002-07-18 19:00:00 Q2 2002 Peop… 20022 10104
#> 3 645421_T 68389… 2002-07-23 19:00:00 Q2 2002 SeeB… 20022 10104
#> 4 640000_T 68389… 2002-07-17 13:00:00 Q2 2002 Rete… 20022 10104
#> 5 619670_T 68389… 2002-05-22 19:00:00 Q2 2002 J.D.… 20022 10104
#> 6 603385_T 68389… 2002-03-14 21:30:00 Q3 2002 Orac… 20023 10104
#> 7 667639_T 68389… 2002-10-17 19:00:00 Q3 2002 Peop… 20023 10104
#> 8 666050_T 68389… 2002-10-23 19:00:00 Q3 2002 SeeB… 20023 10104
#> 9 696301_T 68389… 2003-01-22 21:00:00 Q4 2002 Peop… 20024 10104
#> 10 695782_T 68389… 2003-01-28 21:00:00 Q4 2002 SeeB… 20024 10104
#> # ... with more rows
dupes %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 7002
dupes %>%
  distinct(permno) %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 621
dupes %>%
  group_by(permno) %>%
  count() %>%
  arrange(desc(n))
#> # Source: lazy query [?? x 2]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> # Groups: permno
#> # Ordered by: desc(n)
#> permno n
#> <int> <dbl>
#> 1 21178 125
#> 2 85789 114
#> 3 10104 103
#> 4 81061 89
#> 5 50017 72
#> 6 22779 69
#> 7 35051 68
#> 8 75819 62
#> 9 90441 61
#> 10 37584 58
#> # ... with more rows
Created on 2018-07-06 by the reprex package (v0.2.0).
@jingyujzhang
If you can fill out this spreadsheet, then I think it should be easy to use that data to resolve these issues.
Code to create the spreadsheet above:
library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(readr)
pg <- dbConnect(RPostgreSQL::PostgreSQL())
rs <- dbExecute(pg, "SET work_mem = '10GB'")
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
crsp_link <- tbl(pg, sql("SELECT * FROM streetevents.crsp_link"))
stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
crsp_names <-
  stocknames %>%
  distinct(permno, comnam) %>%
  group_by(permno) %>%
  summarize(comnams = array_agg(comnam)) %>%
  compute()

se_link <-
  crsp_link %>%
  filter(!is.na(permno)) %>%
  distinct(file_name, permno)
# |Q&A Session
regex <- "(?:Earnings(?: Conference Call)?|Financial and Operating Results|Financial Results Call|"
regex <- paste0(regex, "Results Conference Call|Analyst Meeting)")
regex <- paste0("^(.*) (", regex, ")")
qtr_regex <- "(Preliminary Half Year|Full Year|Q[1-4])"
year_regex <- "(20[0-9]{2}(?:-[0-9]{2}|/20[0-9]{2})?)"
period_regex <- paste0("^", qtr_regex, " ", year_regex," (.*)")
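# A quick local sanity check of period_regex on a made-up title; PostgreSQL's
# regexp_matches() applies essentially the same pattern server-side:
#   x <- "Q3 2002 Acme Corp Earnings Conference Call"
#   regmatches(x, regexec(period_regex, x, perl = TRUE))[[1]]
#   # -> full match, then "Q3", "2002", "Acme Corp Earnings Conference Call"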
calls_mod <-
  calls %>%
  filter(event_type == 1L) %>%
  distinct(file_name, cusip, start_date, event_title) %>%
  mutate(fisc_qtr_data = regexp_matches(event_title, period_regex)) %>%
  mutate(event_co_name = sql("fisc_qtr_data[3]")) %>%
  mutate(event_co_name = regexp_matches(event_co_name, regex)) %>%
  mutate(event_co_name = sql("event_co_name[1]"),
         event_desc = sql("event_co_name[2]")) %>%
  mutate(fisc_qtr_data = sql("fisc_qtr_data[1] || ' ' || fisc_qtr_data[2]")) %>%
  compute()

# Calls whose event_title did not match period_regex above:
calls %>%
  filter(event_type == 1L) %>%
  anti_join(calls_mod, by = c("file_name"))
dupes <-
  calls_mod %>%
  inner_join(se_link) %>%
  group_by(permno, fisc_qtr_data) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  compute()
dupes
dupes %>%
  count()

dupes %>%
  distinct(permno) %>%
  count()

dupes %>%
  group_by(permno) %>%
  count() %>%
  arrange(desc(n))

dupes %>%
  inner_join(crsp_names) %>%
  distinct(event_co_name, permno, comnams) %>%
  arrange(permno) %>%
  collect() %>%
  write_csv("~/Google Drive/data/streetevents/name_checks.csv")
From @jingyujzhang on July 6, 2018 5:44
@iangow: For merges / reorganizations, the old companies are not the same as the new company (i.e. choose FALSE in the google sheet), right?
Unless the name (or an obvious variant thereon) appears in the list of names, then FALSE.
If in doubt, make a comment with +iandgow@gmail.com and I will respond.
From @jingyujzhang on July 6, 2018 7:14
@iangow: I'm done with the google sheet and I've inserted comments as well.
From @azakolyukina on July 25, 2018 19:46
@iangow and @jingyujzhang:
Are you done with this issue?
From @jingyujzhang on July 25, 2018 20:14
Yes, this issue can be closed.
We should not close the issue until we have confirmed that the problem it is intended to solve has gone away. It seems it hasn't. Have we incorporated the data we collected into the relevant table?
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgreSQL::PostgreSQL())
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link"))
calls_mod <-
  calls %>%
  filter(event_type == 1L) %>%
  distinct(file_name, cusip, start_date, event_title) %>%
  mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
  mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]"))
dupes <-
  calls_mod %>%
  inner_join(se_call_link) %>%
  group_by(company_id, fisc_qtr_data) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  compute()
#> Joining, by = "file_name"
dupes
#> # Source: table<yhfkxubjvv> [?? x 8]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> file_name cusip start_date event_title fisc_qtr_data permno
#> <chr> <chr> <dttm> <chr> <chr> <int>
#> 1 3645877_T 12618… 2011-01-13 13:00:00 Q4 2010 Char… 20104 85972
#> 2 3645877_T 12618… 2011-01-13 13:00:00 Q4 2010 Char… 20104 85972
#> 3 3967370_T 12618… 2011-04-28 11:00:00 Q1 2011 CRA … 20111 85972
#> 4 3967370_T 12618… 2011-04-28 11:00:00 Q1 2011 CRA … 20111 85972
#> 5 4156440_T 12618… 2011-07-28 11:00:00 Q2 2011 CRA … 20112 85972
#> 6 4156440_T 12618… 2011-07-28 11:00:00 Q2 2011 CRA … 20112 85972
#> 7 4218416_T 12618… 2011-10-27 12:00:00 Q3 2011 CRA … 20113 85972
#> 8 4218416_T 12618… 2011-10-27 12:00:00 Q3 2011 CRA … 20113 85972
#> 9 1992773_T 31561… 2008-10-07 23:30:00 Q4 2008 F5 N… 20084 86964
#> 10 1998857_T 31561… 2008-10-22 19:30:00 Q4 2008 F5 N… 20084 86964
#> # ... with more rows, and 2 more variables: company_id <int>,
#> # fy_end <date>
dupes %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 1346
dupes %>%
  distinct(company_id) %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 270
dupes %>%
  group_by(company_id) %>%
  count() %>%
  arrange(desc(n))
#> # Source: lazy query [?? x 2]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> # Groups: company_id
#> # Ordered by: desc(n)
#> company_id n
#> <int> <dbl>
#> 1 2874 114
#> 2 3478 58
#> 3 4633 54
#> 4 20591 48
#> 5 40600 24
#> 6 3605 22
#> 7 9219 22
#> 8 11241 18
#> 9 5649 16
#> 10 36240 16
#> # ... with more rows
Created on 2018-07-26 by the reprex package (v0.2.0).
It may be that the code for dupes above needs refinement to exclude cases that are not really problems. But in any case, we should confirm that all is OK.
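For instance, the rows above where the same file_name appears twice suggest one refinement: de-duplicate on file_name before counting. A sketch, using the tables as defined above:
dupes_refined <-
  calls_mod %>%
  inner_join(se_call_link, by = "file_name") %>%
  distinct(file_name, company_id, fisc_qtr_data) %>%
  group_by(company_id, fisc_qtr_data) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  compute()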
From @azakolyukina on July 26, 2018 16:03
But in any case, we should confirm that all is OK.
@jingyujzhang: Could you please confirm that you have resolved this?
From @jingyujzhang on July 26, 2018 20:30
@iangow (cc: @azakolyukina): The code above is problematic. The streetevents.calls table can't be used directly; we computed streetevents.selected_calls before. It is possible for one (company_id, fisc_qtr_data) pair to have more than one file according to calls records. The 336 dupes below are due to changes of fy_end. If a company changed its fiscal year end at some point, proxy_company, where we get fy_end, can have two records within a year (see the example of company_id = 2517 below). I think it's OK to leave these dups as they are.
> proxy_company %>%
+   filter(company_id == 2517) %>%
+   arrange(fy_end)
# Source: lazy query [?? x 2]
# Database: postgres 9.4.12 [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
# Ordered by: fy_end
company_id fy_end
<int> <date>
1 2517 2000-09-30
2 2517 2001-09-30
3 2517 2002-09-30
4 2517 2002-12-31
5 2517 2003-12-31
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgreSQL::PostgreSQL())
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link"))
selected_calls <- tbl(pg, sql("select * from public.selected_calls"))
calls_mod <- # file_name, cusip, start_date, event_title, fisc_qtr_data
  selected_calls %>%
  inner_join(calls) %>%
  group_by(file_name) %>%
  filter(last_update == max(last_update, na.rm = TRUE)) %>%
  ungroup() %>%
  distinct(file_name, cusip, start_date, event_title) %>%
  mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
  mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]")) %>%
  compute()
#> Joining, by = c("file_name", "file_path")
dupes <-
  calls_mod %>%
  inner_join(se_call_link) %>% # by = "file_name"
  group_by(company_id, fisc_qtr_data, file_name) %>% # <- Add file_name here
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(file_name, fisc_qtr_data, cusip, start_date, fy_end) %>%
  select(file_name, fisc_qtr_data, cusip, start_date, fy_end, everything()) %>%
  compute()
#> Joining, by = "file_name"
dupes %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 336
Created on 2018-07-26 by the reprex package (v0.2.0).
@jingyujzhang
I made some comments on the commit.
We should edit the "test code" to use the correct table. Also it should be pretty easy to address issues created by short fiscal years. We should aim to eliminate unidentified issues, then close the issue. (These cases slipped through the cracks back in late 2014!)
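On the short-fiscal-year point, a hedged sketch of one way to set those cases aside, reusing pg and dupes from the code above; proxy_company is referenced without a schema because I am not sure where that table lives:
# Companies with two fy_end dates in the same calendar year changed their
# fiscal year; dupes attributable to them can be reviewed separately.
proxy_company <- tbl(pg, sql("SELECT * FROM proxy_company"))

fy_changers <-
  proxy_company %>%
  mutate(cal_year = date_part('year', fy_end)) %>%
  group_by(company_id, cal_year) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  distinct(company_id)

dupes_remaining <-
  dupes %>%
  anti_join(fy_changers, by = "company_id")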
Prerequisite for https://github.com/azakolyukina/nonans_product/issues/90
From @jingyujzhang on July 29, 2018 9:41
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgreSQL::PostgreSQL())
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link"))
selected_calls <- tbl(pg, sql("select * from streetevents.selected_calls"))
calls_mod <- # file_name, cusip, start_date, event_title, fisc_qtr_data
  selected_calls %>%
  inner_join(calls) %>%
  group_by(file_name) %>%
  filter(last_update == max(last_update, na.rm = TRUE)) %>%
  ungroup() %>%
  distinct(file_name, cusip, start_date, event_title) %>%
  mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
  mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]")) %>%
  compute()
#> Joining, by = c("file_name", "file_path")
dupes <-
  calls_mod %>%
  inner_join(se_call_link) %>% # by = "file_name"
  group_by(company_id, fisc_qtr_data, file_name) %>% # <- Add file_name here
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(file_name, fisc_qtr_data, cusip, start_date, fy_end) %>%
  select(file_name, fisc_qtr_data, cusip, start_date, fy_end, everything()) %>%
  compute()
#> Joining, by = "file_name"
dupes
#> # Source: lazy query [?? x 8]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> # Ordered by: file_name, fisc_qtr_data, cusip, start_date, fy_end
Created on 2018-07-29 by the reprex package (v0.2.0).
Why public.selected_calls? Is this not in StreetEvents?
From @jingyujzhang on July 29, 2018 9:48
I updated the last comment. selected_calls is in streetevents.
se_call_link, se_exec_link, crsp_link, and selected_calls are up to date.
@jingyujzhang
@azakolyukina closed https://github.com/azakolyukina/nonans_product/issues/90 but not this issue. But I think that issue depends on this one, which is still open.
From @azakolyukina on July 29, 2018 22:51
@iangow:
@azakolyukina closed azakolyukina/nonans_product#90 but not this issue. But I think that issue depends on this one, which is still open.
I have not found new (obvious) glitches for https://github.com/azakolyukina/nonans_product/issues/90. However, the table se_call_link can still be improved based on a glitch that surfaced and was corrected for https://github.com/azakolyukina/nonans_product/issues/90.
@jingyujzhang Can you make sense of this issue? I am not sure what this is about. I don't know what the "glitch that surfaced … and was corrected for azakolyukina/nonans_product#90" is.
@iangow: It's related to the consistency check Nastia did there - to make sure there is no more than one call per fisc_qtr_data, no more than five calls per fiscal year (manually checking those with five calls to see whether the extra calls make sense), one file_name matches one company_id, etc.
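For concreteness, a sketch of the last of those checks (each file_name matching exactly one company_id), using se_call_link as defined earlier in this thread:
# This should return a count of 0 if the check passes.
se_call_link %>%
  group_by(file_name) %>%
  summarize(n_companies = n_distinct(company_id)) %>%
  filter(n_companies > 1) %>%
  count()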
Previously, se_call_link didn't pass the consistency check, but I modified it. The se_link table is based on the modified version of se_call_link and has been finalized. Even if se_call_link becomes more perfect in the future, se_link won't change.
There is no se_link or se_calls_link table here. Where are these tables? Where is the code that creates them?
In executive, I see these:
postgres=# \dt+ executive.se_*
List of relations
Schema | Name | Type | Owner | Size | Description
-----------+------------------+-------+-----------+---------+---------------------------------------------------------------------------------------
executive | se_call_link | table | executive | 7080 kB | CREATED USING create_se_call_link.R ON 2018-08-15 10:46:14
executive | se_call_link_old | table | executive | 7464 kB | CREATED USING create_se_call_link.R ON 2018-07-29 18:08:52
executive | se_exec_link | table | executive | 25 MB | CREATED USING create_se_exec_link.R IN GitHub iangow/executive ON 2018-08-15 11:25:39
executive | se_exec_link_old | table | executive | 27 MB | CREATED USING create_se_exec_link.R IN GitHub iangow/executive ON 2018-07-29 18:13:02
(4 rows)
But no se_link. If we are not using the _old tables, we should probably delete them.
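For reference, the clean-up amounts to two drops (a sketch; assumes the usual privileges on the executive schema):
library(DBI)

pg <- dbConnect(RPostgreSQL::PostgreSQL())
dbExecute(pg, "DROP TABLE executive.se_call_link_old")
dbExecute(pg, "DROP TABLE executive.se_exec_link_old")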
Old tables were there for comparison in case I missed anything. I deleted them.
OK. I will close this issue. I will revisit executive.se_call_link at a later date.
From @azakolyukina on July 5, 2018 23:49
@jingyujzhang (@iangow ):
se_calls_link contains incorrect matches such as below. One way to resolve those is to follow the logic in 3bb8473 (@jingyujzhang, we did talk about this with you before).
The idea is that for multiple matches in {company_id, fisc_qtr_data}, impose a restriction on company name. fisc_qtr_data can be extracted as in nonans_rate
Copied from original issue: iangow/executive#60