Closed iangow closed 6 years ago
From @jingyujzhang on August 15, 2018 17:8
I rely on event_title
to parse the correct fiscal year/content now. See the new code pushed to issue iangow/executive#61.
I rely on
event_title
to parse the correct fiscal year/content now. See the new code pushed to issue 61.
How many calls are parsed as earnings calls (A)? How many as not earnings calls (B, e.g., "preliminary / outlook / interim")? How many as both A and B? How many as neither A nor B?
From @jingyujzhang on August 16, 2018 19:19
Earnings conference calls are defined as event_type == 1
. Calls with event_title
like preliminary / outlook / interim
have event_type == 1
as well, so they are counted as earnings conference calls.
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL)
#> Loading required package: DBI
pg <- dbConnect(PostgreSQL())
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link"))
se_call_link_old <- tbl(pg, sql("SELECT * FROM executive.se_call_link_old"))
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
# Total amount of earnings conference calls
calls %>%
filter(event_type == 1, !is.na(event_title)) %>%
group_by(file_name) %>%
filter(last_update == max(last_update, na.rm = TRUE)) %>%
ungroup() %>%
distinct(file_name, event_title) %>%
count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 286081
# How many calls are parsed as earnings calls (A)?
# New link
se_call_link %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 138635
# Old link
se_call_link_old %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 146298
call_link <-
se_call_link %>% #138635
left_join(
calls %>%
filter(event_type == 1, !is.na(event_title)) %>%
group_by(file_name) %>%
filter(last_update == max(last_update, na.rm = TRUE)) %>%
ungroup() %>%
distinct(file_name, event_title)) %>% # 138635
compute()
#> Joining, by = "file_name"
# How many as both A and B?
call_link %>%
filter(event_title %~% 'Preliminary | Outlook | Interim | Business Update') %>%
count() # 249
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 249
# How many as neither A nor B?
# Sine B is contained in A, I count file_names not included in A
calls %>%
filter(event_type == 1, !is.na(event_title)) %>%
group_by(file_name) %>%
filter(last_update == max(last_update, na.rm = TRUE)) %>%
ungroup() %>%
distinct(file_name, event_title) %>%
anti_join(se_call_link, by = "file_name") %>%
count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 147446
Created on 2018-08-16 by the reprex package (v0.2.0).
Ideally B = not A. That's the point of this exercise. A means calls confirmed to be earnings announcements based on event_title
(not simply event_type==1L
, which is problematic by assumption); B means calls confirmed not to be earnings announcements based on event_title
. Questions are: Do we have A ∪ B = event_type==1L
calls? Is A ∩ B = ∅? If no to either, how close do we get?
From @jingyujzhang on August 21, 2018 23:0
How many calls are parsed as earnings calls (A)? How many as not earnings calls (B, e.g., "preliminary / outlook / interim")? How many as both A and B? How many as neither A nor B?
A or B: 285130 Only A: 265702 Only B: 18537 Both A and B: 891 Neither A nor B: 3359
library(dplyr)
library(RPostgreSQL)
#> Loading required package: DBI
pg <- dbConnect(PostgreSQL())
selected_calls <- tbl(pg, sql("SELECT * FROM streetevents.selected_calls"))
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
# How many calls are parsed as earnings calls (A)?
# How many as not earnings calls (B, e.g., "preliminary / outlook / interim")?
# How many as both A and B?
# How many as neither A nor B?
year_regex <- "(20[0-9]{2})|([0-9]Q[0-9]{2,4})"
calls %>% count() # 421897
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 421897
AorB <-
calls %>%
distinct(file_path, file_name, start_date, event_title) %>%
semi_join(selected_calls) %>%
group_by(file_name, file_path) %>%
filter(start_date == max(start_date, na.rm = TRUE)) %>%
ungroup() %>% # count() # 286080
# Fix event_title
# filter(file_name %in% c('1034190_T', '1002381_T', '1006640_T')) %>%
mutate(event_title = regexp_replace(event_title, 'First Quarter |1st Quarter |1Q ', 'Q1 ')) %>%
mutate(event_title = regexp_replace(event_title, 'Second Quarter |2nd Quarter |2Q ', 'Q2 ')) %>%
mutate(event_title = regexp_replace(event_title, 'Third Quarter |3rd Quarter |3Q ', 'Q3 ')) %>%
mutate(event_title = regexp_replace(event_title, 'Fourth Quarter |4th Quarter |4Q ', 'Q4 ')) %>%
mutate(fisc_qtr_data = regexp_matches(event_title, year_regex)) %>%
# Fix format like "4Q04"
mutate(fiscal_yr1 = sql("fisc_qtr_data[1]"),
fiscal_yr2 = sql("fisc_qtr_data[2]")) %>%
mutate(fiscal_yr2 = sql("substr(fiscal_yr2,3,4)")) %>%
mutate(fiscal_yr2 = sql("lpad(fiscal_yr2, 4, '20')")) %>%
mutate(fiscal_yr = coalesce(fiscal_yr1, fiscal_yr2) %>% as.integer()) %>%
select(-fisc_qtr_data, -fiscal_yr1, -fiscal_yr2) %>%
compute() # 285130
#> Joining, by = c("file_path", "file_name")
AorB %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 285130
# Neither A nor B ----
calls %>%
filter(event_type == 1) %>%
anti_join(AorB) %>%
count() # 3359
#> Joining, by = c("file_path", "file_name", "start_date", "event_title")
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 3359
# Both A and B----
# Case 1: contain "half year" and "Q2"
# Case 2: contain "full year" and "Q1|Q2|Q3|Q4"
# Case 3: contain "interim" and "Q1|Q2|Q3|Q4"
# Case 4: contain patterns like "and|& ... Business Update | Outlook"
AB_halfyear <-
AorB %>%
filter(event_title %~*% 'half year', event_title %~*% 'Q2')
AB_halfyear %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 64
AB_fullyear <-
AorB %>%
filter(event_title %~*% 'full year', event_title %~*% 'Q1|Q2|Q3|Q4')
AB_fullyear %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 738
AB_interim <-
AorB %>%
filter(event_title %~*% 'interim', event_title %~*% 'Q1|Q2|Q3|Q4')
AB_interim %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 49
ab_regex <- "(and|&) (.*)(?:(Business Update|Outlook))"
AB_outlook <-
AorB %>%
mutate(fisc_qtr_data = regexp_matches(event_title, ab_regex)) %>%
select(-fisc_qtr_data)
AB_outlook %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 42
bothAB <-
AB_halfyear %>%
union(AB_fullyear) %>%
union(AB_interim) %>%
union(AB_outlook) %>%
compute()
bothAB %>% count() # 891
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 891
# Only B ----
# Case 1: contain "half year" but not "Q2"
# Case 2: contain "full year" but not "Q1|Q2|Q3|Q4"
# Case 3: contain "interim" but not "Q1|Q2|Q3|Q4"
# Case 4: contain "Business Update|Outlook" but not "and|&"
# Case 5: contain "Preliminary"
onlyB_halfyear <-
AorB %>%
filter(event_title %~*% 'half year', event_title %!~*% 'Q2')
onlyB_halfyear %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 3092
onlyB_fullyear <-
AorB %>%
filter(event_title %~*% 'full year', event_title %!~*% 'Q1|Q2|Q3|Q4')
onlyB_fullyear %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 7103
onlyB_interim <-
AorB %>%
filter(event_title %~*% 'interim', event_title %!~*% 'Q1|Q2|Q3|Q4')
onlyB_interim %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 4547
onlyB_outlook <-
AorB %>%
filter(event_title %~*% 'business update|outlook', event_title %!~*% 'and|&')
onlyB_outlook %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 23
onlyB_pre <-
AorB %>%
filter(event_title %~*% 'preliminary')
onlyB_pre %>% count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 3828
onlyB <-
onlyB_halfyear %>%
union(onlyB_fullyear) %>%
union(onlyB_interim) %>%
union(onlyB_outlook) %>%
union(onlyB_pre) %>%
anti_join(bothAB) %>%
compute()
#> Joining, by = c("file_path", "file_name", "start_date", "event_title", "fiscal_yr")
onlyB %>% count() # 18537
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 18537
# Only A ----
onlyA <-
AorB %>%
anti_join(onlyB) %>%
anti_join(bothAB) %>%
compute()
#> Joining, by = c("file_path", "file_name", "start_date", "event_title", "fiscal_yr")
#> Joining, by = c("file_path", "file_name", "start_date", "event_title", "fiscal_yr")
onlyA %>% count() # 265702
#> # Source: lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> # [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> n
#> <dbl>
#> 1 265702
Created on 2018-08-21 by the reprex package (v0.2.0).
From @jingyujzhang on August 22, 2018 15:15
Questions are: Do we have A ∪ B = event_type==1L calls?
No, diff = 292987 - 285130 = 7857. A ∪ B = 285130,
> calls %>% filter(event_type == 1L) %>% count() # 421897
# Source: lazy query [?? x 1]
# Database: postgres 9.4.12 [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
n
<dbl>
1 292987
Is A ∩ B = ∅? If no to either, how close do we get?
No, but A ∩ B is small. And from the commit above, most of these cases are "Full Year" type.
> bothAB %>% count() # 891
# Source: lazy query [?? x 1]
# Database: postgres 9.4.12 [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
n
<dbl>
1 891
@iangow: So right now, company level linking table se_call_link
still contains A and B, I only excluded the incorrect matches which I found out if a company had more than 5 calls in a year. I can then narrow the scope down to onlyA
and bothAB
(earnings calls (A), not earnings calls (B, e.g., "preliminary / outlook / interim")). Should I keep A ∩ B
or exclude this type as well?
From @jingyujzhang on August 24, 2018 15:56
@iangow : Any comment on this issue?
I think you have included cases in B that belong in A. Just because something is "interim" does not mean it is not an earnings announcement. In many parts of the world there are no quarters and "interim" would mean "half-year" (i.e., Q2 when reporting quarterly).
I think the words "outlook" only (i.e., without other indicators of an announcement too) or "preliminary" seem to be the indicators of cases that aren't earnings announcements.
> onlyB %>%
+ mutate(adjective = regexp_matches(event_title, "^.*?(?=\\s*20\\d{2,})")) %>%
+ mutate(adjective = unnest(adjective)) %>%
+ count(adjective) %>%
+ arrange(desc(n)) %>%
+ print(n=20)
# Source: lazy query [?? x 2]
# Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
# Ordered by: desc(n)
adjective n
<chr> <dbl>
1 Full Year 6985
2 Interim 4496
3 Preliminary 3342
4 Half Year 3045
5 Preliminary Q4 82
6 Preliminary Q1 64
7 Preliminary Q3 54
8 Preliminary Q2 40
9 Q4 25
10 "" 23
11 Q3 21
12 Full Year Results 18
13 Preliminary Full Year 16
14 Full year 15
15 Q2 15
16 Q1 12
17 Preliminary FY 10
18 Half Year Results 10
19 Interim Results 9
20 Half year 6
# ... with more rows
OK. One approach to identifying problems is to flag cases with more than one event_type==1
call per permno
in a given quarter (there can't be more than 5 calls in a year without at least one quarter being a problem).
I put the results of the following code here. It seems that most such cases are fine, but I'd say that more problem cases are created by bad PERMNO matches than by inclusion of "preliminary" cases.
The way to fix the bad PERMNO matches is to do so here. For example, file_name
equal to 515127_T
has been manually matched to permno==10104L
thanks to "Fix by Nastia/Vincent in January 2015", but it should be matched to permno==10078L
. (I fixed this one, so it should go away when crsp_link
code is run again.)
Some cases where there are more than one call per quarter are legitimate. For example,
morning sessions and afternoon sessions
But these are sufficiently rare, that I think they're not worth worrying too much about.
library(dplyr, warn.conflicts = FALSE)
library(DBI)
#> Loading required package: DBI
pg <- dbConnect(RPostgreSQL::PostgreSQL())
rs <- dbExecute(pg, "SET search_path TO streetevents, public")
rs <- dbExecute(pg, "SET work_mem = '10GB'")
selected_calls <- tbl(pg, "selected_calls")
calls <- tbl(pg, "calls")
crsp_link <- tbl(pg, "crsp_link")
selected_calls <-
calls %>%
group_by(file_name) %>%
summarize(last_update = max(last_update, na.rm = TRUE)) %>%
compute(index = "file_name")
selected_calls %>% filter(file_name == "10784087_T")
problems <-
crsp_link %>%
select(-matches("matcH_type")) %>%
filter(!is.na(permno)) %>%
inner_join(calls) %>%
semi_join(selected_calls) %>%
mutate(start_date = sql("start_date::date")) %>%
filter(event_type == 1L) %>%
distinct(file_name, last_update, permno, start_date, event_title) %>%
mutate(quarter = date_part('quarter', start_date),
year = date_part('year', start_date)) %>%
group_by(permno, year, quarter) %>%
filter(n() > 1) %>%
ungroup() %>%
select(file_name, last_update, permno, quarter, year, start_date, event_title) %>%
arrange(permno, year, quarter) %>%
compute()
library(readr)
problems %>%
collect() %>%
write_csv("~/Google Drive/data/streetevents/mult_in_qtr.csv")
Hopefully addressing "Classify cases with multiple calls in quarter" will make it easy to address this one too.
Moved issue from executive
to here, as it has nothing to do with executive
.
@iangow : Mult_in_qtr is done. Each problematic call can be put into permno
, multiple
, non_earnings
, not_sure
and preliminary
.
For the permno
type, I need to correct them in streetevents.manual_permno_matches.
For the other types, are we going to exclude them when calculating streetevents.crsp_link
? If not, I need to add more match_type
s.
@iangow : Mult_in_qtr is done. Each problematic call can be put into
permno
,multiple
,non_earnings
,not_sure
andpreliminary
.For the
permno
type, I need to correct them in streetevents.manual_permno_matches.
I already did that.
For the other types, are we going to exclude them when calculating
streetevents.crsp_link
? If not, I need to add morematch_type
s.
Don't do anything yet.
Make sure to pull the latest code from StreetEvents before committing any more code ... I had some issues with the two repositories (streetevents
and streetevents_private
) getting out of sync and may have complicated the commit history in fixing this.
I will close this one too. I will create a follow-up issue to handle the "problematic calls" flagged in mult_in_qtr
. Let me know if anything is urgently depending on this code.
From @iangow on July 30, 2018 23:35
See comment here.
For now, I would just get the (
dplyr
-based) code set up. It probably makes sense to start withse_crsp_link
and try to classify calls there. Use 'find in files' in RStudio to searchstreetevents
andexecutive
repositories for code that usesevent_title
(?), as I think we may have useful code snippets for parsing this field.This relates to #61.
Copied from original issue: iangow/executive#63