iangow / streetevents_private

2 stars 1 forks source link

Classify calls as earnings announcements or not. #39

Closed iangow closed 6 years ago

iangow commented 6 years ago

From @iangow on July 30, 2018 23:35

See comment here.

For now, I would just get the (dplyr-based) code set up. It probably makes sense to start with se_crsp_link and try to classify calls there. Use 'find in files' in RStudio to search streetevents and executive repositories for code that uses event_title (?), as I think we may have useful code snippets for parsing this field.

This relates to #61.

Copied from original issue: iangow/executive#63

iangow commented 6 years ago

From @jingyujzhang on August 15, 2018 17:8

I rely on event_title to parse the correct fiscal year/content now. See the new code pushed to issue iangow/executive#61.

iangow commented 6 years ago

I rely on event_title to parse the correct fiscal year/content now. See the new code pushed to issue 61.

How many calls are parsed as earnings calls (A)? How many as not earnings calls (B, e.g., "preliminary / outlook / interim")? How many as both A and B? How many as neither A nor B?

iangow commented 6 years ago

From @jingyujzhang on August 16, 2018 19:19

Earnings conference calls are defined as event_type == 1. Calls with event_title like preliminary / outlook / interim have event_type == 1 as well, so they are counted as earnings conference calls.

library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL)
#> Loading required package: DBI

pg <- dbConnect(PostgreSQL())

se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link"))
se_call_link_old <- tbl(pg, sql("SELECT * FROM executive.se_call_link_old"))
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))

# Total amount of earnings conference calls
calls %>% 
    filter(event_type == 1, !is.na(event_title)) %>% 
    group_by(file_name) %>% 
    filter(last_update == max(last_update, na.rm = TRUE)) %>% 
    ungroup() %>% 
    distinct(file_name, event_title) %>% 
    count() 
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>        n
#>    <dbl>
#> 1 286081

# How many calls are parsed as earnings calls (A)?
# New link
se_call_link %>% count() 
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>        n
#>    <dbl>
#> 1 138635

# Old link
se_call_link_old %>% count() 
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>        n
#>    <dbl>
#> 1 146298

call_link <- 
    se_call_link %>% #138635
    left_join(
        calls %>% 
            filter(event_type == 1, !is.na(event_title)) %>% 
            group_by(file_name) %>% 
            filter(last_update == max(last_update, na.rm = TRUE)) %>% 
            ungroup() %>% 
            distinct(file_name, event_title)) %>% # 138635
    compute()
#> Joining, by = "file_name"

# How many as both A and B?
call_link %>% 
    filter(event_title %~% 'Preliminary | Outlook | Interim | Business Update') %>% 
    count() # 249
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1   249

# How many as neither A nor B?
# Sine B is contained in A, I count file_names not included in A
calls %>% 
    filter(event_type == 1, !is.na(event_title)) %>% 
    group_by(file_name) %>% 
    filter(last_update == max(last_update, na.rm = TRUE)) %>% 
    ungroup() %>% 
    distinct(file_name, event_title) %>% 
    anti_join(se_call_link, by = "file_name") %>% 
    count() 
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>        n
#>    <dbl>
#> 1 147446

Created on 2018-08-16 by the reprex package (v0.2.0).

iangow commented 6 years ago

Ideally B = not A. That's the point of this exercise. A means calls confirmed to be earnings announcements based on event_title (not simply event_type==1L, which is problematic by assumption); B means calls confirmed not to be earnings announcements based on event_title. Questions are: Do we have A ∪ B = event_type==1L calls? Is A ∩ B = ∅? If no to either, how close do we get?

iangow commented 6 years ago

From @jingyujzhang on August 21, 2018 23:0

How many calls are parsed as earnings calls (A)? How many as not earnings calls (B, e.g., "preliminary / outlook / interim")? How many as both A and B? How many as neither A nor B?

A or B: 285130 Only A: 265702 Only B: 18537 Both A and B: 891 Neither A nor B: 3359

library(dplyr)
library(RPostgreSQL)
#> Loading required package: DBI
pg <- dbConnect(PostgreSQL())

selected_calls <- tbl(pg, sql("SELECT * FROM streetevents.selected_calls")) 
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))

# How many calls are parsed as earnings calls (A)? 
# How many as not earnings calls (B, e.g., "preliminary / outlook / interim")? 
# How many as both A and B? 
# How many as neither A nor B?
year_regex <- "(20[0-9]{2})|([0-9]Q[0-9]{2,4})"
calls %>% count() # 421897
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>        n
#>    <dbl>
#> 1 421897
AorB <-
    calls %>% 
    distinct(file_path, file_name, start_date, event_title) %>% 
    semi_join(selected_calls) %>% 
    group_by(file_name, file_path) %>% 
    filter(start_date == max(start_date, na.rm = TRUE)) %>% 
    ungroup() %>% # count() # 286080
    # Fix event_title
    # filter(file_name %in% c('1034190_T', '1002381_T', '1006640_T')) %>% 
    mutate(event_title = regexp_replace(event_title, 'First Quarter |1st Quarter |1Q ', 'Q1 ')) %>% 
    mutate(event_title = regexp_replace(event_title, 'Second Quarter |2nd Quarter |2Q ', 'Q2 ')) %>% 
    mutate(event_title = regexp_replace(event_title, 'Third Quarter |3rd Quarter |3Q ', 'Q3 ')) %>% 
    mutate(event_title = regexp_replace(event_title, 'Fourth Quarter |4th Quarter |4Q ', 'Q4 ')) %>% 
    mutate(fisc_qtr_data = regexp_matches(event_title, year_regex)) %>% 
    # Fix format like "4Q04"
    mutate(fiscal_yr1 = sql("fisc_qtr_data[1]"),
           fiscal_yr2 = sql("fisc_qtr_data[2]")) %>% 
    mutate(fiscal_yr2 = sql("substr(fiscal_yr2,3,4)")) %>% 
    mutate(fiscal_yr2 = sql("lpad(fiscal_yr2, 4, '20')")) %>% 
    mutate(fiscal_yr = coalesce(fiscal_yr1, fiscal_yr2) %>% as.integer()) %>% 
    select(-fisc_qtr_data, -fiscal_yr1, -fiscal_yr2) %>% 
    compute() # 285130
#> Joining, by = c("file_path", "file_name")

AorB %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>        n
#>    <dbl>
#> 1 285130
# Neither A nor B ----
calls %>% 
    filter(event_type == 1) %>% 
    anti_join(AorB) %>% 
    count() # 3359
#> Joining, by = c("file_path", "file_name", "start_date", "event_title")
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1  3359

# Both A and B----
# Case 1: contain "half year" and "Q2"
# Case 2: contain "full year" and "Q1|Q2|Q3|Q4"
# Case 3: contain "interim" and "Q1|Q2|Q3|Q4"
# Case 4: contain patterns like "and|& ... Business Update | Outlook"
AB_halfyear <- 
    AorB %>% 
    filter(event_title %~*% 'half year', event_title %~*% 'Q2')
AB_halfyear %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1    64
AB_fullyear <- 
    AorB %>% 
    filter(event_title %~*% 'full year', event_title %~*% 'Q1|Q2|Q3|Q4')
AB_fullyear %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1   738
AB_interim <- 
    AorB %>% 
    filter(event_title %~*% 'interim', event_title %~*% 'Q1|Q2|Q3|Q4')
AB_interim %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1    49
ab_regex <- "(and|&) (.*)(?:(Business Update|Outlook))"
AB_outlook <- 
    AorB %>% 
    mutate(fisc_qtr_data = regexp_matches(event_title, ab_regex)) %>% 
    select(-fisc_qtr_data)
AB_outlook %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1    42
bothAB <- 
    AB_halfyear %>% 
    union(AB_fullyear) %>% 
    union(AB_interim) %>% 
    union(AB_outlook) %>% 
    compute()
bothAB %>% count() # 891
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1   891
# Only B ----
# Case 1: contain "half year" but not "Q2"
# Case 2: contain "full year" but not "Q1|Q2|Q3|Q4"
# Case 3: contain "interim" but not "Q1|Q2|Q3|Q4"
# Case 4: contain "Business Update|Outlook" but not "and|&"
# Case 5: contain "Preliminary"
onlyB_halfyear <- 
    AorB %>% 
    filter(event_title %~*% 'half year', event_title %!~*% 'Q2')
onlyB_halfyear %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1  3092
onlyB_fullyear <- 
    AorB %>% 
    filter(event_title %~*% 'full year', event_title %!~*% 'Q1|Q2|Q3|Q4')
onlyB_fullyear %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1  7103
onlyB_interim <- 
    AorB %>% 
    filter(event_title %~*% 'interim', event_title %!~*% 'Q1|Q2|Q3|Q4')
onlyB_interim %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1  4547
onlyB_outlook <- 
    AorB %>% 
    filter(event_title %~*% 'business update|outlook', event_title %!~*% 'and|&')
onlyB_outlook %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1    23
onlyB_pre <- 
    AorB %>% 
    filter(event_title %~*% 'preliminary')
onlyB_pre %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1  3828
onlyB <- 
    onlyB_halfyear %>% 
    union(onlyB_fullyear) %>% 
    union(onlyB_interim) %>% 
    union(onlyB_outlook) %>% 
    union(onlyB_pre) %>% 
    anti_join(bothAB) %>% 
    compute()
#> Joining, by = c("file_path", "file_name", "start_date", "event_title", "fiscal_yr")
onlyB %>% count() # 18537
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1 18537

# Only A ----
onlyA <- 
    AorB %>% 
    anti_join(onlyB) %>% 
    anti_join(bothAB) %>% 
    compute()
#> Joining, by = c("file_path", "file_name", "start_date", "event_title", "fiscal_yr")
#> Joining, by = c("file_path", "file_name", "start_date", "event_title", "fiscal_yr")
onlyA %>% count() # 265702
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>        n
#>    <dbl>
#> 1 265702

Created on 2018-08-21 by the reprex package (v0.2.0).

iangow commented 6 years ago

From @jingyujzhang on August 22, 2018 15:15

Questions are: Do we have A ∪ B = event_type==1L calls?

No, diff = 292987 - 285130 = 7857. A ∪ B = 285130,

> calls %>% filter(event_type == 1L) %>% count() # 421897
# Source:   lazy query [?? x 1]
# Database: postgres 9.4.12 [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
       n
   <dbl>
1 292987

Is A ∩ B = ∅? If no to either, how close do we get?

No, but A ∩ B is small. And from the commit above, most of these cases are "Full Year" type.

> bothAB %>% count() # 891
# Source:   lazy query [?? x 1]
# Database: postgres 9.4.12 [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
      n
  <dbl>
1   891

@iangow: So right now, company level linking table se_call_link still contains A and B, I only excluded the incorrect matches which I found out if a company had more than 5 calls in a year. I can then narrow the scope down to onlyA and bothAB (earnings calls (A), not earnings calls (B, e.g., "preliminary / outlook / interim")). Should I keep A ∩ B or exclude this type as well?

iangow commented 6 years ago

From @jingyujzhang on August 24, 2018 15:56

@iangow : Any comment on this issue?

iangow commented 6 years ago

I think you have included cases in B that belong in A. Just because something is "interim" does not mean it is not an earnings announcement. In many parts of the world there are no quarters and "interim" would mean "half-year" (i.e., Q2 when reporting quarterly).

I think the words "outlook" only (i.e., without other indicators of an announcement too) or "preliminary" seem to be the indicators of cases that aren't earnings announcements.

> onlyB %>% 
+     mutate(adjective = regexp_matches(event_title, "^.*?(?=\\s*20\\d{2,})")) %>% 
+     mutate(adjective = unnest(adjective)) %>% 
+     count(adjective) %>% 
+     arrange(desc(n)) %>%
+     print(n=20)
# Source:     lazy query [?? x 2]
# Database:   postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
# Ordered by: desc(n)
   adjective                 n
   <chr>                 <dbl>
 1 Full Year              6985
 2 Interim                4496
 3 Preliminary            3342
 4 Half Year              3045
 5 Preliminary Q4           82
 6 Preliminary Q1           64
 7 Preliminary Q3           54
 8 Preliminary Q2           40
 9 Q4                       25
10 ""                       23
11 Q3                       21
12 Full Year Results        18
13 Preliminary Full Year    16
14 Full year                15
15 Q2                       15
16 Q1                       12
17 Preliminary FY           10
18 Half Year Results        10
19 Interim Results           9
20 Half year                 6
# ... with more rows
iangow commented 6 years ago

OK. One approach to identifying problems is to flag cases with more than one event_type==1 call per permno in a given quarter (there can't be more than 5 calls in a year without at least one quarter being a problem).

I put the results of the following code here. It seems that most such cases are fine, but I'd say that more problem cases are created by bad PERMNO matches than by inclusion of "preliminary" cases.

The way to fix the bad PERMNO matches is to do so here. For example, file_name equal to 515127_T has been manually matched to permno==10104L thanks to "Fix by Nastia/Vincent in January 2015", but it should be matched to permno==10078L. (I fixed this one, so it should go away when crsp_link code is run again.)

Some cases where there are more than one call per quarter are legitimate. For example,

library(dplyr, warn.conflicts = FALSE)
library(DBI)
#> Loading required package: DBI
pg <- dbConnect(RPostgreSQL::PostgreSQL())

rs <- dbExecute(pg, "SET search_path TO streetevents, public")
rs <- dbExecute(pg, "SET work_mem = '10GB'")

selected_calls <- tbl(pg, "selected_calls")
calls <- tbl(pg, "calls")
crsp_link <- tbl(pg, "crsp_link")

selected_calls <-
    calls %>%
    group_by(file_name) %>%
    summarize(last_update = max(last_update, na.rm = TRUE)) %>%
    compute(index = "file_name")

selected_calls %>% filter(file_name == "10784087_T")

problems <-
    crsp_link %>%
    select(-matches("matcH_type")) %>%
    filter(!is.na(permno)) %>%
    inner_join(calls) %>%
    semi_join(selected_calls) %>%
    mutate(start_date = sql("start_date::date")) %>%
    filter(event_type == 1L) %>%
    distinct(file_name, last_update, permno, start_date, event_title) %>%
    mutate(quarter = date_part('quarter', start_date),
           year = date_part('year', start_date)) %>%
    group_by(permno, year, quarter) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    select(file_name, last_update, permno, quarter, year, start_date, event_title) %>%
    arrange(permno, year, quarter) %>%
    compute()

library(readr)   
problems %>%
    collect() %>%
    write_csv("~/Google Drive/data/streetevents/mult_in_qtr.csv")
iangow commented 6 years ago

Hopefully addressing "Classify cases with multiple calls in quarter" will make it easy to address this one too.

iangow commented 6 years ago

Moved issue from executive to here, as it has nothing to do with executive.

jingyujzhang commented 6 years ago

@iangow : Mult_in_qtr is done. Each problematic call can be put into permno, multiple, non_earnings, not_sure and preliminary.

For the permno type, I need to correct them in streetevents.manual_permno_matches.

For the other types, are we going to exclude them when calculating streetevents.crsp_link? If not, I need to add more match_types.

iangow commented 6 years ago

@iangow : Mult_in_qtr is done. Each problematic call can be put into permno, multiple, non_earnings, not_sure and preliminary.

For the permno type, I need to correct them in streetevents.manual_permno_matches.

I already did that.

For the other types, are we going to exclude them when calculating streetevents.crsp_link? If not, I need to add more match_types.

Don't do anything yet.

Make sure to pull the latest code from StreetEvents before committing any more code ... I had some issues with the two repositories (streetevents and streetevents_private) getting out of sync and may have complicated the commit history in fixing this.

iangow commented 6 years ago

I will close this one too. I will create a follow-up issue to handle the "problematic calls" flagged in mult_in_qtr. Let me know if anything is urgently depending on this code.