iangow / streetevents_private


Resolve incorrect matches in se_calls_link #41

Closed. iangow closed this issue 6 years ago.

iangow commented 6 years ago

From @azakolyukina on July 5, 2018 23:49

@jingyujzhang (@iangow):

se_calls_link contains incorrect matches such as those below:

calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link"))

calls %>%
    filter(file_name %in% c('1060901_T', '1097433_T', '1113564_T', '1146661_T', '1155210_T', '1208085_T')) %>%
    select(file_name, start_date, event_title) %>%
    arrange(start_date)
# # Source:     lazy query [?? x 3]
# # Database:   postgres 9.4.12 [aaz@aaz.chicagobooth.edu:5432/postgres]
# # Ordered by: start_date
# file_name start_date          event_title                                        
# <chr>     <dttm>              <chr>                                              
# 1 1060901_T 2005-05-12 04:00:00 Q1 2006 NVIDIA Corporation Earnings Conference Call
# 2 1097433_T 2005-07-25 04:30:00 Q2 2005 PORTALPLAYER INC Earnings Conference Call  
# 3 1113564_T 2005-08-11 04:00:00 Q2 2006 NVIDIA Corporation Earnings Conference Call
# 4 1146661_T 2005-10-24 04:30:00 Q3 2005 PORTALPLAYER INC Earnings Conference Call  
# 5 1155210_T 2005-11-09 04:00:00 Q3 2006 NVIDIA Corporation Earnings Conference Call
# 6 1208085_T 2006-02-16 04:00:00 Q4 2006 NVIDIA Corporation Earnings Conference Call
# > 

se_call_link %>%
    filter(file_name %in% c('1060901_T', '1097433_T', '1113564_T', '1146661_T', '1155210_T', '1208085_T')) %>%
    arrange(start_date)
# # Source:     lazy query [?? x 6]
# # Database:   postgres 9.4.12 [aaz@aaz.chicagobooth.edu:5432/postgres]
# # Ordered by: start_date
# file_name file_path     start_date          permno company_id fy_end    
# <chr>     <chr>         <dttm>               <int>      <int> <date>    
# 1 1060901_T 1060901_T.xml 2005-05-12 04:00:00  86580        865 2005-01-31
# 2 1097433_T 1097433_T.xml 2005-07-25 04:30:00  86580        865 2005-01-31
# 3 1113564_T 1113564_T.xml 2005-08-11 04:00:00  86580        865 2005-01-31
# 4 1146661_T 1146661_T.xml 2005-10-24 04:30:00  86580        865 2005-01-31
# 5 1155210_T 1155210_T.xml 2005-11-09 04:00:00  86580        865 2005-01-31
# 6 1208085_T 1208085_T.xml 2006-02-16 04:00:00  86580        865 2006-01-31

One way to resolve these is to follow the logic in 3bb8473 (@jingyujzhang, we did talk about this with you before).

The idea is that, when there are multiple matches on {company_id, fisc_qtr_data}, we impose a restriction on the company name. fisc_qtr_data can be extracted as in nonans_rate.
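
For example (a sketch, not the nonans_rate code; candidate_links and equilar_name are placeholder names, and the substring test uses PostgreSQL's strpos via dplyr's sql() passthrough):

library(dplyr, warn.conflicts = FALSE)

# candidate_links: one row per candidate {file_name, company_id} match on a
# remote tbl, carrying fisc_qtr_data, event_title and the Equilar company name
resolved <-
    candidate_links %>%
    group_by(company_id, fisc_qtr_data) %>%
    mutate(n_matches = n()) %>%
    ungroup() %>%
    # does the Equilar company name appear in the call title?
    mutate(name_in_title = sql("strpos(lower(event_title), lower(equilar_name)) > 0")) %>%
    # keep unique matches as-is; for multiple matches, require the name to appear
    filter(n_matches == 1L | name_in_title)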

Copied from original issue: iangow/executive#60

iangow commented 6 years ago

From @azakolyukina on July 5, 2018 23:51

@jingyujzhang:

Note that both se_calls_link and se_exec_link have to be updated.

iangow commented 6 years ago

@jingyujzhang

@azakolyukina I'm not sure what "the logic in 3bb8473" is.

Below is some code that may help. It probably makes sense to extract the company name from event_title, manually map distinct company names (along with cusip or some other identifier) to permno, then blow that mapping up into a mapping from file_name to permno and add it to se_call_link.

Please check that these aren't among the manual matches that feed into se_call_link; I'm not sure why they are still there. This is a legacy issue for calls before ~2013 and should not affect more recent calls.

Also we should remove file_path and start_date from se_call_link.
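
Something like this would do it (a sketch; it assumes the connection has ALTER privileges on executive.se_call_link):

library(DBI)

pg <- dbConnect(RPostgreSQL::PostgreSQL())

# Drop the redundant columns; both can be re-joined from streetevents.calls on file_name if needed
dbExecute(pg, "ALTER TABLE executive.se_call_link
               DROP COLUMN file_path, DROP COLUMN start_date")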

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgreSQL::PostgreSQL())

calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link"))

se_link <-
    se_call_link %>%
    select(-file_path, -start_date)

calls_mod <- 
    calls %>%
    filter(event_type == 1L) %>%
    distinct(file_name, cusip, start_date, event_title) %>%
    mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
    mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]"))

dupes <- 
    calls_mod %>%
    inner_join(se_link) %>% 
    group_by(company_id, fisc_qtr_data) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    compute()
#> Joining, by = "file_name"

dupes
#> # Source:   table<ywycxfsbfy> [?? x 8]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#>    file_name cusip  start_date          event_title   fisc_qtr_data permno
#>    <chr>     <chr>  <dttm>              <chr>         <chr>          <int>
#>  1 666954_T  11162… 2002-10-23 19:00:00 Q3 2002 Foun… 20023          86881
#>  2 651583_T  11162… 2002-08-14 18:30:00 Q3 2002 Broc… 20023          86881
#>  3 696314_T  11162… 2003-01-29 21:00:00 Q4 2002 Foun… 20024          86881
#>  4 675859_T  11162… 2002-11-21 20:30:00 Q4 2002 Broc… 20024          86881
#>  5 707175_T  11162… 2003-02-12 20:30:00 Q1 2003 Broc… 20031          86881
#>  6 723669_T  11162… 2003-04-23 19:00:00 Q1 2003 Foun… 20031          86881
#>  7 763437_T  11162… 2003-07-23 19:00:00 Q2 2003 Foun… 20032          86881
#>  8 734682_T  11162… 2003-05-14 19:30:00 Q2 2003 Broc… 20032          86881
#>  9 796184_T  11162… 2003-10-22 19:00:00 Q3 2003 Foun… 20033          86881
#> 10 769964_T  11162… 2003-08-13 19:00:00 Q3 2003 Broc… 20033          86881
#> # ... with more rows, and 2 more variables: company_id <int>,
#> #   fy_end <date>

dupes %>% 
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1  6330

dupes %>% 
    distinct(company_id) %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1   495

dupes %>% 
    group_by(company_id) %>%
    count() %>%
    arrange(desc(n))
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> # Groups:     company_id
#> # Ordered by: desc(n)
#>    company_id     n
#>         <int> <dbl>
#>  1       3897   125
#>  2       2874   114
#>  3       4215   103
#>  4       3970    89
#>  5       4427    72
#>  6       4063    69
#>  7       4241    66
#>  8       4930    62
#>  9       3478    58
#> 10       4604    56
#> # ... with more rows

Created on 2018-07-06 by the reprex package (v0.2.0).
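
For example, the "blow up the mapping" step might look like this (a sketch; name_permno_map is a hypothetical, manually curated data frame of distinct company names and permnos, and the regex only covers plain "Earnings Conference Call" titles):

# Extract the company name from event_title, then join the manual mapping;
# copy = TRUE ships the local name_permno_map into the database for the join.
file_permno_map <-
    calls_mod %>%
    mutate(co_name_arr = regexp_matches(event_title, "^Q[1-4] 20[0-9]{2} (.*) Earnings Conference Call")) %>%
    mutate(event_co_name = sql("co_name_arr[1]")) %>%
    inner_join(name_permno_map, by = "event_co_name", copy = TRUE) %>%
    distinct(file_name, permno)

The resulting file_name-to-permno pairs could then be folded into se_call_link.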

iangow commented 6 years ago

From @azakolyukina on July 6, 2018 0:30

@azakolyukina I'm not sure what "the logic in 3bb8473" is.

To resolve incorrect matches, I compare the company name in Equilar with the company name in event_title. This is how we discovered the mess in StreetEvents tickers! I do NOT use manual mapping in 3bb8473; just take a careful look.

iangow commented 6 years ago

You must be referring to a different commit; I don't see anything about company names in https://github.com/azakolyukina/nonans_product/commit/3bb8473b509256c8179e57c671c30535cd705568.

iangow commented 6 years ago

This is how we discovered the mess in StreetEvents tickers!

But then we did a huge manual-matching exercise and we should use the results of that (none of these cases should be new). I don't know why we still have cases like this.

iangow commented 6 years ago

From @azakolyukina on July 6, 2018 0:51

You must be referring to a different commit, I don't see anything about company names in azakolyukina/nonans_product@3bb8473.

This is very strange. You are right! I am trying to recover my original code and understand how I even managed to lose it.

iangow commented 6 years ago

From @azakolyukina on July 6, 2018 0:52

I don't know why we still have cases like this.

...probably because I have a tendency to find bugs.

iangow commented 6 years ago

OK. Not clear why these bugs weren't found in 2014. Note that these issues need to be fixed in streetevents.crsp_link, not here. Maybe that's where you have the code.

iangow commented 6 years ago

From @azakolyukina on July 6, 2018 0:58

(@jingyujzhang In the above, I am referring to this "logic")

Here, I found it: nonans_product/create_data/create_se_link.R does the comparison with

    mutate(is_aka_match = (call_desc %~*% aka_company_name), 
           is_preliminary = (call_desc %~*% 'preliminary')) %>%
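
(For reference, %~*% is not a base R operator: on a remote tbl, dbplyr translates an unknown infix %op% into the SQL operator op, so the comparison above reaches PostgreSQL as its case-insensitive regex match ~*. A quick way to see the translation; the quoting may differ by dbplyr version:)

library(dbplyr)

translate_sql(call_desc %~*% aka_company_name)
#> <SQL> "call_desc" ~* "aka_company_name"
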
iangow commented 6 years ago

From @azakolyukina on July 6, 2018 1:03

Note that these issues need to be fixed in streetevents.crsp_link, not here.

@jingyujzhang: please follow @iangow's instructions and fix it in crsp_link. I think my approach above will still apply.

iangow commented 6 years ago

You don't need Equilar to see these issues:

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgreSQL::PostgreSQL())

calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
crsp_link <- tbl(pg, sql("SELECT * FROM streetevents.crsp_link"))

se_link <- 
    crsp_link %>%
    filter(!is.na(permno)) %>%
    distinct(file_name, permno)

calls_mod <- 
    calls %>%
    filter(event_type == 1L) %>%
    distinct(file_name, cusip, start_date, event_title) %>%
    mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
    mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]"))

dupes <- 
    calls_mod %>%
    inner_join(se_link) %>% 
    group_by(permno, fisc_qtr_data) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    compute()
#> Joining, by = "file_name"

dupes
#> # Source:   table<vlwustzbke> [?? x 6]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#>    file_name cusip  start_date          event_title   fisc_qtr_data permno
#>    <chr>     <chr>  <dttm>              <chr>         <chr>          <int>
#>  1 562913_T  68389… 2001-12-13 21:30:00 Q2 2002 Orac… 20022          10104
#>  2 646764_T  68389… 2002-07-18 19:00:00 Q2 2002 Peop… 20022          10104
#>  3 645421_T  68389… 2002-07-23 19:00:00 Q2 2002 SeeB… 20022          10104
#>  4 640000_T  68389… 2002-07-17 13:00:00 Q2 2002 Rete… 20022          10104
#>  5 619670_T  68389… 2002-05-22 19:00:00 Q2 2002 J.D.… 20022          10104
#>  6 603385_T  68389… 2002-03-14 21:30:00 Q3 2002 Orac… 20023          10104
#>  7 667639_T  68389… 2002-10-17 19:00:00 Q3 2002 Peop… 20023          10104
#>  8 666050_T  68389… 2002-10-23 19:00:00 Q3 2002 SeeB… 20023          10104
#>  9 696301_T  68389… 2003-01-22 21:00:00 Q4 2002 Peop… 20024          10104
#> 10 695782_T  68389… 2003-01-28 21:00:00 Q4 2002 SeeB… 20024          10104
#> # ... with more rows

dupes %>% 
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1  7002

dupes %>% 
    distinct(permno) %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1   621

dupes %>% 
    group_by(permno) %>%
    count() %>%
    arrange(desc(n))
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> # Groups:     permno
#> # Ordered by: desc(n)
#>    permno     n
#>     <int> <dbl>
#>  1  21178   125
#>  2  85789   114
#>  3  10104   103
#>  4  81061    89
#>  5  50017    72
#>  6  22779    69
#>  7  35051    68
#>  8  75819    62
#>  9  90441    61
#> 10  37584    58
#> # ... with more rows

Created on 2018-07-06 by the reprex package (v0.2.0).

iangow commented 6 years ago

@jingyujzhang

If you can fill out this spreadsheet, then I think it should be easy to use that data to resolve these issues.

iangow commented 6 years ago

Code to create the spreadsheet above:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
library(readr)

pg <- dbConnect(RPostgreSQL::PostgreSQL())
rs <- dbExecute(pg, "SET work_mem = '10GB'")
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
crsp_link <- tbl(pg, sql("SELECT * FROM streetevents.crsp_link"))
stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))

crsp_names <-
    stocknames %>%
    distinct(permno, comnam) %>%
    group_by(permno) %>%
    summarize(comnams = array_agg(comnam)) %>%
    compute()

se_link <- 
    crsp_link %>%
    filter(!is.na(permno)) %>%
    distinct(file_name, permno)
# |Q&A Session
regex <- "(?:Earnings(?: Conference Call)?|Financial and Operating Results|Financial Results Call|"
regex <- paste0(regex, "Results Conference Call|Analyst Meeting)")
regex <- paste0("^(.*) (", regex, ")")

qtr_regex <- "(Preliminary Half Year|Full Year|Q[1-4])"
year_regex <- "(20[0-9]{2}(?:-[0-9]{2}|/20[0-9]{2})?)"
period_regex <- paste0("^", qtr_regex, " ", year_regex," (.*)")
calls_mod <- 
    calls %>%
    filter(event_type == 1L) %>%
    distinct(file_name, cusip, start_date, event_title) %>%
    mutate(fisc_qtr_data = regexp_matches(event_title, period_regex)) %>%
    mutate(event_co_name = sql("fisc_qtr_data[3]")) %>%
    mutate(event_co_name = regexp_matches(event_co_name, regex)) %>%
    mutate(event_co_name = sql("event_co_name[1]"),
           event_desc = sql("event_co_name[2]")) %>%
    mutate(fisc_qtr_data = sql("fisc_qtr_data[1] || ' ' || fisc_qtr_data[2]")) %>%
    compute()

# Check for earnings calls whose event_title doesn't match the regexes above
calls %>%
    filter(event_type == 1L) %>% 
    anti_join(calls_mod, by = "file_name")

dupes <- 
    calls_mod %>%
    inner_join(se_link) %>% 
    group_by(permno, fisc_qtr_data) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    compute()

dupes

dupes %>% 
    count()

dupes %>% 
    distinct(permno) %>%
    count()

dupes %>% 
    group_by(permno) %>%
    count() %>%
    arrange(desc(n))

dupes %>%
    inner_join(crsp_names) %>%
    distinct(event_co_name, permno, comnams) %>%
    arrange(permno) %>%
    collect() %>%
    write_csv("~/Google Drive/data/streetevents/name_checks.csv")
iangow commented 6 years ago

From @jingyujzhang on July 6, 2018 5:44

@iangow: For merges / reorganizations, the old companies are not the same as the new company (i.e. choose FALSE in the google sheet), right?

iangow commented 6 years ago

Unless the name (or an obvious variant thereon) appears in the list of names, then FALSE.

iangow commented 6 years ago

If in doubt, make a comment with +iandgow@gmail.com and I will respond.

iangow commented 6 years ago

From @jingyujzhang on July 6, 2018 7:14

@iangow: I'm done with the google sheet and I've inserted comments as well.

iangow commented 6 years ago

From @azakolyukina on July 25, 2018 19:46

@iangow and @jingyujzhang:

Are you done with this issue?

iangow commented 6 years ago

From @jingyujzhang on July 25, 2018 20:14

Yes, this issue can be closed.

iangow commented 6 years ago

We should not close the issue until we have confirmed that the problem it is intended to solve has gone away. It seems it hasn't. Have we incorporated the data we collected into the relevant table?

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgreSQL::PostgreSQL())

calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link"))

calls_mod <- 
    calls %>%
    filter(event_type == 1L) %>%
    distinct(file_name, cusip, start_date, event_title) %>%
    mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
    mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]"))

dupes <- 
    calls_mod %>%
    inner_join(se_call_link) %>% 
    group_by(company_id, fisc_qtr_data) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    compute()
#> Joining, by = "file_name"

dupes
#> # Source:   table<yhfkxubjvv> [?? x 8]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#>    file_name cusip  start_date          event_title   fisc_qtr_data permno
#>    <chr>     <chr>  <dttm>              <chr>         <chr>          <int>
#>  1 3645877_T 12618… 2011-01-13 13:00:00 Q4 2010 Char… 20104          85972
#>  2 3645877_T 12618… 2011-01-13 13:00:00 Q4 2010 Char… 20104          85972
#>  3 3967370_T 12618… 2011-04-28 11:00:00 Q1 2011 CRA … 20111          85972
#>  4 3967370_T 12618… 2011-04-28 11:00:00 Q1 2011 CRA … 20111          85972
#>  5 4156440_T 12618… 2011-07-28 11:00:00 Q2 2011 CRA … 20112          85972
#>  6 4156440_T 12618… 2011-07-28 11:00:00 Q2 2011 CRA … 20112          85972
#>  7 4218416_T 12618… 2011-10-27 12:00:00 Q3 2011 CRA … 20113          85972
#>  8 4218416_T 12618… 2011-10-27 12:00:00 Q3 2011 CRA … 20113          85972
#>  9 1992773_T 31561… 2008-10-07 23:30:00 Q4 2008 F5 N… 20084          86964
#> 10 1998857_T 31561… 2008-10-22 19:30:00 Q4 2008 F5 N… 20084          86964
#> # ... with more rows, and 2 more variables: company_id <int>,
#> #   fy_end <date>

dupes %>% 
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1  1346

dupes %>% 
    distinct(company_id) %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1   270

dupes %>% 
    group_by(company_id) %>%
    count() %>%
    arrange(desc(n))
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres 9.4.12 [igow@aaz.chicagobooth.edu:5432/postgres]
#> # Groups:     company_id
#> # Ordered by: desc(n)
#>    company_id     n
#>         <int> <dbl>
#>  1       2874   114
#>  2       3478    58
#>  3       4633    54
#>  4      20591    48
#>  5      40600    24
#>  6       3605    22
#>  7       9219    22
#>  8      11241    18
#>  9       5649    16
#> 10      36240    16
#> # ... with more rows

Created on 2018-07-26 by the reprex package (v0.2.0).

iangow commented 6 years ago

It may be that the code for dupes above needs refinement to exclude cases that are not really problems. But in any case, we should confirm that all is OK.

iangow commented 6 years ago

From @azakolyukina on July 26, 2018 16:03

But in any case, we should confirm that all is OK.

@jingyujzhang: Could you please confirm that you have resolved this?

iangow commented 6 years ago

From @jingyujzhang on July 26, 2018 20:30

@iangow (cc: @azakolyukina): The code above is problematic. The streetevents.calls table can't be used directly; we computed streetevents.selected_calls for this before.

> proxy_company %>% 
+     filter(company_id == 2517) %>% 
+     arrange(fy_end)
# Source:     lazy query [?? x 2]
# Database:   postgres 9.4.12 [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
# Ordered by: fy_end
   company_id fy_end    
        <int> <date>    
 1       2517 2000-09-30
 2       2517 2001-09-30
 3       2517 2002-09-30
 4       2517 2002-12-31
 5       2517 2003-12-31
library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgreSQL::PostgreSQL())

calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link")) 
selected_calls <- tbl(pg, sql("select * from public.selected_calls"))

calls_mod <- # file_name, cusip, start_date, event_title, fisc_qtr_data
    selected_calls %>%
    inner_join(calls) %>% 
    group_by(file_name) %>% 
    filter(last_update == max(last_update, na.rm = TRUE)) %>% 
    ungroup() %>% 
    distinct(file_name, cusip, start_date, event_title) %>%
    mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
    mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]")) %>% 
    compute()
#> Joining, by = c("file_name", "file_path")

dupes <- 
    calls_mod %>%
    inner_join(se_call_link) %>% # by = "file_name"
    group_by(company_id, fisc_qtr_data, file_name) %>%  # <- Add file_name here 
    filter(n() > 1) %>%
    ungroup() %>%
    arrange(file_name, fisc_qtr_data, cusip, start_date, fy_end) %>% 
    select(file_name, fisc_qtr_data, cusip, start_date, fy_end, everything()) %>% 
    compute()
#> Joining, by = "file_name"

dupes %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#>       n
#>   <dbl>
#> 1   336

Created on 2018-07-26 by the reprex package (v0.2.0).

iangow commented 6 years ago

@jingyujzhang

I made some comments on the commit.

We should edit the "test code" to use the correct table. Also, it should be pretty easy to address issues created by short fiscal years. We should aim to eliminate unidentified issues and then close the issue. (These cases slipped through the cracks back in late 2014!)
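
For example, one way to handle the short-fiscal-year cases (a sketch, using calls_mod and se_call_link as defined above; not tested): a call linked to the same company under two fiscal year-ends, as happens around a fiscal-year change, is not a mismatch, so collapse those rows before testing for duplicates.

dupes_refined <-
    calls_mod %>%
    inner_join(se_call_link, by = "file_name") %>%
    # collapse rows that differ only in fy_end (same call, same company)
    distinct(company_id, fisc_qtr_data, file_name) %>%
    group_by(company_id, fisc_qtr_data) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    compute()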

iangow commented 6 years ago

Prerequisite for https://github.com/azakolyukina/nonans_product/issues/90

iangow commented 6 years ago

From @jingyujzhang on July 29, 2018 9:41

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgreSQL::PostgreSQL())

calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
se_call_link <- tbl(pg, sql("SELECT * FROM executive.se_call_link")) 
selected_calls <- tbl(pg, sql("select * from streetevents.selected_calls"))

calls_mod <- # file_name, cusip, start_date, event_title, fisc_qtr_data
    selected_calls %>%
    inner_join(calls) %>% 
    group_by(file_name) %>% 
    filter(last_update == max(last_update, na.rm = TRUE)) %>% 
    ungroup() %>% 
    distinct(file_name, cusip, start_date, event_title) %>%
    mutate(fisc_qtr_data = regexp_matches(event_title, "^Q([1-4]) (20[0-9]{2})")) %>%
    mutate(fisc_qtr_data = sql("fisc_qtr_data[2] || fisc_qtr_data[1]")) %>% 
    compute()
#> Joining, by = c("file_name", "file_path")

dupes <- 
    calls_mod %>%
    inner_join(se_call_link) %>% # by = "file_name"
    group_by(company_id, fisc_qtr_data, file_name) %>%  # <- Add file_name here 
    filter(n() > 1) %>%
    ungroup() %>%
    arrange(file_name, fisc_qtr_data, cusip, start_date, fy_end) %>% 
    select(file_name, fisc_qtr_data, cusip, start_date, fy_end, everything()) %>% 
    compute()
#> Joining, by = "file_name"

dupes
#> # Source:     lazy query [?? x 8]
#> # Database:   postgres 9.4.12
#> #   [jingyuzhang@aaz.chicagobooth.edu:5432/postgres]
#> # Ordered by: file_name, fisc_qtr_data, cusip, start_date, fy_end

Created on 2018-07-29 by the reprex package (v0.2.0).

iangow commented 6 years ago

Why public.selected_calls? Is this not in StreetEvents?

iangow commented 6 years ago

From @jingyujzhang on July 29, 2018 9:48

I updated the last comment. selected_calls is in streetevents.

se_call_link, se_exec_link, crsp_link, selected_calls are up to date.

iangow commented 6 years ago

@jingyujzhang

@azakolyukina closed https://github.com/azakolyukina/nonans_product/issues/90 but not this issue. I think that issue depends on this one, which is still open.

iangow commented 6 years ago

From @azakolyukina on July 29, 2018 22:51

@iangow:

@azakolyukina closed azakolyukina/nonans_product#90 but not this issue. I think that issue depends on this one, which is still open.

I have not found new (obvious) glitches for https://github.com/azakolyukina/nonans_product/issues/90. However, the se_call_link table can still be improved based on a glitch that surfaced and was corrected for https://github.com/azakolyukina/nonans_product/issues/90.

iangow commented 6 years ago

@jingyujzhang Can you make sense of this issue? I am not sure what this is about. I don't know what the "glitch that surfaced … and was corrected for azakolyukina/nonans_product#90" is.

jingyujzhang commented 6 years ago

@iangow: It's related to the consistency checks Nastia did there: make sure there is no more than 1 call per fisc_qtr_data, no more than 5 calls per fiscal year (manually checking cases with 5 calls to see whether the extra calls make sense), each file_name matches 1 company_id, etc.

Previously, se_call_link didn't pass the consistency checks, but I modified it. The se_link table is based on the modified version of se_call_link and has been finalized. Even if se_call_link is improved further in the future, se_link won't change.
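
Roughly, those checks look like this (a sketch; link, the column choices, and the thresholds are illustrative, with calls_mod and se_call_link as defined above):

link <-
    calls_mod %>%
    inner_join(se_call_link, by = "file_name")

# (1) No more than one call per {company_id, fisc_qtr_data}
link %>%
    group_by(company_id, fisc_qtr_data) %>%
    filter(n() > 1) %>%
    ungroup()

# (2) No more than five calls per {company_id, fy_end}; fiscal years with five
#     linked calls get a manual look, more than five is treated as an error
link %>%
    group_by(company_id, fy_end) %>%
    filter(n() >= 5) %>%
    ungroup()

# (3) Each file_name should map to a single company_id
link %>%
    distinct(file_name, company_id) %>%
    group_by(file_name) %>%
    filter(n() > 1) %>%
    ungroup()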

iangow commented 6 years ago

There is no se_link or se_calls_link table here. Where are these tables? Where is the code that creates them?

iangow commented 6 years ago

In executive, I see these:

postgres=# \dt+ executive.se_*
                                                                 List of relations
  Schema   |       Name       | Type  |   Owner   |  Size   |                                      Description                                      
-----------+------------------+-------+-----------+---------+---------------------------------------------------------------------------------------
 executive | se_call_link     | table | executive | 7080 kB | CREATED USING create_se_call_link.R ON 2018-08-15 10:46:14
 executive | se_call_link_old | table | executive | 7464 kB | CREATED USING create_se_call_link.R ON 2018-07-29 18:08:52
 executive | se_exec_link     | table | executive | 25 MB   | CREATED USING create_se_exec_link.R IN GitHub iangow/executive ON 2018-08-15 11:25:39
 executive | se_exec_link_old | table | executive | 27 MB   | CREATED USING create_se_exec_link.R IN GitHub iangow/executive ON 2018-07-29 18:13:02
(4 rows)

But no se_link. If we are not using the _old tables, we should probably delete them.

jingyujzhang commented 6 years ago

nonans_product.se_link.

executive.se_call_link.

The old tables were there for comparison in case I missed anything. I have now deleted them.

iangow commented 6 years ago

OK. I will close this issue. I will revisit executive.se_call_link at a later date.