iangow / streetevents_private

Identify issues in StreetEvents-CRSP links. #1

Closed iangow closed 7 years ago

iangow commented 8 years ago

@rudyardrichter:

(CC @azakolyukina)

Do you have an account (e.g., Gmail address) through which I can share a Google Sheets document?

-Ian

iangow commented 8 years ago

Another illustration. First identify some problematic cases:

> crsp_link %>% 
+     inner_join(permcos) %>% 
+     inner_join(company_link %>% inner_join(permcos), by="file_name") %>%
+     filter(permco.x != permco.y) %>%
+     collect() %>%
+     anti_join(investigation)
Joining, by = "permno"
Joining, by = "permno"
Joining, by = "file_name"
# A tibble: 10 x 7
   file_name permno.x match_type                                                 match_type_desc permco.x permno.y permco.y
       <chr>    <int>      <int>                                                           <chr>    <int>    <int>    <int>
1  3930276_T    81677          5 5. Match on ticker and exact name Soundex between company dates    30913    82107     2253
2  4150640_T    81677          5 5. Match on ticker and exact name Soundex between company dates    30913    82107     2253
3  4212178_T    81677          5 5. Match on ticker and exact name Soundex between company dates    30913    82107     2253
4  3952476_T    48389          7      7. Match ticker & fuzzy name Soundex between company dates    21786    80912    27333
5  4153751_T    48389          7      7. Match ticker & fuzzy name Soundex between company dates    21786    80912    27333
6  4212164_T    48389          7      7. Match ticker & fuzzy name Soundex between company dates    21786    80912    27333
7  4723832_T    48389          7      7. Match ticker & fuzzy name Soundex between company dates    21786    80912    27333
8  1670264_T    12785          2                  2. Roll matches back & forward in StreetEvents    53766    79477     5981
9   611752_T    93179          2                  2. Roll matches back & forward in StreetEvents    53305    70965    21407
10 1294427_T    19828          7      7. Match ticker & fuzzy name Soundex between company dates    20550    63060     4961
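
For anyone following along without database access, here is a minimal sketch of what the query above is doing, run on toy in-memory tables. The `_toy` tables and the file names `X_T` and `Y_T` are made up for illustration; only the shape of the joins mirrors the real query. The comparison is on permco rather than permno because one firm (permco) can have several securities (permno), so two different permnos under the same permco are not necessarily a conflict.

```r
library(dplyr)

# Toy stand-ins for the two link tables and the permno -> permco lookup.
# File names and the third permno/permco pair are hypothetical.
crsp_link_toy    <- tibble(file_name = c("X_T", "Y_T"),
                           permno    = c(81677L, 10001L))
company_link_toy <- tibble(file_name = c("X_T", "Y_T"),
                           permno    = c(82107L, 10001L))
permcos_toy      <- tibble(permno = c(81677L, 82107L, 10001L),
                           permco = c(30913L, 2253L, 500L))

# Attach a permco to each link, line the two links up by call,
# and keep calls where the two links point at different firms.
mismatches <- crsp_link_toy %>%
  inner_join(permcos_toy, by = "permno") %>%
  inner_join(company_link_toy %>% inner_join(permcos_toy, by = "permno"),
             by = "file_name", suffix = c(".crsp", ".comp")) %>%
  filter(permco.crsp != permco.comp)

print(mismatches)
```

Using explicit `suffix` values instead of the default `.x`/`.y` makes it easier to see which table each permno came from.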

Let's take the first one. This does not appear to be a case of a name change:

> calls %>% filter(file_name=='3930276_T') %>% select(call_desc, call_date, ticker, co_name)
Source:   query [?? x 4]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

                                                                  call_desc           call_date ticker                                  co_name
                                                                      <chr>              <time>  <chr>                                    <chr>
1 Q1 2011 Westinghouse Air Brake Technologies Corp Earnings Conference Call 2011-04-26 14:00:00    WAB Westinghouse Air Brake Technologies Corp

But it does seem that the permno on crsp_link (82107) is wrong:

> stocknames %>% filter(permno==81677L)
Source:   query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

  permno permco     namedt  nameenddt    cusip   ncusip ticker                        comnam hexcd exchcd siccd shrcd shrcls    st_date   end_date namedum
   <int>  <int>     <date>     <date>    <chr>    <chr>  <chr>                         <chr> <dbl>  <dbl> <dbl> <dbl>  <chr>     <date>     <date>   <dbl>
1  81677  30913 1995-06-16 2000-05-01 92974010 96038610    WAB WESTINGHOUSE AIR BRAKE CO NEW     1      1  3743    11   <NA> 1995-06-30 2014-06-30       2
2  81677  30913 2000-05-02 2014-06-30 92974010 92974010    WAB                   WABTEC CORP     1      1  3743    11   <NA> 1995-06-30 2014-06-30       2
> stocknames %>% filter(permno==82107L)
Source:   query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

  permno permco     namedt  nameenddt    cusip   ncusip ticker                      comnam hexcd exchcd siccd shrcd shrcls    st_date   end_date namedum
   <int>  <int>     <date>     <date>    <chr>    <chr>  <chr>                       <chr> <dbl>  <dbl> <dbl> <dbl>  <chr>     <date>     <date>   <dbl>
1  82107   2253 1976-10-15 1983-06-30 95709010 45384010   IBCX INDEPENDENT BANKSHARES CORP     3      3     0    11   <NA> 1976-10-29 2014-06-30       2
2  82107   2253 1983-07-01 1987-01-08 95709010 95709010   WSAM  WESTAMERICA BANCORPORATION     3      3  6711    11   <NA> 1976-10-29 2014-06-30       2
3  82107   2253 1987-01-09 1994-04-25 95709010 95709010    WAB  WESTAMERICA BANCORPORATION     3      2  6025    11   <NA> 1976-10-29 2014-06-30       2
4  82107   2253 1994-04-26 2014-06-30 95709010 95709010   WABC  WESTAMERICA BANCORPORATION     3      3  6060    11   <NA> 1976-10-29 2014-06-30       2
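
A rough way to make this check less manual is to ask which CRSP name records carried the call's ticker on the call date. The sketch below does that on toy copies of the rows shown above (the `_toy` tables are stand-ins; the real code would query `calls` and `stocknames`). It is only an illustration of the idea, not the matching code in this repository.

```r
library(dplyr)

# The call from the listing above.
calls_toy <- tibble(
  file_name = "3930276_T",
  ticker    = "WAB",
  call_date = as.Date("2011-04-26")
)

# Selected stocknames rows for the two competing permnos.
stocknames_toy <- tribble(
  ~permno, ~permco, ~namedt,      ~nameenddt,   ~ticker, ~comnam,
  81677L,  30913L,  "2000-05-02", "2014-06-30", "WAB",   "WABTEC CORP",
  82107L,  2253L,   "1987-01-09", "1994-04-25", "WAB",   "WESTAMERICA BANCORPORATION",
  82107L,  2253L,   "1994-04-26", "2014-06-30", "WABC",  "WESTAMERICA BANCORPORATION"
) %>%
  mutate(namedt = as.Date(namedt), nameenddt = as.Date(nameenddt))

# Keep only name records whose ticker matches the call's ticker
# and whose validity window contains the call date.
candidates <- calls_toy %>%
  inner_join(stocknames_toy, by = "ticker") %>%
  filter(call_date >= namedt, call_date <= nameenddt) %>%
  select(file_name, permno, permco, comnam)

print(candidates)
```

On these rows only permno 81677 survives, since 82107 used the ticker WAB from 1987 to 1994 and traded as WABC by the time of the call.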

One way to fix the issues is to get the file_name values and, with a little clean-up in Vim, dump them into the manual match sheet with the correct permno:

> crsp_link %>% filter(permno==82107L) %>% inner_join(calls) %>% select(file_name, co_name) %>% print(n=100)
Joining, by = "file_name"
Source:   query [?? x 2]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

   file_name                                  co_name
       <chr>                                    <chr>
1  1009827_T Westinghouse Air Brake Technologies Corp
2  1051816_T Westinghouse Air Brake Technologies Corp
3  1096519_T Westinghouse Air Brake Technologies Corp
4  1151609_T Westinghouse Air Brake Technologies Corp
5  1206648_T Westinghouse Air Brake Technologies Corp
6  1269221_T Westinghouse Air Brake Technologies Corp
7  1351220_T Westinghouse Air Brake Technologies Corp
8  1396331_T Westinghouse Air Brake Technologies Corp
9  1477526_T Westinghouse Air Brake Technologies Corp
10 1525059_T Westinghouse Air Brake Technologies Corp
11 1599942_T Westinghouse Air Brake Technologies Corp
12 1633991_T Westinghouse Air Brake Technologies Corp
13 1747872_T Westinghouse Air Brake Technologies Corp
14 1817672_T Westinghouse Air Brake Technologies Corp
15 1889049_T Westinghouse Air Brake Technologies Corp
16 1984698_T Westinghouse Air Brake Technologies Corp
17 2075711_T Westinghouse Air Brake Technologies Corp
18 2151105_T Westinghouse Air Brake Technologies Corp
19 2301735_T Westinghouse Air Brake Technologies Corp
20 2467790_T Westinghouse Air Brake Technologies Corp
21 2767534_T Westinghouse Air Brake Technologies Corp
22 3029582_T Westinghouse Air Brake Technologies Corp
23 3215279_T Westinghouse Air Brake Technologies Corp
24 3431761_T Westinghouse Air Brake Technologies Corp
25 3733170_T Westinghouse Air Brake Technologies Corp
26 3930276_T Westinghouse Air Brake Technologies Corp
27 4150640_T Westinghouse Air Brake Technologies Corp
28 4212178_T Westinghouse Air Brake Technologies Corp
29 4729068_T Westinghouse Air Brake Technologies Corp
30 4785250_T Westinghouse Air Brake Technologies Corp
31 4863167_T Westinghouse Air Brake Technologies Corp
32 4926043_T Westinghouse Air Brake Technologies Corp
33 5011231_T Westinghouse Air Brake Technologies Corp
34 5058074_T Westinghouse Air Brake Technologies Corp
35 5125362_T Westinghouse Air Brake Technologies Corp
36 5198092_T Westinghouse Air Brake Technologies Corp
37 5282746_T Westinghouse Air Brake Technologies Corp
38 5343513_T Westinghouse Air Brake Technologies Corp
39 5434590_T Westinghouse Air Brake Technologies Corp
40 5507083_T Westinghouse Air Brake Technologies Corp
41 5616495_T Westinghouse Air Brake Technologies Corp
42 5679876_T Westinghouse Air Brake Technologies Corp
43 5766371_T Westinghouse Air Brake Technologies Corp
44 5830846_T Westinghouse Air Brake Technologies Corp
45 5910159_T Westinghouse Air Brake Technologies Corp
46 5983177_T Westinghouse Air Brake Technologies Corp
47 5983177_T Westinghouse Air Brake Technologies Corp
48  617198_T Westinghouse Air Brake Technologies Corp
49  641477_T Westinghouse Air Brake Technologies Corp
50  661789_T Westinghouse Air Brake Technologies Corp
51  689053_T Westinghouse Air Brake Technologies Corp
52  710788_T Westinghouse Air Brake Technologies Corp
53  731219_T Westinghouse Air Brake Technologies Corp
54  763408_T Westinghouse Air Brake Technologies Corp
55  796167_T Westinghouse Air Brake Technologies Corp
56  840472_T               Westamerica Bancorporation
57  845360_T Westinghouse Air Brake Technologies Corp
58  875906_T Westinghouse Air Brake Technologies Corp
59  917665_T Westinghouse Air Brake Technologies Corp
60  952268_T Westinghouse Air Brake Technologies Corp

The problem I see here is that I have no idea how Vincent (prior RA) identified and fixed these cases.
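
To skip the Vim clean-up step, the rows for the manual match sheet could be built directly in R. This is a sketch on toy data, assuming the sheet takes `file_name`, `permno`, and `co_name` columns; the `_toy` tables and the `bad_permno`/`good_permno` names are illustrative only. Note the filter on `co_name`: row 56 in the listing above really is Westamerica, so it should keep permno 82107.

```r
library(dplyr)

# A few rows from the listing above, as toy stand-ins for crsp_link and calls.
crsp_link_toy <- tibble(
  file_name = c("3930276_T", "4150640_T", "840472_T"),
  permno    = c(82107L, 82107L, 82107L)
)
calls_toy <- tibble(
  file_name = c("3930276_T", "4150640_T", "840472_T"),
  co_name   = c("Westinghouse Air Brake Technologies Corp",
                "Westinghouse Air Brake Technologies Corp",
                "Westamerica Bancorporation")
)

bad_permno  <- 82107L   # permno currently attached to the calls
good_permno <- 81677L   # permno confirmed against stocknames

manual_rows <- crsp_link_toy %>%
  filter(permno == bad_permno) %>%
  inner_join(calls_toy, by = "file_name") %>%
  # Leave calls that really belong to the other firm untouched.
  filter(grepl("^Westinghouse", co_name)) %>%
  transmute(file_name, permno = good_permno, co_name)

# Tab-separated text pastes cleanly into a spreadsheet.
write.table(manual_rows, sep = "\t", quote = FALSE, row.names = FALSE)
```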

iangow commented 8 years ago

OK. Let's do one more. Here the PERMNO should be 80912.

> crsp_link %>% 
+     inner_join(permcos) %>% 
+     inner_join(company_link %>% inner_join(permcos), by="file_name") %>%
+     filter(permco.x != permco.y) %>%
+     collect() %>%
+     anti_join(investigation)
Joining, by = "permno"
Joining, by = "permno"
Joining, by = "file_name"
# A tibble: 10 x 7
   file_name permno.x match_type                                            match_type_desc permco.x permno.y permco.y
       <chr>    <dbl>      <int>                                                      <chr>    <dbl>    <dbl>    <dbl>
1  3930276_T    81677          0                                          0. Manual matches    30913    82107     2253
2  4150640_T    81677          0                                          0. Manual matches    30913    82107     2253
3  4212178_T    81677          0                                          0. Manual matches    30913    82107     2253
4  3952476_T    48389          7 7. Match ticker & fuzzy name Soundex between company dates    21786    80912    27333
5  4153751_T    48389          7 7. Match ticker & fuzzy name Soundex between company dates    21786    80912    27333
6  4212164_T    48389          7 7. Match ticker & fuzzy name Soundex between company dates    21786    80912    27333
7  4723832_T    48389          7 7. Match ticker & fuzzy name Soundex between company dates    21786    80912    27333
8  1670264_T    12785          2             2. Roll matches back & forward in StreetEvents    53766    79477     5981
9   611752_T    93179          2             2. Roll matches back & forward in StreetEvents    53305    70965    21407
10 1294427_T    19828          7 7. Match ticker & fuzzy name Soundex between company dates    20550    63060     4961
> calls %>% filter(file_name=='3952476_T') %>% select(call_desc, call_date, ticker, co_name)
Source:   query [?? x 4]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

                                             call_desc           call_date ticker            co_name
                                                 <chr>              <time>  <chr>              <chr>
1 Q1 2011 TGC Industries  Inc Earnings Conference Call 2011-05-02 13:30:00    TGE TGC Industries Inc
> stocknames %>% filter(permno==48389L)
Source:   query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

  permno permco     namedt  nameenddt    cusip   ncusip ticker                   comnam hexcd exchcd siccd shrcd shrcls    st_date   end_date namedum
   <dbl>  <dbl>     <date>     <date>    <chr>    <chr>  <chr>                    <chr> <dbl>  <dbl> <dbl> <dbl>  <chr>     <date>     <date>   <dbl>
1  48389  21786 1969-05-21 1979-05-09 90311910 89881310    TGE TUCSON GAS & ELECTRIC CO     1      1  4911    11   <NA> 1969-05-29 2014-08-29       2
2  48389  21786 1979-05-10 1996-05-19 90311910 89881310    TEP TUCSON ELECTRIC POWER CO     1      1  4911    11   <NA> 1969-05-29 2014-08-29       2
3  48389  21786 1996-05-20 1998-01-01 90311910 89881370    TEP TUCSON ELECTRIC POWER CO     1      1  4911    11   <NA> 1969-05-29 2014-08-29       2
4  48389  21786 1998-01-02 2012-05-13 90311910 90920510    UNS    UNISOURCE ENERGY CORP     1      1  4911    11   <NA> 1969-05-29 2014-08-29       2
5  48389  21786 2012-05-14 2014-08-15 90311910 90311910    UNS        U N S ENERGY CORP     1      1  4911    11   <NA> 1969-05-29 2014-08-29       2
> stocknames %>% filter(permno==80912L)
Source:   query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

  permno permco     namedt  nameenddt    cusip   ncusip ticker                    comnam hexcd exchcd siccd shrcd shrcls    st_date   end_date namedum
   <dbl>  <dbl>     <date>     <date>    <chr>    <chr>  <chr>                     <chr> <dbl>  <dbl> <dbl> <dbl>  <chr>     <date>     <date>   <dbl>
1  80912  27333 1994-09-22 1998-11-08 23936010 87241710   TGCI      T G C INDUSTRIES INC     3      3  2670    11   <NA> 1994-09-30 2016-06-30       2
2  80912  27333 1998-11-09 1998-11-26 23936010 87241730   TGDC      T G C INDUSTRIES INC     3      3  2670    11   <NA> 1994-09-30 2016-06-30       2
3  80912  27333 1998-11-27 2002-05-23 23936010 87241730   TGCI      T G C INDUSTRIES INC     3      3  2670    11   <NA> 1994-09-30 2016-06-30       2
4  80912  27333 2002-05-24 2005-04-17 23936010 87241730   <NA>      T G C INDUSTRIES INC     3      0  2670    11   <NA> 1994-09-30 2016-06-30       2
5  80912  27333 2005-04-18 2007-11-05 23936010 87241730    TGE      T G C INDUSTRIES INC     3      2  1382    11   <NA> 1994-09-30 2016-06-30       2
6  80912  27333 2007-11-06 2015-02-11 23936010 87241730    TGE      T G C INDUSTRIES INC     3      3  1382    11   <NA> 1994-09-30 2016-06-30       2
7  80912  27333 2015-02-12 2016-06-30 23936010 23936010   DWSN DAWSON GEOPHYSICAL CO NEW     3      3  1382    11   <NA> 1994-09-30 2016-06-30       2

So, again I just add rows to the manual match spreadsheet:

> crsp_link %>% filter(permno==48389L) %>% inner_join(calls) %>% filter(co_name ~ 'TGC') %>% select(file_name, co_name) %>% print(n=100)
Joining, by = "file_name"
Source:   query [?? x 2]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

   file_name            co_name
       <chr>              <chr>
1  1357818_T TGC Industries Inc
2  1400594_T TGC Industries Inc
3  1467719_T TGC Industries Inc
4  1530148_T TGC Industries Inc
5  1598763_T TGC Industries Inc
6  1666748_T TGC Industries Inc
7  1759897_T TGC Industries Inc
8  1819060_T TGC Industries Inc
9  1896145_T TGC Industries Inc
10 1996881_T TGC Industries Inc
11 2089129_T TGC Industries Inc
12 2161614_T TGC Industries Inc
13 2315390_T TGC Industries Inc
14 2498521_T TGC Industries Inc
15 2733525_T TGC Industries Inc
16 3019959_T TGC Industries Inc
17 3225171_T TGC Industries Inc
18 3431032_T TGC Industries Inc
19 3736405_T TGC Industries Inc
20 3914581_T TGC Industries Inc
21 3952476_T TGC Industries Inc
22 4153751_T TGC Industries Inc
23 4212164_T TGC Industries Inc
24 4690883_T TGC Industries Inc
25 4723832_T TGC Industries Inc
26 4783332_T TGC Industries Inc
27 4837288_T TGC Industries Inc
28 4859917_T TGC Industries Inc
29 4920985_T TGC Industries Inc
30 4997824_T TGC Industries Inc
31 5054918_T TGC Industries Inc
32 5116548_T TGC Industries Inc
33 5127691_T TGC Industries Inc
34 5198142_T TGC Industries Inc
35 5285673_T TGC Industries Inc
> source('db_matches/import_manual_permno_matches.R')
Sheet successfully identified: "streetevents.manual_permno_matches"
Accessing worksheet titled 'manual_permno_matches'.
Downloading: 29 kB      
No encoding supplied: defaulting to UTF-8.
> system("psql -f db_matches/crsp_link.sql")
DROP TABLE
SELECT 275005
ALTER TABLE
CREATE INDEX

iangow commented 8 years ago

So these would be additional cases to investigate (after working through the "investigation" sheet) along the lines of what I did for "Westinghouse Air Brake Technologies Corp" and "TGC Industries Inc" above:

> crsp_link %>% 
+     inner_join(permcos) %>% 
+     inner_join(company_link %>% inner_join(permcos), by="file_name") %>%
+     filter(permco.x != permco.y) %>%
+     collect() %>%
+     anti_join(investigation) %>%
+     filter(match_type !=0)
Joining, by = "permno"
Joining, by = "permno"
Joining, by = "file_name"
# A tibble: 3 x 7
  file_name permno.x match_type                                            match_type_desc permco.x permno.y permco.y
      <chr>    <dbl>      <int>                                                      <chr>    <dbl>    <dbl>    <dbl>
1  611752_T    93179          2             2. Roll matches back & forward in StreetEvents    53305    70965    21407
2 1294427_T    19828          7 7. Match ticker & fuzzy name Soundex between company dates    20550    63060     4961
3 1670264_T    12785          2             2. Roll matches back & forward in StreetEvents    53766    79477     5981

iangow commented 8 years ago

We need to resolve these cases in some way. If the matched permno is correct, set investigate to FALSE, investigated to TRUE, and add a comment. Otherwise, perhaps add to manual matches (and set investigate to FALSE, investigated to TRUE).
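
As a sketch of that bookkeeping, assuming the investigation sheet has `file_name`, `investigate`, `investigated`, and `comment` columns (the layout and the file names below are hypothetical, as is the helper `resolve_case`):

```r
library(dplyr)

# Hypothetical layout of the investigation sheet.
investigation_toy <- tibble(
  file_name    = c("111_T", "222_T"),
  investigate  = c(TRUE, TRUE),
  investigated = c(FALSE, FALSE),
  comment      = c(NA_character_, NA_character_)
)

# Mark one case as resolved and record why; other rows are left as they were.
resolve_case <- function(sheet, file, note) {
  sheet %>%
    mutate(
      hit          = file_name == file,
      investigate  = if_else(hit, FALSE, investigate),
      investigated = if_else(hit, TRUE, investigated),
      comment      = if_else(hit, note, comment)
    ) %>%
    select(-hit)
}

resolved <- resolve_case(investigation_toy, "111_T",
                         "Same firm; name changed over time")
print(resolved)
```

In practice the edits happen in the Google Sheet itself; the function just spells out which fields change in each outcome.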

rudyardrichter commented 8 years ago

Regarding the investigation cases in name_matches.csv: I still do not understand (either for Type 1 or Type 2 cases) what it means to verify that the permno is correct. All of the Type 1 cases have the same permno between crsp_link and company_link. Should I simply check the company name in stocknames corresponding to the shared permno to see if it matches the co_name, or the call_co_name, or check the ticker? I am just looking for a procedural way to verify the permno for a given case.

After looking through the Type 1 cases, it seems that in virtually all of them, according to stocknames, the same permno covers two or more different company names that changed over time (with matching namedt/nameenddt), and these different names produce the discrepancy between co_name and call_co_name. If it is simply a matter of the shared permno describing the same company in stocknames, then I will be able to go through and set investigate=FALSE, investigated=TRUE for all the Type 1 cases.

I have also started working on some of the extra cases to investigate (generated as below) by adding the correct permno to the manual match sheet. It might be helpful if you could check one of them to make sure that I have the right idea about how to handle these cases. For example:

> crsp_link %>%
+     inner_join(permcos) %>%
+     inner_join(company_link %>% inner_join(permcos), by="file_name",
+                suffix=c(".crsp", ".comp")) %>%
+     filter(permco.crsp != permco.comp) %>%
+     collect() %>%
+     anti_join(investigation) %>%
+     filter(match_type !=0)
Joining, by = "permno"
Joining, by = "permno"
Joining, by = "file_name"
# A tibble: 3 × 7
  file_name permno.crsp match_type                                            match_type_desc permco.crsp permno.comp permco.comp
      <chr>       <dbl>      <int>                                                      <chr>       <dbl>       <dbl>       <dbl>
1 1670264_T       12785          2             2. Roll matches back & forward in StreetEvents       53766       79477        5981
2  611752_T       93179          2             2. Roll matches back & forward in StreetEvents       53305       70965       21407
3 1294427_T       19828          7 7. Match ticker & fuzzy name Soundex between company dates       20550       63060        4961
> calls %>% filter(file_name=='1670264_T')
Source:   query [?? x 9]
Database: postgres 9.4.2 [rudyardrichter@aaz.chicagobooth.edu:5432/postgres]

                                                             file_path file_name ticker                 co_name                                           call_desc           call_date      city call_type         last_update
                                                                 <chr>     <chr>  <chr>                   <chr>                                               <chr>              <dttm>     <chr>     <int>              <dttm>
1 StreetEvents_historical_backfill_through_May2013/dir_1/1670264_T.xml 1670264_T    UAM Universal American Corp Q3 2007 Universal American Earnings Conference Call 2007-11-02 13:30:00 Rye Brook         1 2007-11-02 14:47:30
> stocknames %>% filter(permno==12785)
Source:   query [?? x 16]
Database: postgres 9.4.2 [rudyardrichter@aaz.chicagobooth.edu:5432/postgres]

  permno permco     namedt  nameenddt    cusip   ncusip ticker                      comnam hexcd exchcd siccd shrcd shrcls    st_date   end_date namedum
   <dbl>  <dbl>     <date>     <date>    <chr>    <chr>  <chr>                       <chr> <dbl>  <dbl> <dbl> <dbl>  <chr>     <date>     <date>   <dbl>
1  12785  53766 2011-05-02 2016-06-30 91338E10 91338E10    UAM UNIVERSAL AMERICAN CORP NEW     1      1  6311    11   <NA> 2011-05-31 2016-06-30       2
> stocknames %>% filter(permno==79477)
Source:   query [?? x 16]
Database: postgres 9.4.2 [rudyardrichter@aaz.chicagobooth.edu:5432/postgres]

  permno permco     namedt  nameenddt    cusip   ncusip ticker                          comnam hexcd exchcd siccd shrcd shrcls    st_date   end_date namedum
   <dbl>  <dbl>     <date>     <date>    <chr>    <chr>  <chr>                           <chr> <dbl>  <dbl> <dbl> <dbl>  <chr>     <date>     <date>   <dbl>
1  79477   5981 1983-05-12 1996-07-22 91337710 91359010   UHCO          UNIVERSAL HOLDING CORP     1      3  6310    11   <NA> 1983-05-31 2011-04-29       2
2  79477   5981 1996-07-23 2007-12-02 91337710 91337710   UHCO UNIVERSAL AMERICAN FINANCIAL CO     1      3  6310    11   <NA> 1983-05-31 2011-04-29       2
3  79477   5981 2007-12-03 2011-04-29 91337710 91337710    UAM         UNIVERSAL AMERICAN CORP     1      1  6311    11   <NA> 1983-05-31 2011-04-29       2

Here I have added file_name=1670264_T, permno=79477, co_name=Universal American Financial Corp to the manual matches.

iangow commented 8 years ago

It seems notifications of updates here have been going into my junk folder for the last five weeks. Let me look at these and get back to you.

iangow commented 8 years ago

Regarding the investigation cases in name_matches.csv: I still do not understand (either for Type 1 or Type 2 cases) what it means to verify that the permno is correct. All of the Type 1 cases have the same permno between crsp_link and company_link. Should I simply check the company name in stocknames corresponding to the shared permno to see if it matches the co_name, or the call_co_name, or check the ticker? I am just looking for a procedural way to verify the permno for a given case.

Basically, we want to confirm that the company that held the call is the same firm as the one whose permno is shared across the two tables.

After looking through the Type 1 cases, it seems that in virtually all of them, according to stocknames, the same permno covers two or more different company names that changed over time (with matching namedt/nameenddt), and these different names produce the discrepancy between co_name and call_co_name. If it is simply a matter of the shared permno describing the same company in stocknames, then I will be able to go through and set investigate=FALSE, investigated=TRUE for all the Type 1 cases.

Same company, name changed over time => set investigate=FALSE, investigated=TRUE and note accordingly.
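
One procedural way to do the check is to look up the CRSP name that was in force for the permno on the call date and compare it with co_name by eye. A minimal sketch, using two of the stocknames rows for permno 80912 shown earlier in this thread; `name_on_date` is a hypothetical helper, not something in the repository:

```r
library(dplyr)

# Two of the name records for permno 80912 from the stocknames output above.
stocknames_toy <- tribble(
  ~permno, ~namedt,      ~nameenddt,   ~comnam,
  80912L,  "2007-11-06", "2015-02-11", "T G C INDUSTRIES INC",
  80912L,  "2015-02-12", "2016-06-30", "DAWSON GEOPHYSICAL CO NEW"
) %>%
  mutate(namedt = as.Date(namedt), nameenddt = as.Date(nameenddt))

# Return the company name(s) recorded for a permno on a given date.
name_on_date <- function(names_tbl, permno_value, on_date) {
  names_tbl %>%
    filter(permno == permno_value,
           namedt <= on_date, nameenddt >= on_date) %>%
    pull(comnam)
}

# Name at the time of the Q1 2011 call versus a later date.
name_at_call <- name_on_date(stocknames_toy, 80912L, as.Date("2011-05-02"))
name_later   <- name_on_date(stocknames_toy, 80912L, as.Date("2016-01-15"))
print(c(name_at_call, name_later))
```

If the name on the call date matches the company that held the call, a different name at another date is just a name change under the same permno.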

iangow commented 8 years ago

Here I have added file_name=1670264_T, permno=79477, co_name=Universal American Financial Corp to the manual matches.

Yes, that looks right.

Note that if a file_name has no good permno associated with it, I believe that adding that file_name to the manual matches with a blank permno will prevent the code from trying to find a bad match.
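
The logic I have in mind is roughly the following. This is a sketch of the assumed precedence rule on toy data, not a transcription of crsp_link.sql; the table names, file names, and permnos are made up, and `match_type = 1L` is just a placeholder for any automatic match type.

```r
library(dplyr)

# Hypothetical automatic matches and manual overrides.
auto_matches <- tibble(
  file_name = c("A_T", "B_T", "C_T"),
  permno    = c(10001L, 10002L, 10003L)
)
manual_matches <- tibble(
  file_name = c("B_T", "C_T"),
  permno    = c(20002L, NA_integer_)   # NA: no valid permno for this call
)

# Manual rows win: any file_name listed manually is dropped from the
# automatic matches, even when its manual permno is blank.
combined <- manual_matches %>%
  mutate(match_type = 0L) %>%
  bind_rows(
    auto_matches %>%
      anti_join(manual_matches, by = "file_name") %>%
      mutate(match_type = 1L)
  )

print(combined)
```

Under that rule, `C_T` ends up with a missing permno rather than the automatic (bad) match.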

rudyardrichter commented 7 years ago

I finished going through the investigate==TRUE cases in the name_matches sheet and added entries in the manual matches sheet where necessary.

Is there any other work to be done on this issue now?

iangow commented 7 years ago

Is there any other work to be done on this issue now?

I think that I need to process the data and then think about how to document what we've done. Then, we need to extend the process to new data.

iangow commented 7 years ago

Specifically, I think we need to work out a way to predict cases needing manual matching.
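
One possible starting point, offered only as an idea and not something settled in this thread: both problem cases above (WAB and TGE) involved a ticker that CRSP has recorded for more than one firm over time. A sketch that flags such tickers, using permno, permco, and ticker values taken from the stocknames output earlier in the thread:

```r
library(dplyr)

# Distinct (permno, permco, ticker) combinations from the rows shown above.
stocknames_toy <- tribble(
  ~permno, ~permco, ~ticker,
  81677L,  30913L,  "WAB",
  82107L,  2253L,   "WAB",
  82107L,  2253L,   "WABC",
  48389L,  21786L,  "TGE",
  80912L,  27333L,  "TGE",
  48389L,  21786L,  "UNS"
)

# Tickers that have belonged to more than one firm (permco) are
# candidates for a closer look before trusting a ticker-based match.
reused_tickers <- stocknames_toy %>%
  group_by(ticker) %>%
  summarise(n_permco = n_distinct(permco)) %>%
  filter(n_permco > 1)

print(reused_tickers)
```

Calls whose ticker appears in such a list could be routed to the investigation sheet first; whether that catches enough of the bad matches would need to be checked against the cases already resolved.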

iangow commented 7 years ago

I will split remaining related tasks into separate issues.