Closed iangow closed 7 years ago
Another illustration. First identify some problematic cases:
> crsp_link %>%
+ inner_join(permcos) %>%
+ inner_join(company_link %>% inner_join(permcos), by="file_name") %>%
+ filter(permco.x != permco.y) %>%
+ collect() %>%
+ anti_join(investigation)
Joining, by = "permno"
Joining, by = "permno"
Joining, by = "file_name"
# A tibble: 10 x 7
file_name permno.x match_type match_type_desc permco.x permno.y permco.y
<chr> <int> <int> <chr> <int> <int> <int>
1 3930276_T 81677 5 5. Match on ticker and exact name Soundex between company dates 30913 82107 2253
2 4150640_T 81677 5 5. Match on ticker and exact name Soundex between company dates 30913 82107 2253
3 4212178_T 81677 5 5. Match on ticker and exact name Soundex between company dates 30913 82107 2253
4 3952476_T 48389 7 7. Match ticker & fuzzy name Soundex between company dates 21786 80912 27333
5 4153751_T 48389 7 7. Match ticker & fuzzy name Soundex between company dates 21786 80912 27333
6 4212164_T 48389 7 7. Match ticker & fuzzy name Soundex between company dates 21786 80912 27333
7 4723832_T 48389 7 7. Match ticker & fuzzy name Soundex between company dates 21786 80912 27333
8 1670264_T 12785 2 2. Roll matches back & forward in StreetEvents 53766 79477 5981
9 611752_T 93179 2 2. Roll matches back & forward in StreetEvents 53305 70965 21407
10 1294427_T 19828 7 7. Match ticker & fuzzy name Soundex between company dates 20550 63060 4961
Let's take the first one. This does not appear to be a case of a name change:
> calls %>% filter(file_name=='3930276_T') %>% select(call_desc, call_date, ticker, co_name)
Source: query [?? x 4]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
call_desc call_date ticker co_name
<chr> <time> <chr> <chr>
1 Q1 2011 Westinghouse Air Brake Technologies Corp Earnings Conference Call 2011-04-26 14:00:00 WAB Westinghouse Air Brake Technologies Corp
But it does seem that the permno on crsp_link (82107) is wrong:
> stocknames %>% filter(permno==81677L)
Source: query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<int> <int> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 81677 30913 1995-06-16 2000-05-01 92974010 96038610 WAB WESTINGHOUSE AIR BRAKE CO NEW 1 1 3743 11 <NA> 1995-06-30 2014-06-30 2
2 81677 30913 2000-05-02 2014-06-30 92974010 92974010 WAB WABTEC CORP 1 1 3743 11 <NA> 1995-06-30 2014-06-30 2
> stocknames %>% filter(permno==82107L)
Source: query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<int> <int> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 82107 2253 1976-10-15 1983-06-30 95709010 45384010 IBCX INDEPENDENT BANKSHARES CORP 3 3 0 11 <NA> 1976-10-29 2014-06-30 2
2 82107 2253 1983-07-01 1987-01-08 95709010 95709010 WSAM WESTAMERICA BANCORPORATION 3 3 6711 11 <NA> 1976-10-29 2014-06-30 2
3 82107 2253 1987-01-09 1994-04-25 95709010 95709010 WAB WESTAMERICA BANCORPORATION 3 2 6025 11 <NA> 1976-10-29 2014-06-30 2
4 82107 2253 1994-04-26 2014-06-30 95709010 95709010 WABC WESTAMERICA BANCORPORATION 3 3 6060 11 <NA> 1976-10-29 2014-06-30 2
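The comparison above amounts to asking which permno carried the call's ticker on the call date. A minimal sketch of that check in base R, using a hypothetical in-memory copy of the stocknames rows shown above (`stocknames_local` and the helper `candidate_permnos()` are illustrative names, not part of the project code):

```r
# Hypothetical local copy of the relevant stocknames rows (values copied from
# the output above); in the thread, stocknames is a remote database table.
stocknames_local <- data.frame(
  permno    = c(81677L, 81677L, 82107L, 82107L, 82107L, 82107L),
  ticker    = c("WAB", "WAB", "IBCX", "WSAM", "WAB", "WABC"),
  comnam    = c("WESTINGHOUSE AIR BRAKE CO NEW", "WABTEC CORP",
                "INDEPENDENT BANKSHARES CORP", "WESTAMERICA BANCORPORATION",
                "WESTAMERICA BANCORPORATION", "WESTAMERICA BANCORPORATION"),
  namedt    = as.Date(c("1995-06-16", "2000-05-02", "1976-10-15",
                        "1983-07-01", "1987-01-09", "1994-04-26")),
  nameenddt = as.Date(c("2000-05-01", "2014-06-30", "1983-06-30",
                        "1987-01-08", "1994-04-25", "2014-06-30")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: permnos whose name record carries the given ticker
# on the given date (namedt <= date <= nameenddt).
candidate_permnos <- function(names_df, call_ticker, call_date) {
  hit <- names_df$ticker == call_ticker &
    names_df$namedt <= call_date &
    names_df$nameenddt >= call_date
  unique(names_df$permno[hit])
}

# Call 3930276_T: ticker WAB, call date 2011-04-26.
wab_candidates <- candidate_permnos(stocknames_local, "WAB",
                                    as.Date("2011-04-26"))
print(wab_candidates)
```

On these rows the only candidate is 81677, since 82107 carried the WAB ticker only from 1987 to 1994.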
One way to fix the issues is to get the file_name values and (with a little clean-up in Vim) dump them in the manual match sheet with the correct permno:
> crsp_link %>% filter(permno==82107L) %>% inner_join(calls) %>% select(file_name, co_name) %>% print(n=100)
Joining, by = "file_name"
Source: query [?? x 2]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
file_name co_name
<chr> <chr>
1 1009827_T Westinghouse Air Brake Technologies Corp
2 1051816_T Westinghouse Air Brake Technologies Corp
3 1096519_T Westinghouse Air Brake Technologies Corp
4 1151609_T Westinghouse Air Brake Technologies Corp
5 1206648_T Westinghouse Air Brake Technologies Corp
6 1269221_T Westinghouse Air Brake Technologies Corp
7 1351220_T Westinghouse Air Brake Technologies Corp
8 1396331_T Westinghouse Air Brake Technologies Corp
9 1477526_T Westinghouse Air Brake Technologies Corp
10 1525059_T Westinghouse Air Brake Technologies Corp
11 1599942_T Westinghouse Air Brake Technologies Corp
12 1633991_T Westinghouse Air Brake Technologies Corp
13 1747872_T Westinghouse Air Brake Technologies Corp
14 1817672_T Westinghouse Air Brake Technologies Corp
15 1889049_T Westinghouse Air Brake Technologies Corp
16 1984698_T Westinghouse Air Brake Technologies Corp
17 2075711_T Westinghouse Air Brake Technologies Corp
18 2151105_T Westinghouse Air Brake Technologies Corp
19 2301735_T Westinghouse Air Brake Technologies Corp
20 2467790_T Westinghouse Air Brake Technologies Corp
21 2767534_T Westinghouse Air Brake Technologies Corp
22 3029582_T Westinghouse Air Brake Technologies Corp
23 3215279_T Westinghouse Air Brake Technologies Corp
24 3431761_T Westinghouse Air Brake Technologies Corp
25 3733170_T Westinghouse Air Brake Technologies Corp
26 3930276_T Westinghouse Air Brake Technologies Corp
27 4150640_T Westinghouse Air Brake Technologies Corp
28 4212178_T Westinghouse Air Brake Technologies Corp
29 4729068_T Westinghouse Air Brake Technologies Corp
30 4785250_T Westinghouse Air Brake Technologies Corp
31 4863167_T Westinghouse Air Brake Technologies Corp
32 4926043_T Westinghouse Air Brake Technologies Corp
33 5011231_T Westinghouse Air Brake Technologies Corp
34 5058074_T Westinghouse Air Brake Technologies Corp
35 5125362_T Westinghouse Air Brake Technologies Corp
36 5198092_T Westinghouse Air Brake Technologies Corp
37 5282746_T Westinghouse Air Brake Technologies Corp
38 5343513_T Westinghouse Air Brake Technologies Corp
39 5434590_T Westinghouse Air Brake Technologies Corp
40 5507083_T Westinghouse Air Brake Technologies Corp
41 5616495_T Westinghouse Air Brake Technologies Corp
42 5679876_T Westinghouse Air Brake Technologies Corp
43 5766371_T Westinghouse Air Brake Technologies Corp
44 5830846_T Westinghouse Air Brake Technologies Corp
45 5910159_T Westinghouse Air Brake Technologies Corp
46 5983177_T Westinghouse Air Brake Technologies Corp
47 5983177_T Westinghouse Air Brake Technologies Corp
48 617198_T Westinghouse Air Brake Technologies Corp
49 641477_T Westinghouse Air Brake Technologies Corp
50 661789_T Westinghouse Air Brake Technologies Corp
51 689053_T Westinghouse Air Brake Technologies Corp
52 710788_T Westinghouse Air Brake Technologies Corp
53 731219_T Westinghouse Air Brake Technologies Corp
54 763408_T Westinghouse Air Brake Technologies Corp
55 796167_T Westinghouse Air Brake Technologies Corp
56 840472_T Westamerica Bancorporation
57 845360_T Westinghouse Air Brake Technologies Corp
58 875906_T Westinghouse Air Brake Technologies Corp
59 917665_T Westinghouse Air Brake Technologies Corp
60 952268_T Westinghouse Air Brake Technologies Corp
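As a rough alternative to the Vim clean-up, the rows for the manual match sheet could be built directly in R. The sketch below is an assumption about what those rows look like (file_name, permno, co_name, as used later in this thread) and works on a small hypothetical subset of the output above. It also drops the Westamerica Bancorporation call (840472_T), which should not be moved to the Westinghouse permno, and the duplicated 5983177_T row.

```r
# Hypothetical subset of the query result above (five of the 60 rows).
linked <- data.frame(
  file_name = c("5983177_T", "5983177_T", "796167_T", "840472_T", "845360_T"),
  co_name   = c("Westinghouse Air Brake Technologies Corp",
                "Westinghouse Air Brake Technologies Corp",
                "Westinghouse Air Brake Technologies Corp",
                "Westamerica Bancorporation",
                "Westinghouse Air Brake Technologies Corp"),
  stringsAsFactors = FALSE
)

# Keep only calls held by the firm whose permno is being corrected.
to_fix <- linked[grepl("^Westinghouse Air Brake", linked$co_name), ]

# One row per file_name, with the corrected permno (81677, per stocknames).
manual_rows <- data.frame(
  file_name = to_fix$file_name,
  permno    = 81677L,
  co_name   = to_fix$co_name,
  stringsAsFactors = FALSE
)
manual_rows <- manual_rows[!duplicated(manual_rows$file_name), ]
print(manual_rows)
```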
The problem I see here is that I have no idea how Vincent (prior RA) identified and fixed these cases.
OK. Let's do one more. Here the PERMNO should be 80912.
> crsp_link %>%
+ inner_join(permcos) %>%
+ inner_join(company_link %>% inner_join(permcos), by="file_name") %>%
+ filter(permco.x != permco.y) %>%
+ collect() %>%
+ anti_join(investigation)
Joining, by = "permno"
Joining, by = "permno"
Joining, by = "file_name"
# A tibble: 10 x 7
file_name permno.x match_type match_type_desc permco.x permno.y permco.y
<chr> <dbl> <int> <chr> <dbl> <dbl> <dbl>
1 3930276_T 81677 0 0. Manual matches 30913 82107 2253
2 4150640_T 81677 0 0. Manual matches 30913 82107 2253
3 4212178_T 81677 0 0. Manual matches 30913 82107 2253
4 3952476_T 48389 7 7. Match ticker & fuzzy name Soundex between company dates 21786 80912 27333
5 4153751_T 48389 7 7. Match ticker & fuzzy name Soundex between company dates 21786 80912 27333
6 4212164_T 48389 7 7. Match ticker & fuzzy name Soundex between company dates 21786 80912 27333
7 4723832_T 48389 7 7. Match ticker & fuzzy name Soundex between company dates 21786 80912 27333
8 1670264_T 12785 2 2. Roll matches back & forward in StreetEvents 53766 79477 5981
9 611752_T 93179 2 2. Roll matches back & forward in StreetEvents 53305 70965 21407
10 1294427_T 19828 7 7. Match ticker & fuzzy name Soundex between company dates 20550 63060 4961
> calls %>% filter(file_name=='3952476_T') %>% select(call_desc, call_date, ticker, co_name)
Source: query [?? x 4]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
call_desc call_date ticker co_name
<chr> <time> <chr> <chr>
1 Q1 2011 TGC Industries Inc Earnings Conference Call 2011-05-02 13:30:00 TGE TGC Industries Inc
> stocknames %>% filter(permno==48389L)
Source: query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<dbl> <dbl> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 48389 21786 1969-05-21 1979-05-09 90311910 89881310 TGE TUCSON GAS & ELECTRIC CO 1 1 4911 11 <NA> 1969-05-29 2014-08-29 2
2 48389 21786 1979-05-10 1996-05-19 90311910 89881310 TEP TUCSON ELECTRIC POWER CO 1 1 4911 11 <NA> 1969-05-29 2014-08-29 2
3 48389 21786 1996-05-20 1998-01-01 90311910 89881370 TEP TUCSON ELECTRIC POWER CO 1 1 4911 11 <NA> 1969-05-29 2014-08-29 2
4 48389 21786 1998-01-02 2012-05-13 90311910 90920510 UNS UNISOURCE ENERGY CORP 1 1 4911 11 <NA> 1969-05-29 2014-08-29 2
5 48389 21786 2012-05-14 2014-08-15 90311910 90311910 UNS U N S ENERGY CORP 1 1 4911 11 <NA> 1969-05-29 2014-08-29 2
> stocknames %>% filter(permno==80912L)
Source: query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<dbl> <dbl> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 80912 27333 1994-09-22 1998-11-08 23936010 87241710 TGCI T G C INDUSTRIES INC 3 3 2670 11 <NA> 1994-09-30 2016-06-30 2
2 80912 27333 1998-11-09 1998-11-26 23936010 87241730 TGDC T G C INDUSTRIES INC 3 3 2670 11 <NA> 1994-09-30 2016-06-30 2
3 80912 27333 1998-11-27 2002-05-23 23936010 87241730 TGCI T G C INDUSTRIES INC 3 3 2670 11 <NA> 1994-09-30 2016-06-30 2
4 80912 27333 2002-05-24 2005-04-17 23936010 87241730 <NA> T G C INDUSTRIES INC 3 0 2670 11 <NA> 1994-09-30 2016-06-30 2
5 80912 27333 2005-04-18 2007-11-05 23936010 87241730 TGE T G C INDUSTRIES INC 3 2 1382 11 <NA> 1994-09-30 2016-06-30 2
6 80912 27333 2007-11-06 2015-02-11 23936010 87241730 TGE T G C INDUSTRIES INC 3 3 1382 11 <NA> 1994-09-30 2016-06-30 2
7 80912 27333 2015-02-12 2016-06-30 23936010 23936010 DWSN DAWSON GEOPHYSICAL CO NEW 3 3 1382 11 <NA> 1994-09-30 2016-06-30 2
So, again I just add rows to the manual match spreadsheet:
> crsp_link %>% filter(permno==48389L) %>% inner_join(calls) %>% filter(co_name ~ 'TGC') %>% select(file_name, co_name) %>% print(n=100)
Joining, by = "file_name"
Source: query [?? x 2]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
file_name co_name
<chr> <chr>
1 1357818_T TGC Industries Inc
2 1400594_T TGC Industries Inc
3 1467719_T TGC Industries Inc
4 1530148_T TGC Industries Inc
5 1598763_T TGC Industries Inc
6 1666748_T TGC Industries Inc
7 1759897_T TGC Industries Inc
8 1819060_T TGC Industries Inc
9 1896145_T TGC Industries Inc
10 1996881_T TGC Industries Inc
11 2089129_T TGC Industries Inc
12 2161614_T TGC Industries Inc
13 2315390_T TGC Industries Inc
14 2498521_T TGC Industries Inc
15 2733525_T TGC Industries Inc
16 3019959_T TGC Industries Inc
17 3225171_T TGC Industries Inc
18 3431032_T TGC Industries Inc
19 3736405_T TGC Industries Inc
20 3914581_T TGC Industries Inc
21 3952476_T TGC Industries Inc
22 4153751_T TGC Industries Inc
23 4212164_T TGC Industries Inc
24 4690883_T TGC Industries Inc
25 4723832_T TGC Industries Inc
26 4783332_T TGC Industries Inc
27 4837288_T TGC Industries Inc
28 4859917_T TGC Industries Inc
29 4920985_T TGC Industries Inc
30 4997824_T TGC Industries Inc
31 5054918_T TGC Industries Inc
32 5116548_T TGC Industries Inc
33 5127691_T TGC Industries Inc
34 5198142_T TGC Industries Inc
35 5285673_T TGC Industries Inc
> source('db_matches/import_manual_permno_matches.R')
Sheet successfully identified: "streetevents.manual_permno_matches"
Accessing worksheet titled 'manual_permno_matches'.
Downloading: 29 kB
No encoding supplied: defaulting to UTF-8.
> system("psql -f db_matches/crsp_link.sql")
DROP TABLE
SELECT 275005
ALTER TABLE
CREATE INDEX
So these would be additional cases to investigate (after working through the "investigation" sheet) along the lines of what I did for "Westinghouse Air Brake Technologies Corp" and "TGC Industries Inc" above:
> crsp_link %>%
+ inner_join(permcos) %>%
+ inner_join(company_link %>% inner_join(permcos), by="file_name") %>%
+ filter(permco.x != permco.y) %>%
+ collect() %>%
+ anti_join(investigation) %>%
+ filter(match_type !=0)
Joining, by = "permno"
Joining, by = "permno"
Joining, by = "file_name"
# A tibble: 3 x 7
file_name permno.x match_type match_type_desc permco.x permno.y permco.y
<chr> <dbl> <int> <chr> <dbl> <dbl> <dbl>
1 611752_T 93179 2 2. Roll matches back & forward in StreetEvents 53305 70965 21407
2 1294427_T 19828 7 7. Match ticker & fuzzy name Soundex between company dates 20550 63060 4961
3 1670264_T 12785 2 2. Roll matches back & forward in StreetEvents 53766 79477 5981
We need to resolve these cases in some way. If the matched permno is correct, set investigate to FALSE, investigated to TRUE, and add a comment. Otherwise, perhaps add to manual matches (and set investigate to FALSE, investigated to TRUE).
Regarding the investigation cases in name_matches.csv: I still do not understand (either for Type 1 or Type 2 cases) what it means to verify that the permno is correct. All of the Type 1 Cases have the same permno between crsp_link and company_link. Should I simply check the company name in stocknames corresponding to the shared permno to see if it matches the co_name, or the call_co_name, or check the ticker? I am just looking for a procedural way to verify the permno for a given case.
After looking through the Type 1 cases, it seems that virtually all the cases, according to stocknames, have the same permno across two or more different company names which changed over time (and with matching namedt/nameenddt), with these different names resulting in the discrepancy between co_name and call_co_name. If it is simply a matter of the shared permno describing the same company in stocknames, then I will be able to go through and set investigate=FALSE, investigated=TRUE for all the Type 1 cases.
I have also started working on some of the extra cases to investigate (generated as below) by adding the correct permno
to the manual match sheet. It might be helpful if you could perhaps check one of them to make sure that I have the right idea with how to handle these cases. For example:
> crsp_link %>%
+ inner_join(permcos) %>%
+ inner_join(company_link %>% inner_join(permcos), by="file_name",
+ suffix=c(".crsp", ".comp")) %>%
+ filter(permco.crsp != permco.comp) %>%
+ collect() %>%
+ anti_join(investigation) %>%
+ filter(match_type !=0)
Joining, by = "permno"
Joining, by = "permno"
Joining, by = "file_name"
# A tibble: 3 × 7
file_name permno.crsp match_type match_type_desc permco.crsp permno.comp permco.comp
<chr> <dbl> <int> <chr> <dbl> <dbl> <dbl>
1 1670264_T 12785 2 2. Roll matches back & forward in StreetEvents 53766 79477 5981
2 611752_T 93179 2 2. Roll matches back & forward in StreetEvents 53305 70965 21407
3 1294427_T 19828 7 7. Match ticker & fuzzy name Soundex between company dates 20550 63060 4961
> calls %>% filter(file_name=='1670264_T')
Source: query [?? x 9]
Database: postgres 9.4.2 [rudyardrichter@aaz.chicagobooth.edu:5432/postgres]
file_path file_name ticker co_name call_desc call_date city call_type last_update
<chr> <chr> <chr> <chr> <chr> <dttm> <chr> <int> <dttm>
1 StreetEvents_historical_backfill_through_May2013/dir_1/1670264_T.xml 1670264_T UAM Universal American Corp Q3 2007 Universal American Earnings Conference Call 2007-11-02 13:30:00 Rye Brook 1 2007-11-02 14:47:30
> stocknames %>% filter(permno==12785)
Source: query [?? x 16]
Database: postgres 9.4.2 [rudyardrichter@aaz.chicagobooth.edu:5432/postgres]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<dbl> <dbl> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 12785 53766 2011-05-02 2016-06-30 91338E10 91338E10 UAM UNIVERSAL AMERICAN CORP NEW 1 1 6311 11 <NA> 2011-05-31 2016-06-30 2
> stocknames %>% filter(permno==79477)
Source: query [?? x 16]
Database: postgres 9.4.2 [rudyardrichter@aaz.chicagobooth.edu:5432/postgres]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<dbl> <dbl> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 79477 5981 1983-05-12 1996-07-22 91337710 91359010 UHCO UNIVERSAL HOLDING CORP 1 3 6310 11 <NA> 1983-05-31 2011-04-29 2
2 79477 5981 1996-07-23 2007-12-02 91337710 91337710 UHCO UNIVERSAL AMERICAN FINANCIAL CO 1 3 6310 11 <NA> 1983-05-31 2011-04-29 2
3 79477 5981 2007-12-03 2011-04-29 91337710 91337710 UAM UNIVERSAL AMERICAN CORP 1 1 6311 11 <NA> 1983-05-31 2011-04-29 2
Here I have added file_name=1670264_T, permno=79477, co_name=Universal American Financial Corp to the manual matches.
It seems notifications of updates here have been going into my junk folder for the last five weeks. Let me look at these and get back to you.
"Regarding the investigation cases in name_matches.csv: I still do not understand (either for Type 1 or Type 2 cases) what it means to verify that the permno is correct. All of the Type 1 Cases have the same permno between crsp_link and company_link. Should I simply check the company name in stocknames corresponding to the shared permno to see if it matches the co_name, or the call_co_name, or check the ticker? I am just looking for a procedural way to verify the permno for a given case."
Basically, we want to confirm that the company that held the call is the same firm as the one that has the permno shared across the two tables.
"After looking through the Type 1 cases, it seems that virtually all the cases, according to stocknames, have the same permno across two or more different company names which changed over time (and with matching namedt/nameenddt), with these different names resulting in the discrepancy between co_name and call_co_name. If it is simply a matter of the shared permno describing the same company in stocknames, then I will be able to go through and set investigate=FALSE, investigated=TRUE for all the Type 1 cases."
Same company, name changed over time => set investigate=FALSE, investigated=TRUE and note accordingly.
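For the Type 1 check, one way to see whether a shared permno has simply carried several names over time is to list its name history from stocknames in date order. This is only a sketch in base R on a hypothetical local copy of the rows for permnos 79477 and 12785 shown earlier; `name_history` and `names_for_permno()` are illustrative names, and deciding whether co_name and call_co_name both belong to that history remains a manual judgment.

```r
# Hypothetical local copy of stocknames rows (values from the output above).
name_history <- data.frame(
  permno    = c(79477L, 79477L, 79477L, 12785L),
  comnam    = c("UNIVERSAL HOLDING CORP", "UNIVERSAL AMERICAN FINANCIAL CO",
                "UNIVERSAL AMERICAN CORP", "UNIVERSAL AMERICAN CORP NEW"),
  namedt    = as.Date(c("1983-05-12", "1996-07-23", "2007-12-03",
                        "2011-05-02")),
  nameenddt = as.Date(c("1996-07-22", "2007-12-02", "2011-04-29",
                        "2016-06-30")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every name a permno has carried, oldest first.
names_for_permno <- function(names_df, target_permno) {
  rows <- names_df[names_df$permno == target_permno, ]
  rows <- rows[order(rows$namedt), ]
  unique(rows$comnam)
}

history_79477 <- names_for_permno(name_history, 79477L)
print(history_79477)

# More than one name means the permno has a name-change history to compare
# against co_name and call_co_name.
has_name_change <- length(history_79477) > 1
```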
"Here I have added file_name=1670264_T, permno=79477, co_name=Universal American Financial Corp to the manual matches."
Yes, that looks right.
Note that if a file_name has no good permno associated with it, I believe that adding that file_name to the manual matches without a permno (i.e., blank) will prevent the code from trying to find a bad match.
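How db_matches/crsp_link.sql actually combines manual and automatic matches is not shown in this thread. The sketch below only illustrates the behaviour described above, under the assumption that any file_name present in the manual sheet takes precedence over the automatic match, even when its permno is blank (NA). The file names and permnos are made up for illustration.

```r
# Made-up automatic matches.
auto_matches <- data.frame(
  file_name = c("A_T", "B_T", "C_T"),
  permno    = c(111L, 222L, 333L),
  stringsAsFactors = FALSE
)

# Made-up manual sheet: B_T gets a corrected permno, C_T is deliberately blank.
manual_matches <- data.frame(
  file_name = c("B_T", "C_T"),
  permno    = c(999L, NA),
  stringsAsFactors = FALSE
)

# ASSUMPTION: automatic matches are used only for file_names that do not
# appear in the manual sheet at all.
auto_kept <- auto_matches[!auto_matches$file_name %in%
                            manual_matches$file_name, ]

combined <- rbind(manual_matches, auto_kept)
combined <- combined[order(combined$file_name), ]
print(combined)
```

Under that assumption, C_T ends up with no permno rather than the automatic (and presumably bad) match 333.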
I finished going through the investigate==TRUE cases in the name_matches sheet and added entries in the manual matches sheet where necessary.
Is there any other work to be done on this issue now?
"Is there any other work to be done on this issue now?"
I think that I need to process the data and then think about how to document what we've done. Then, we need to extend the process to new data.
Specifically, I think we need to work out a way to predict cases needing manual matching.
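One possible heuristic for predicting such cases, offered only as a sketch and not as the project's method: flag any linked permno that has no stocknames name record covering the call date. The data below are a hypothetical local copy of values from this thread (`links`, `name_ranges` and `needs_review` are illustrative names). The check would catch 1670264_T linked to permno 12785, whose first name record starts in 2011, but it would not catch cases like TGC Industries, where the wrong permno does have a record covering the call date.

```r
# Hypothetical local copies of link rows and stocknames date ranges.
links <- data.frame(
  file_name = c("1670264_T", "3930276_T"),
  permno    = c(12785L, 81677L),
  call_date = as.Date(c("2007-11-02", "2011-04-26")),
  stringsAsFactors = FALSE
)
name_ranges <- data.frame(
  permno    = c(12785L, 81677L, 81677L),
  namedt    = as.Date(c("2011-05-02", "1995-06-16", "2000-05-02")),
  nameenddt = as.Date(c("2016-06-30", "2000-05-01", "2014-06-30"))
)

# Pair each link with every name record for its permno, then test coverage.
joined  <- merge(links, name_ranges, by = "permno")
covered <- joined$call_date >= joined$namedt &
  joined$call_date <= joined$nameenddt
covered_files <- unique(joined$file_name[covered])

# Links with no covering name record are candidates for manual review.
needs_review <- links[!links$file_name %in% covered_files, ]
print(needs_review)
```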
I will split remaining related tasks into separate issues.
@rudyardrichter:
(CC @azakolyukina)
Do you have an account (e.g., Gmail address) through which I can share a Google Sheets document?
-Ian