Closed iangow closed 7 years ago
This document was produced using code in match_analysis.Rmd
. (To make .md
version, run knit("match_analysis.Rmd")
followed by pandoc('match_analysis.md', format='markdown_github')
))
Our goal is to match StreetEvents conference calls to PERMNOs on CRSP. StreetEvents conferences calls are identified (pretty much) by their file_name
values.
We have an "automatic" match created by code in crsp_link.sql
and stored in the database table streetevents.crsp_link
. I put "automatic" in quotes because this code does leverage manual matches in a Google Sheets document.
We also have a table that represents our current definitive mapping in streetevents.company_link
. But this table is out of date and we need to update it. Unfortunately, the process by which it was created is not well understood.
So the first order of business is to examine differences between streetevents.crsp_link
and streetevents.company_link
:
streetevents.company_link
generally better?permco
) but different security (i.e., different permno
).
permno
in general in streetevents.company_link
?streetevents.company_link
and streetevents.crsp_link
will often represent bad matches in streetevents.company_link
that we identified and corrected. How can we identify these? One known issue is the tendency of ThomsonReuters (the provider of StreetEvents) to "backdate" ticker and company names for acquired companies. These can usually be identified by discrepancies between co_name
and call_desc
.# Use Sys.setenv(PGUSER = "rudyardrichter", PGPASSWORD = "xxxxxx") if necessary.
pg <- src_postgres(host = "aaz.chicagobooth.edu", dbname = "postgres")
tbl_pg <- function(table) {
tbl(pg, sql(paste0("SELECT * FROM ", table)))
}
# Run code in streetevents/crsp_link.sql to create
# streetevents.crsp_link
crsp_link <- tbl_pg("streetevents.crsp_link")
company_link <- tbl_pg("streetevents.company_link")
manual_permno_matches <- tbl_pg("streetevents.manual_permno_matches")
calls <- tbl_pg("streetevents.calls")
stocknames <- tbl_pg("crsp.stocknames")
Here's a small snippet of data from the streetevents.calls
. (More detailed information about the calls (what was said, by whom, etc.) is found in streetevents.speaker_data
.)
calls %>%
filter(call_type==1L, !is.na(call_date)) %>%
select(file_name, ticker, co_name, call_desc, call_date) %>%
top_n(10) %>%
mutate(call_date=sql("call_date::date")) %>%
kable()
## Selecting by call_date
file_name | ticker | co_name | call_desc | call_date |
---|---|---|---|---|
6044956_T | FC | Franklin Covey Co | Q3 2016 Franklin Covey Co Earnings Call | 2016-06-30 |
6038410_T | MU | Micron Technology Inc | Q3 2016 Micron Technology Inc Earnings Call | 2016-06-30 |
6040656_T | SCHN | Schnitzer Steel Industries Inc | Q3 2016 Schnitzer Steel Industries Inc Earnings Call | 2016-06-30 |
6043433_T | LNN | Lindsay Corp | Q3 2016 Lindsay Corp Earnings Call | 2016-06-30 |
6035947_T | STZ | Constellation Brands Inc | Q1 2017 Constellation Brands Inc Earnings Call | 2016-06-30 |
6044926_T | IRET | Investors Real Estate Trust | Q4 2016 Investors Real Estate Trust Earnings Call | 2016-06-30 |
6044900_T | PAYX | Paychex Inc | Q4 2016 Paychex Inc Earnings Call | 2016-06-30 |
5980908_T | CAG | ConAgra Foods Inc | Q4 2016 ConAgra Foods Inc Earnings Call | 2016-06-30 |
6050288_T | SBLK | Star Bulk Carriers Corp | Q1 2016 Star Bulk Carriers Corp Earnings Call | 2016-06-30 |
5975801_T | DRI | Darden Restaurants Inc | Q4 2016 Darden Restaurants Inc Earnings Call | 2016-06-30 |
streetevents.crsp_link
and streetevents.company_link
Let's join these two tables using file_name
to see how often we get the same permno
in both tables.
# How many firms have the same PERMNO as identified with
# the ticker match?
match_compare <-
crsp_link %>%
inner_join(company_link, by="file_name") %>%
mutate(same_permno=permno.x==permno.y) %>%
compute()
It seems we get the same match in 99% of cases.
Here are the raw numbers:
match_compare %>%
group_by(same_permno) %>%
summarize(count=n()) %>%
kable()
same_permno | count |
---|---|
NA | 3 |
TRUE | 75096 |
FALSE | 608 |
To facilitate examination of the cases where the two tables differ, I saved data to CSV, which I then made into a Google Sheets document here.
library(readr)
match_compare %>%
filter(!same_permno) %>%
inner_join(calls) %>%
as.data.frame() %>%
write_csv("permno_match_diffs.csv")
## Joining, by = "file_name"
Running the code above is easiest with a recent version of RStudio. After installing that, clone this repository and then open the .Rproj
file in the repository's root directory.
@rudyardrichter We may need to fix permissions for you to be able to run the code you mentioned. But I believe the tables have been updated recently, so it shouldn't necessary at this stage. I could take a closer look tomorrow.
@iangow: Sorry, I figured out what the problem was there. I'm just reading through everything for now; I will probably have some questions about the issue soon, if you don't mind.
No problem. Just ask here (I get notified of any comments under this issue).
@iangow Some initial questions:
match_type
is better?permno
?file_name
— then, does a given file_name
relate to co_name
though its row in streetevents.calls
?Thanks for helping me get caught up on this.
What is the metric of quality for matches? Is it that lower
match_type
is better?
There is no metric, but the code tends to look for less-ambiguous matches first and the numbering reflects that.
What is
permno
?
The main firm (actually security) identifier on CRSP. Take a look at the table crsp.company
for more insight.
So StreetEvents calls are referred to by
file_name
— then, does a givenfile_name
relate toco_name
though its row instreetevents.calls
?
More or less.
The main task for now is to take a sample of the 608 areas of disagreement and identify ways that we could flag such cases _without using data from streetevents.company_link
_.
The table streetevents.company_link
is a "gold standard" of sorts, but we did not do a great job of documenting how we created it. We want to extend the logic underlying its creation to the tens of thousands of new entries we have on StreetEvents since that table was created.
@iangow:
The access request for the Google sheet I sent you is for the sheet you mentioned in the issue statement (towards the end).
Relatedly, it looks like I will need permission to access the crsp
schema:
postgres=> SELECT * FROM crsp.company;
ERROR: permission denied for schema crsp
Though, I don't see company
in the list of tables for the crsp
schema. (I still get the permission error on the tables that are definitely there.)
Sorry, it's comp.company
. I meant crsp.stocknames
. Similar tables, different sources (CRSP versus Compustat).
This should cover the permissions issue: GRANT crsp_basic TO rudyardrichter ;
.
Okay, thanks.
When I run that command I get: ERROR: must have admin option on role "crsp_basic"
Of course. I ran that command to give you access.
I just put it here so I had a record of what I did. I don't really have a systematic approach to managing permissions on Nastia's database.
Ah, I see. Seems to be working now—thank you!
From 7bd4da6:
Most, but not all cases with differing permno
have the same permco—diff_permco_summary
:
same_permco | count |
---|---|
FALSE | 117 |
TRUE | 491 |
Is there any documentation on the significance of the data in crsp.stocknames
(e.g. namedum
, hexcd
)? Moreover would these potentially have any relevance to the issue in the first place?
If so, I think we should focus on the 117 cases with different PERMCOs. Are there any indicators that would work to flag those cases? For example, does the call_desc
suggest the company's name was changed?
I think the documentation of crsp.stocknames
might be relevant for the same-PERMCO cases. But I suspect the different-PERMCO cases may be the ones to focus on first.
Some of the 117 cases with both a different permno
and a different permco
have disagreeing co_name
and call_desc
—for instance there are several instances of the following:
co_name |
call_desc |
---|---|
Web.com Group Inc | ...Interland, Inc. Earnings Conference Call |
Silicon Graphics International Corp | ...Rackable Systems, Inc.... |
Crestwood Midstream Partners LP | ...QUICKSILVER GAS SERVICES LP... |
See the table diff_permco_calls
from my most recent commit, 7ad6487.
However, some of these disagreements for same_permco==False
entries are not as blatant, for instance:
co_name |
call_desc |
---|---|
Universal American Corp | Q4 2005 Universal American Financial Corp. Earnings |
And there are entries with same_permco==True
yet having similar small differences:
co_name |
call_desc |
---|---|
Haverty Furniture Companies Inc | Q3 2005 HAVERTY FURNITURE INC Earnings Conference Call |
We have some code to extract the "Interland, Inc." portion of call_desc
and compare it with co_name
. Perhaps this would be helpful. I think we'd want to be fairly conservative so that the Haverty and Universal cases are not flagged (looking at those names, I think some other issue must've caused us to pick a different PERMNO for these.
In summary, we have three cases right now.
call_desc
does not match co_name
. We need to work out first how to flag these, then how to fix them (it may be as simple as manually matching names).call_desc
does not matches (approximately) co_name
. Can we deduce how we matched these? Are these found in the Google Sheets document as manual matches?Regarding 3—it appears that none of the file_name
s for cases with different permno
appear in manual_permno_matches
:
diff_permco_calls_manual <-
diff_permco_calls %>%
inner_join(manual_permno_matches, by="file_name")
> count(diff_permco_calls_manual)
...
0
(From b4beb4a.)
Is there a way to source SQL code/functions like in db_matches/extract_conm.sql
for use in dplyr
functions?
Sorry for the delay. I am traveling right now.
This would work:
Sys.setenv(PGHOST = "aaz.chicagobooth.edu", PGDATABASE = "postgres")
query <- paste(readLines("db_matches/extract_conm.sql"), collapse="\n")
library(dplyr)
pg <- src_postgres()
library(RPostgreSQL)
dbGetQuery(pg$con, sql(query))
Is that code working for you? I keep getting the following error:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: permission denied for schema public
)
NULL
Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
Could not create executeCREATE OR REPLACE FUNCTION extract_conm(call_desc text)
RETURNS text AS $$
DECLARE
regex text;
matches text[];
BEGIN
-- Get text between e.g., "2004" or "2005/06" and "Results" or "Earnings"
regex := '(?:[0-9]{4}|[0-9]{4}/[0-9]{2})(.*)(?=Results|Earnings)';
matches := regexp_matches(call_desc, regex);
RETURN matches[1];
END;
$$ LANGUAGE plpgsql;
This may fix it:
postgres=# GRANT CREATE ON SCHEMA public TO rudyardrichter ;
GRANT
So try again.
Now instead of permission denied for schema public
I get must be owner of function extract_conm
Would it be better if I just rewrote the extract_conm
function in R?
Sure. You could just use PostgreSQL functions for the current version:
SELECT call_desc,
UNNEST(regexp_matches(call_desc, '(?:[0-9]{4}|[0-9]{4}/[0-9]{2})(.*)(?=Results|Earnings)'))
FROM streetevents.calls
LIMIT 10;
Or, using dplyr
:
Sys.setenv(PGHOST = "aaz.chicagobooth.edu", PGDATABASE = "postgres")
library(dplyr)
pg <- src_postgres()
regex <- '(?:[0-9]{4}|[0-9]{4}/[0-9]{2})(.*)(?=Results|Earnings)'
calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
calls %>%
mutate(co_name_matches=regexp_matches(call_desc, regex)) %>%
mutate(co_name_alt=sql("trim(both from co_name_matches[1])")) %>%
select(file_name, call_desc, co_name_alt, call_desc)
Of the cases with both different permno and different permco, most (but not all) have a mismatch between co_name
and the company name extracted from the call description, call_co_name
:
diff_permco_co_names_summary
same_co_name | count |
---|---|
FALSE | 107 |
TRUE | 9 |
However, the function to compare the company name with the company name in the call description (compareNames
, 64c85e0) is rather lenient, ignoring most punctuation as well as "Inc", "Ltd", "Co", etc.; there are only 3 cases with an exact match (company "Westinghouse Air Brake Technologies Corp").
The cases with different permno
and permco
boil down to 22 company name vs. call description company name matchings:
diff_permco_distinct_co_names
co_name | call_co_name | same_co_name |
---|---|---|
Accelrys Inc | Pharmacopeia | FALSE |
Caesars Entertainment Corp | Park Place Entertainment | FALSE |
Crestwood Midstream Partners LP | QUICKSILVER GAS SERVICES LP | FALSE |
IMMC Corp | Immunicon Corporation | FALSE |
Monster Beverage Corp | Hansen Natural | FALSE |
Monster Beverage Corp | Hansen Natural Corp | FALSE |
OfficeMax Inc | Boise Cascade | FALSE |
Pilgrims Pride Corp | Pilgrim`s Pride | TRUE |
Silicon Graphics International Corp | Rackable Systems, Inc. | FALSE |
Six Flags Entertainment Corp | Six Flags, Inc. | FALSE |
TGC Industries Inc | TGC | FALSE |
TGC Industries Inc | TGC Industries Inc | TRUE |
TGC Industries Inc | TGC Industries | TRUE |
The Wendy's Co | Wendy's International | FALSE |
The Wendy's Co | Wendy`s International | FALSE |
Universal American Corp | Universal American Financial Corp. | FALSE |
Universal American Corp | Universal American | TRUE |
Universal American Corp | Universal American Financial | FALSE |
Web.com Group Inc | Interland, Inc. | FALSE |
Web.com Group Inc | Website Pros, Inc. | FALSE |
Web.com Group Inc | Interland | FALSE |
Westinghouse Air Brake Technologies Corp | Westinghouse Air Brake Technologies Corp | TRUE |
I can alo, of course, modify the behavior of compareNames
so that, for instance, the Universal American Corp cases all match.
There must be a different issue with the handful of cases with differing permno
and permco
but exact company name matches. Given that the number of such cases is very small it seems that these could potentially be added as manual matches. (However, I would imagine we would nonetheless want to make sure that the same issue is not contributing to the other cases with differing permno
and permco
.)
Another observation about the cases with different permno
: the ones with same permco
tend to have match_type
1, while the cases with different permco
have match_type
2, 5, or 7.
same_permco_match_type_summary
match_type | count |
---|---|
1 | 471 |
2 | 20 |
diff_permco_match_type_summary
match_type | count |
---|---|
2 | 78 |
5 | 3 |
7 | 36 |
(These are from fd074a4.)
Also for reference, match_type_permno_permco_summary
(from 0200f54) compares the number of entries having same_permno
and/or same_permco
for each match_type
. The only match_type
which do not have same_permno
and same_permco
are 1, 2, 5, and 7.
match_type_permno_permco_summary
match_type | same_permno | same_permco | count |
---|---|---|---|
0 | TRUE | TRUE | 5147 |
0 | TRUE | FALSE | 0 |
0 | FALSE | TRUE | 0 |
0 | FALSE | FALSE | 0 |
1 | TRUE | TRUE | 461756 |
1 | TRUE | FALSE | 0 |
1 | FALSE | TRUE | 6554 |
1 | FALSE | FALSE | 0 |
2 | TRUE | TRUE | 44199 |
2 | TRUE | FALSE | 0 |
2 | FALSE | TRUE | 502 |
2 | FALSE | FALSE | 1298 |
3 | TRUE | TRUE | 1061 |
3 | TRUE | FALSE | 0 |
3 | FALSE | TRUE | 0 |
3 | FALSE | FALSE | 0 |
4 | TRUE | TRUE | 12 |
4 | TRUE | FALSE | 0 |
4 | FALSE | TRUE | 0 |
4 | FALSE | FALSE | 0 |
5 | TRUE | TRUE | 899 |
5 | TRUE | FALSE | 0 |
5 | FALSE | TRUE | 0 |
5 | FALSE | FALSE | 48 |
7 | TRUE | TRUE | 3125 |
7 | TRUE | FALSE | 0 |
7 | FALSE | TRUE | 0 |
7 | FALSE | FALSE | 1354 |
8 | TRUE | TRUE | 1 |
8 | TRUE | FALSE | 0 |
8 | FALSE | TRUE | 0 |
8 | FALSE | FALSE | 0 |
9 | TRUE | TRUE | 1481 |
9 | TRUE | FALSE | 0 |
9 | FALSE | TRUE | 0 |
9 | FALSE | FALSE | 0 |
10 | TRUE | TRUE | 8 |
10 | TRUE | FALSE | 0 |
10 | FALSE | TRUE | 0 |
10 | FALSE | FALSE | 0 |
There must be a different issue with the handful of cases with differing
permno
andpermco
but exact company name matches. Given that the number of such cases is very small it seems that these could potentially be added as manual matches. (However, I would imagine we would nonetheless want to make sure that the same issue is not contributing to the other cases with differingpermno
andpermco
.)
The question is: How can we identify these cases in the first place? What triggered the decision to override the initial match in the first place? Then, if these cases are not manual matches, how were they matched to permno
/permco
values?
From my comment above:
In summary, we have three cases right now.
- Same PERMCO, different PERMNO. Ignore for now. We will later need to work out how to choose the PERMNO (if necessary to choose).
- Different PERMNO and name in
call_desc
does not matchco_name
. We need to work out first how to flag these, then how to fix them (it may be as simple as manually matching names).- Different PERMNO, name in
call_desc
does not matches (approximately)co_name
. Can we deduce how we matched these? Are these found in the Google Sheets document as manual matches?
Oops. Item 3 should be " Different PERMNO, name in call_desc
does not matches (approximately) co_name
." Clearly, if the company name matches, some other basis was used to replace the initially matched permno
.
Regarding item 2, I wonder how many "false positives" the code you're testing would produce. How many cases have the same PERMNO, but same_co_name==FALSE
? If there are many, would it be feasible to go through and identify ones that do not require further investigation? (For example, a Google Sheets document could be created with co_name
and call_co_name
and an investigate
column. If there are thousands of such cases, I think that would be too much work, but a few hundred might be pretty easy to go through. It would be necessary to have a way of keeping track of cases (file_name
or co_name
values) that have been evaluated in this way so these task is not repeated for the same calls (and we might consider the matches in the current crsp_link
to have evaluated in this way).
To answer your last question—there are 8541 cases with same_permno
but not same_co_name
:
permco_co_name_summary
(9b2c190)same_permno | same_co_name | count |
---|---|---|
TRUE | TRUE | 66485 |
TRUE | FALSE | 8541 |
FALSE | TRUE | 469 |
FALSE | FALSE | 137 |
…so a manual identification for these seems infeasible.
I think you can reduce that number dramatically by grouping by co_name
and call_co_name
. I made a Google Sheets document here that I think would be pretty easy to go through. I did about a half of it, so if you could take a look at the rest ...
I just pushed the code that I used to make the underlying data.
Because I did not condition on the existence of a match, if we fill out the spreadsheet, we could evaluate whether doing so will flag mis-matches based on what we have in streetevents.crsp_link
. And because I retained the matching file_name
values, it is easy to keep track of what has been checked.
To clarify, under what conditions should an entry be marked investiage=TRUE
? If co_name
and call_co_name
do not refer to the same company?
If based on the name there is a real risk that they refer to different companies. We will need to work out how to "investigate" but that should be fairly easy.
What should be done for the cases in which a company changed its name from one to the other? investigate=FALSE
?
And also cases where:
Ultimately, we are trying to map each call (file_name
) to permno
. This step is a piece of that puzzle.
If a company changes name, it doesn't change its permno
. So if we know that this is what happened (e.g., Kraft Foods become Mondelez), then set investigate
to FALSE
.
If the call is by a subsidiary, but the permno
is for the parent, then I think we don't want these to match. Set investigate
to TRUE
and, if confirmed, we can either manually "match" the file_name
to no permno
(if none exists) or to the subsidiary's permno
(if one exists).
If a company is merged with or acquired by another, then set investigate
to TRUE
. We will need to investigate and set the permno
correctly.
I will need to follow up with more details on the "investigation" process; for now, we can just flag cases for future investigation.
Thanks for the guidance. I've been going through my section of name_matches.csv
again to mark potential disparities with comments detailing the problem.
For the sake of thorough documentation (x
and y
refer to co_name
or call_co_name
interchangeably), these are the comments I'm using:
Comment | Investigate | Details |
---|---|---|
name change | FALSE |
Company changed its name from one to the other |
bad call_co_name parse | FALSE |
call_co_name was not parsed correctly, but the names refer to the same company |
illegible call_co_name parse | TRUE |
call_co_name was not parsed correctly and the correct name cannot be inferred |
x is subsidiary |
TRUE |
x is a subsidiary of y |
x formed from merger of y |
TRUE |
y merged with a different company to form x |
x acquired y |
TRUE |
Company in x acquired company in y |
missing x |
TRUE |
The field for x is empty |
Multiple comments for one entry are separated with semicolons.
@rudyardrichter
Thanks for the update. I think that the number of cases to investigate is quite manageable. I pushed some code (see link to commit above this comment).
I think the database table crsp.stocknames
could be a good resource for resolving issues.
Let me know if you have any questions.
-Ian
I am a little confused about the type 1 cases in investigation_cases.R:
# Type 1: Observations on crsp_link already.
# Here we need to work out which PERMNO is correct and add to
# manual_permno_matches spreadsheet
If the rows of name_matches.csv
already have the same permno
, as per match_compare_plus
in match_analysis.Rmd, then what does checking the correct permno
amount to here? Verifying that the shared permno
is correct?
Regarding the type 2 cases: how should the permno
from company_link
be verified? Is this where stocknames
comes in?
As for Type 2 cases:
# If the PERMNO is confirmed correct, then it may be easiest to change
# "investigate" to FALSE in the "investigation" Google Sheets document, perhaps
# noting there that the case has been investigated.
But it may make sense to check some cases with the same PERMNO in both tables. I'm not sure how good the earlier code was.
I think stocknames
will help a lot. Many name changes will be documented there. If the names match to the same PERMNO, then we are probably safe to confirm the match. It's really only when we need to match to a different PERMNO that a manual match is needed.
How should the permno
in company_link
be checked against stocknames
---given that stocknames
does not have an obvious choice to join by? If the permno
occurs in the same row as the ticker (or company name) corresponding to that company, does that confirm the match (in which case, I suppose that would be the appropriate variable to join by)?
How should the
permno
incompany_link
be checked againststocknames
---given thatstocknames
does not have an obvious choice to join by? If thepermno
occurs in the same row as the ticker (or company name) corresponding to that company, does that confirm the match (in which case, I suppose that would be the appropriate variable to join by)?
This will not be an automatic join, but confirming an addition to the manual-match spreadsheet.
It might be easier to address your question if you have an example of a match you are considering. But in general, if co_name
or call_co_name
(or variants thereon) are associated with the same permno
and the same ticker
, then I think that's a valid match. This can be implemented by either adding the match to the manual-match Google Sheets document or removing the investigate
flag from the investigation sheet.
This will not be an automatic join, but confirming an addition to the manual-match spreadsheet.
Right, I just meant in terms of being able to compare type_1_cases
or type_2_cases
against stocknames
without having to filter for each entry by hand.
I was wondering about the instruction in Type 2:
Here we need to see if the PERMNO on company_link is correct
For an example, take this entry in type_2_cases
:
co_name | call_co_name | num_calls | investigate | comment | file_name | permno |
---|---|---|---|---|---|---|
Hologic Inc | CYTYC | 1 | TRUE | call_co_name is subsidiary | 648370_T | 83251 |
Searching in stocknames
for CYTYC yields this entry:
permno | permco | namedt | nameenddt | cusip | ncusip | ticker | comnam | hexcd | exchcd | siccd | shrcd | shrcls | st_date | end_date | namedum |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
83251 | 14414 | 1996-03-08 | 2007-10-22 | 23294610 | 23294610 | CYTC | CYTYC CORP | 3 | 3 | 3820 | 11 | NA | 1996-03-29 | 2007-10-31 | 2 |
If the permno
for Hologic and the permno
for CYTYC are the same (83251), does that mean that we are safe to confirm the match?
We don't need to worry about this one, as it appears to be already a manual match on crsp_link
.
> company_link %>% filter(file_name=='648370_T')
Source: query [?? x 2]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
file_name permno
<chr> <int>
1 648370_T 83251
> crsp_link %>% filter(file_name=='648370_T')
Source: query [?? x 4]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
file_name permno match_type match_type_desc
<chr> <int> <int> <chr>
1 648370_T 83251 0 0. Manual matches
The issues are really when we have differences between crsp_link
and company_link
. To recap, company_link
is the "gold standard" of sorts here (though we should be open to the possibility of errors therein), but the process for creating it is lost to time, etc., and we have no way of extending it.
Thus we want to use the process of "investigation" outlined above to get crsp_link
to reflect the information in company_link
and the process for creating it. Then we can extend the table to cover file_name
values not covered by company_link
(this table was based on StreetEvents a couple of years ago).
Once we've sorted this out, hopefully we'll have a process for keeping crsp_link
up to date (and we can document accordingly). And after that, we will need to rope in another data source to link individual speakers on StreetEvents to a data set of executives (Equilar).
So a better example might be one where crsp_link
and company_link
disagree.
Building on code in investigation_cases.R
, here's what I see:
> crsp_link %>%
+ inner_join(company_link, by="file_name") %>%
+ filter(permno.x != permno.y) %>%
+ collect() %>%
+ left_join(investigation) %>%
+ count(investigate)
Joining, by = "file_name"
# A tibble: 3 x 2
investigate n
<lgl> <int>
1 FALSE 131
2 TRUE 6
3 NA 471
Focusing on the investigate
cases:
> crsp_link %>%
+ inner_join(company_link, by="file_name") %>%
+ filter(permno.x != permno.y) %>%
+ collect() %>%
+ left_join(investigation) %>%
+ filter(investigate)
Joining, by = "file_name"
# A tibble: 6 x 11
file_name permno.x match_type match_type_desc permno.y co_name call_co_name num_calls
<chr> <int> <int> <chr> <int> <chr> <chr> <int>
1 1742698_T 75993 2 2. Roll matches back & forward in StreetEvents 90984 Web.com Group Inc Website Pros, Inc. 3
2 1826108_T 75993 2 2. Roll matches back & forward in StreetEvents 90984 Web.com Group Inc Website Pros, Inc. 3
3 1902561_T 75993 2 2. Roll matches back & forward in StreetEvents 90984 Web.com Group Inc Website Pros, Inc. 3
4 4004472_T 84342 2 2. Roll matches back & forward in StreetEvents 88031 Monster Beverage Corp Hansen Natural Corp 3
5 4166998_T 84342 2 2. Roll matches back & forward in StreetEvents 88031 Monster Beverage Corp Hansen Natural Corp 3
6 4230684_T 84342 2 2. Roll matches back & forward in StreetEvents 88031 Monster Beverage Corp Hansen Natural Corp 3
# ... with 3 more variables: investigate <lgl>, investigated <chr>, comment <chr>
Digging deeper into the first case, it seems there was some kind of merger that occurred after the call and StreetEvents (unhelpfully) updated the ticker. So we need to fix this by manually matching
file_name=1742698_T
to permno=90984
.
> stocknames <- tbl_pg("crsp.stocknames")
> stocknames %>% filter(permno==75993L)
Source: query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date
<int> <int> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date>
1 75993 10311 1989-09-25 1995-04-09 94732Q10 98943210 ZEOS ZEOS INTERNATIONAL LTD 3 3 3570 11 <NA> 1989-09-29 2007-09-28
2 75993 10311 1995-04-10 2001-08-06 94732Q10 59510010 MUEI MICRON ELECTRONICS INC 3 3 3570 11 <NA> 1989-09-29 2007-09-28
3 75993 10311 2001-08-07 2003-08-03 94732Q10 45872710 INLD INTERLAND INC NEW 3 3 3570 11 <NA> 1989-09-29 2007-09-28
4 75993 10311 2003-08-04 2006-03-19 94732Q10 45872720 INLD INTERLAND INC NEW 3 3 3570 11 <NA> 1989-09-29 2007-09-28
5 75993 10311 2006-03-20 2007-09-28 94732Q10 94732Q10 WWWW WEB COM INC 3 3 3570 11 <NA> 1989-09-29 2007-09-28
# ... with 1 more variables: namedum <dbl>
> stocknames %>% filter(permno==90984L)
Source: query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<int> <int> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 90984 48782 2005-11-02 2008-06-08 94733A10 94769V10 WSPI WEBSITE PROS INC 3 3 7372 11 <NA> 2005-11-30 2014-06-30 2
2 90984 48782 2008-06-09 2008-10-26 94733A10 94769V10 WWWW WEBSITE PROS INC 3 3 7372 11 <NA> 2005-11-30 2014-06-30 2
3 90984 48782 2008-10-27 2014-06-30 94733A10 94733A10 WWWW WEB COM GROUP INC 3 3 7372 11 <NA> 2005-11-30 2014-06-30 2
> calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
> calls %>% filter(file_name=='1742698_T')
Source: query [?? x 9]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
file_path file_name ticker co_name
<chr> <chr> <chr> <chr>
1 StreetEvents_historical_backfill_through_May2013/dir_1/1742698_T.xml 1742698_T WWWW Web.com Group Inc
# ... with 5 more variables: call_desc <chr>, call_date <time>, city <chr>, call_type <int>, last_update <time>
> calls %>% filter(file_name=='1742698_T') %>% select(call_desc, call_date, ticker)
Source: query [?? x 3]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
call_desc call_date ticker
<chr> <time> <chr>
1 Q4 2007 Website Pros, Inc. Earnings Conference Call 2008-02-12 22:00:00 WWWW
An interesting aspect of this case is that there appear to be two permno
values associated with this call in crsp_link
:
> crsp_link %>% filter(file_name=='1742698_T')
Source: query [?? x 4]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]
file_name permno match_type match_type_desc
<chr> <int> <int> <chr>
1 1742698_T 90984 2 2. Roll matches back & forward in StreetEvents
2 1742698_T 75993 2 2. Roll matches back & forward in StreetEvents
So I think I addressed the six cases above by adding row to the manual match sheet.
> source('db_matches/import_manual_permno_matches.R')
Sheet successfully identified: "streetevents.manual_permno_matches"
Accessing worksheet titled 'manual_permno_matches'.
Downloading: 29 kB
No encoding supplied: defaulting to UTF-8.
> system("psql -f db_matches/crsp_link.sql")
DROP TABLE
SELECT 275863
ALTER TABLE
CREATE INDEX
> crsp_link %>%
+ inner_join(company_link, by="file_name") %>%
+ filter(permno.x != permno.y) %>%
+ collect() %>%
+ left_join(investigation) %>%
+ filter(investigate) %>%
+ select(file_name)
Joining, by = "file_name"
# A tibble: 0 x 1
# ... with 1 variables: file_name <chr>
@rudyardrichter:
(CC @ azakolyukina)
Do you have an account (e.g., Gmail address) through which I can share a Google Sheets document?
-Ian