iangow commented 8 years ago

@rudyardrichter:

(CC @ azakolyukina)

Do you have an account (e.g., Gmail address) through which I can share a Google Sheets document?

-Ian

iangow commented 8 years ago

This document was produced using code in match_analysis.Rmd. (To make .md version, run knit("match_analysis.Rmd") followed by pandoc('match_analysis.md', format='markdown_github')))

Matching StreetEvents to CRSP

Our goal is to match StreetEvents conference calls to PERMNOs on CRSP. StreetEvents conferences calls are identified (pretty much) by their file_name values.

We have an "automatic" match created by code in crsp_link.sql and stored in the database table streetevents.crsp_link. I put "automatic" in quotes because this code does leverage manual matches in a Google Sheets document.

We also have a table that represents our current definitive mapping in streetevents.company_link. But this table is out of date and we need to update it. Unfortunately, the process by which it was created is not well understood.

So the first order of business is to examine differences between streetevents.crsp_link and streetevents.company_link:

Are the matches in streetevents.company_link generally better?
Are differences in PERMNO matches to unrelated firms? Or to the same firm (i.e., same permco) but different security (i.e., different permno).
- If the latter, are we getting the "right" permno in general in streetevents.company_link?
Can we come up with a way of flagging cases that need correction?
- Differences between streetevents.company_link and streetevents.crsp_link will often represent bad matches in streetevents.company_link that we identified and corrected. How can we identify these? One known issue is the tendency of ThomsonReuters (the provider of StreetEvents) to "backdate" ticker and company names for acquired companies. These can usually be identified by discrepancies between co_name and call_desc.

# Use Sys.setenv(PGUSER = "rudyardrichter", PGPASSWORD = "xxxxxx") if necessary.
pg <- src_postgres(host = "aaz.chicagobooth.edu", dbname = "postgres")

tbl_pg <- function(table) {
    tbl(pg, sql(paste0("SELECT * FROM ", table)))
}

# Run code in streetevents/crsp_link.sql to create
# streetevents.crsp_link
crsp_link <- tbl_pg("streetevents.crsp_link")
company_link <- tbl_pg("streetevents.company_link")
manual_permno_matches <- tbl_pg("streetevents.manual_permno_matches")
calls <- tbl_pg("streetevents.calls")
stocknames <- tbl_pg("crsp.stocknames")

StreetEvents data

Here's a small snippet of data from the streetevents.calls. (More detailed information about the calls (what was said, by whom, etc.) is found in streetevents.speaker_data.)

calls %>% 
    filter(call_type==1L, !is.na(call_date)) %>% 
    select(file_name, ticker, co_name, call_desc, call_date) %>% 
    top_n(10) %>%
    mutate(call_date=sql("call_date::date")) %>%
    kable()

## Selecting by call_date

file_name	ticker	co_name	call_desc	call_date
6044956_T	FC	Franklin Covey Co	Q3 2016 Franklin Covey Co Earnings Call	2016-06-30
6038410_T	MU	Micron Technology Inc	Q3 2016 Micron Technology Inc Earnings Call	2016-06-30
6040656_T	SCHN	Schnitzer Steel Industries Inc	Q3 2016 Schnitzer Steel Industries Inc Earnings Call	2016-06-30
6043433_T	LNN	Lindsay Corp	Q3 2016 Lindsay Corp Earnings Call	2016-06-30
6035947_T	STZ	Constellation Brands Inc	Q1 2017 Constellation Brands Inc Earnings Call	2016-06-30
6044926_T	IRET	Investors Real Estate Trust	Q4 2016 Investors Real Estate Trust Earnings Call	2016-06-30
6044900_T	PAYX	Paychex Inc	Q4 2016 Paychex Inc Earnings Call	2016-06-30
5980908_T	CAG	ConAgra Foods Inc	Q4 2016 ConAgra Foods Inc Earnings Call	2016-06-30
6050288_T	SBLK	Star Bulk Carriers Corp	Q1 2016 Star Bulk Carriers Corp Earnings Call	2016-06-30
5975801_T	DRI	Darden Restaurants Inc	Q4 2016 Darden Restaurants Inc Earnings Call	2016-06-30

Comparing `streetevents.crsp_link` and `streetevents.company_link`

Let's join these two tables using file_name to see how often we get the same permno in both tables.

# How many firms have the same PERMNO as identified with
# the ticker match?
match_compare <-
    crsp_link %>%
    inner_join(company_link, by="file_name") %>%
    mutate(same_permno=permno.x==permno.y) %>%
    compute()

It seems we get the same match in 99% of cases.

Here are the raw numbers:

match_compare %>%
    group_by(same_permno) %>%
    summarize(count=n()) %>%
    kable()

same_permno	count
NA	3
TRUE	75096
FALSE	608

To facilitate examination of the cases where the two tables differ, I saved data to CSV, which I then made into a Google Sheets document here.

library(readr)
match_compare %>%
    filter(!same_permno) %>%
    inner_join(calls) %>%
    as.data.frame() %>%
    write_csv("permno_match_diffs.csv")

## Joining, by = "file_name"

iangow commented 8 years ago

Running the code above is easiest with a recent version of RStudio. After installing that, clone this repository and then open the .Rproj file in the repository's root directory.

iangow commented 8 years ago

@rudyardrichter We may need to fix permissions for you to be able to run the code you mentioned. But I believe the tables have been updated recently, so it shouldn't necessary at this stage. I could take a closer look tomorrow.

rudyardrichter commented 8 years ago

@iangow: Sorry, I figured out what the problem was there. I'm just reading through everything for now; I will probably have some questions about the issue soon, if you don't mind.

iangow commented 8 years ago

No problem. Just ask here (I get notified of any comments under this issue).

rudyardrichter commented 8 years ago

@iangow Some initial questions:

What is the metric of quality for matches? Is it that lower match_type is better?
What is permno?
So StreetEvents calls are referred to by file_name— then, does a given file_name relate to co_name though its row in streetevents.calls?

Thanks for helping me get caught up on this.

iangow commented 8 years ago

What is the metric of quality for matches? Is it that lower match_type is better?

There is no metric, but the code tends to look for less-ambiguous matches first and the numbering reflects that.

What is permno?

The main firm (actually security) identifier on CRSP. Take a look at the table crsp.company for more insight.

So StreetEvents calls are referred to by file_name— then, does a given file_name relate to co_name though its row in streetevents.calls?

More or less.

The main task for now is to take a sample of the 608 areas of disagreement and identify ways that we could flag such cases _without using data from streetevents.company_link_.

The table streetevents.company_link is a "gold standard" of sorts, but we did not do a great job of documenting how we created it. We want to extend the logic underlying its creation to the tens of thousands of new entries we have on StreetEvents since that table was created.

rudyardrichter commented 8 years ago

@iangow:

The access request for the Google sheet I sent you is for the sheet you mentioned in the issue statement (towards the end).

Relatedly, it looks like I will need permission to access the crsp schema:

postgres=> SELECT * FROM crsp.company;
ERROR:  permission denied for schema crsp

Though, I don't see company in the list of tables for the crsp schema. (I still get the permission error on the tables that are definitely there.)

iangow commented 8 years ago

Sorry, it's comp.company. I meant crsp.stocknames. Similar tables, different sources (CRSP versus Compustat).

This should cover the permissions issue: GRANT crsp_basic TO rudyardrichter ;.

rudyardrichter commented 8 years ago

Okay, thanks.

When I run that command I get: ERROR: must have admin option on role "crsp_basic"

iangow commented 8 years ago

Of course. I ran that command to give you access.

iangow commented 8 years ago

I just put it here so I had a record of what I did. I don't really have a systematic approach to managing permissions on Nastia's database.

rudyardrichter commented 8 years ago

Ah, I see. Seems to be working now—thank you!

rudyardrichter commented 8 years ago

From 7bd4da6:

Most, but not all cases with differing permno have the same permco—diff_permco_summary:

same_permco	count
FALSE	117
TRUE	491

rudyardrichter commented 8 years ago

Is there any documentation on the significance of the data in crsp.stocknames (e.g. namedum, hexcd)? Moreover would these potentially have any relevance to the issue in the first place?

iangow commented 8 years ago

If so, I think we should focus on the 117 cases with different PERMCOs. Are there any indicators that would work to flag those cases? For example, does the call_desc suggest the company's name was changed?

iangow commented 8 years ago

I think the documentation of crsp.stocknames might be relevant for the same-PERMCO cases. But I suspect the different-PERMCO cases may be the ones to focus on first.

rudyardrichter commented 8 years ago

Some of the 117 cases with both a different permno and a different permco have disagreeing co_name and call_desc—for instance there are several instances of the following:

`co_name`	`call_desc`
Web.com Group Inc	...Interland, Inc. Earnings Conference Call
Silicon Graphics International Corp	...Rackable Systems, Inc....
Crestwood Midstream Partners LP	...QUICKSILVER GAS SERVICES LP...

See the table diff_permco_calls from my most recent commit, 7ad6487.

However, some of these disagreements for same_permco==False entries are not as blatant, for instance:

`co_name`	`call_desc`
Universal American Corp	Q4 2005 Universal American Financial Corp. Earnings

And there are entries with same_permco==True yet having similar small differences:

`co_name`	`call_desc`
Haverty Furniture Companies Inc	Q3 2005 HAVERTY FURNITURE INC Earnings Conference Call

iangow commented 8 years ago

We have some code to extract the "Interland, Inc." portion of call_desc and compare it with co_name. Perhaps this would be helpful. I think we'd want to be fairly conservative so that the Haverty and Universal cases are not flagged (looking at those names, I think some other issue must've caused us to pick a different PERMNO for these.

In summary, we have three cases right now.

Same PERMCO, different PERMNO. Ignore for now. We will later need to work out how to choose the PERMNO (if necessary to choose).
Different PERMNO and name in call_desc does not match co_name. We need to work out first how to flag these, then how to fix them (it may be as simple as manually matching names).
Different PERMNO, name in call_desc does not matches (approximately) co_name. Can we deduce how we matched these? Are these found in the Google Sheets document as manual matches?

rudyardrichter commented 8 years ago

Regarding 3—it appears that none of the file_names for cases with different permno appear in manual_permno_matches:

diff_permco_calls_manual <-
    diff_permco_calls %>%
    inner_join(manual_permno_matches, by="file_name")

> count(diff_permco_calls_manual)
...
0

(From b4beb4a.)

rudyardrichter commented 8 years ago

Is there a way to source SQL code/functions like in db_matches/extract_conm.sql for use in dplyr functions?

iangow commented 8 years ago

Sorry for the delay. I am traveling right now.

This would work:

Sys.setenv(PGHOST = "aaz.chicagobooth.edu", PGDATABASE = "postgres")
query <- paste(readLines("db_matches/extract_conm.sql"), collapse="\n")
library(dplyr)
pg <- src_postgres()
library(RPostgreSQL)
dbGetQuery(pg$con, sql(query))

rudyardrichter commented 8 years ago

Is that code working for you? I keep getting the following error:

Error in postgresqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  permission denied for schema public
)
NULL
Warning message:
In postgresqlQuickSQL(conn, statement, ...) :
  Could not create executeCREATE OR REPLACE FUNCTION extract_conm(call_desc text)
RETURNS text AS $$
    DECLARE
        regex text;
        matches text[];
    BEGIN
        -- Get text between e.g., "2004" or "2005/06" and "Results" or "Earnings"
        regex := '(?:[0-9]{4}|[0-9]{4}/[0-9]{2})(.*)(?=Results|Earnings)';
        matches := regexp_matches(call_desc, regex);
        RETURN matches[1];
    END;
$$ LANGUAGE plpgsql;

iangow commented 8 years ago

This may fix it:

postgres=# GRANT CREATE ON SCHEMA public TO rudyardrichter ;
GRANT

iangow commented 8 years ago

So try again.

rudyardrichter commented 8 years ago

Now instead of permission denied for schema public I get must be owner of function extract_conm

rudyardrichter commented 8 years ago

Would it be better if I just rewrote the extract_conm function in R?

iangow commented 8 years ago

Sure. You could just use PostgreSQL functions for the current version:

SELECT call_desc, 
    UNNEST(regexp_matches(call_desc, '(?:[0-9]{4}|[0-9]{4}/[0-9]{2})(.*)(?=Results|Earnings)'))
FROM streetevents.calls
LIMIT 10;

iangow commented 8 years ago

Or, using dplyr:

Sys.setenv(PGHOST = "aaz.chicagobooth.edu", PGDATABASE = "postgres")

library(dplyr)
pg <- src_postgres()

regex <- '(?:[0-9]{4}|[0-9]{4}/[0-9]{2})(.*)(?=Results|Earnings)'

calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))

calls %>%
    mutate(co_name_matches=regexp_matches(call_desc, regex)) %>%
    mutate(co_name_alt=sql("trim(both from co_name_matches[1])")) %>%
    select(file_name, call_desc, co_name_alt, call_desc)

rudyardrichter commented 8 years ago

Of the cases with both different permno and different permco, most (but not all) have a mismatch between co_name and the company name extracted from the call description, call_co_name:

`diff_permco_co_names_summary`

same_co_name	count
FALSE	107
TRUE	9

However, the function to compare the company name with the company name in the call description (compareNames, 64c85e0) is rather lenient, ignoring most punctuation as well as "Inc", "Ltd", "Co", etc.; there are only 3 cases with an exact match (company "Westinghouse Air Brake Technologies Corp").

The cases with different permno and permco boil down to 22 company name vs. call description company name matchings:

`diff_permco_distinct_co_names`

co_name	call_co_name	same_co_name
Accelrys Inc	Pharmacopeia	FALSE
Caesars Entertainment Corp	Park Place Entertainment	FALSE
Crestwood Midstream Partners LP	QUICKSILVER GAS SERVICES LP	FALSE
IMMC Corp	Immunicon Corporation	FALSE
Monster Beverage Corp	Hansen Natural	FALSE
Monster Beverage Corp	Hansen Natural Corp	FALSE
OfficeMax Inc	Boise Cascade	FALSE
Pilgrims Pride Corp	Pilgrim`s Pride	TRUE
Silicon Graphics International Corp	Rackable Systems, Inc.	FALSE
Six Flags Entertainment Corp	Six Flags, Inc.	FALSE
TGC Industries Inc	TGC	FALSE
TGC Industries Inc	TGC Industries Inc	TRUE
TGC Industries Inc	TGC Industries	TRUE
The Wendy's Co	Wendy's International	FALSE
The Wendy's Co	Wendy`s International	FALSE
Universal American Corp	Universal American Financial Corp.	FALSE
Universal American Corp	Universal American	TRUE
Universal American Corp	Universal American Financial	FALSE
Web.com Group Inc	Interland, Inc.	FALSE
Web.com Group Inc	Website Pros, Inc.	FALSE
Web.com Group Inc	Interland	FALSE
Westinghouse Air Brake Technologies Corp	Westinghouse Air Brake Technologies Corp	TRUE

I can alo, of course, modify the behavior of compareNames so that, for instance, the Universal American Corp cases all match.

rudyardrichter commented 8 years ago

There must be a different issue with the handful of cases with differing permno and permco but exact company name matches. Given that the number of such cases is very small it seems that these could potentially be added as manual matches. (However, I would imagine we would nonetheless want to make sure that the same issue is not contributing to the other cases with differing permno and permco.)

rudyardrichter commented 8 years ago

Another observation about the cases with different permno: the ones with same permco tend to have match_type 1, while the cases with different permco have match_type 2, 5, or 7.

`same_permco_match_type_summary`

match_type	count
1	471
2	20

`diff_permco_match_type_summary`

match_type	count
2	78
5	3
7	36

(These are from fd074a4.)

rudyardrichter commented 8 years ago

Also for reference, match_type_permno_permco_summary (from 0200f54) compares the number of entries having same_permno and/or same_permco for each match_type. The only match_type which do not have same_permno and same_permco are 1, 2, 5, and 7.

`match_type_permno_permco_summary`

match_type	same_permno	same_permco	count
0	TRUE	TRUE	5147
0	TRUE	FALSE	0
0	FALSE	TRUE	0
0	FALSE	FALSE	0
1	TRUE	TRUE	461756
1	TRUE	FALSE	0
1	FALSE	TRUE	6554
1	FALSE	FALSE	0
2	TRUE	TRUE	44199
2	TRUE	FALSE	0
2	FALSE	TRUE	502
2	FALSE	FALSE	1298
3	TRUE	TRUE	1061
3	TRUE	FALSE	0
3	FALSE	TRUE	0
3	FALSE	FALSE	0
4	TRUE	TRUE	12
4	TRUE	FALSE	0
4	FALSE	TRUE	0
4	FALSE	FALSE	0
5	TRUE	TRUE	899
5	TRUE	FALSE	0
5	FALSE	TRUE	0
5	FALSE	FALSE	48
7	TRUE	TRUE	3125
7	TRUE	FALSE	0
7	FALSE	TRUE	0
7	FALSE	FALSE	1354
8	TRUE	TRUE	1
8	TRUE	FALSE	0
8	FALSE	TRUE	0
8	FALSE	FALSE	0
9	TRUE	TRUE	1481
9	TRUE	FALSE	0
9	FALSE	TRUE	0
9	FALSE	FALSE	0
10	TRUE	TRUE	8
10	TRUE	FALSE	0
10	FALSE	TRUE	0
10	FALSE	FALSE	0

iangow commented 8 years ago

There must be a different issue with the handful of cases with differing permno and permco but exact company name matches. Given that the number of such cases is very small it seems that these could potentially be added as manual matches. (However, I would imagine we would nonetheless want to make sure that the same issue is not contributing to the other cases with differing permno and permco.)

The question is: How can we identify these cases in the first place? What triggered the decision to override the initial match in the first place? Then, if these cases are not manual matches, how were they matched to permno/permco values?

From my comment above:

In summary, we have three cases right now.

Same PERMCO, different PERMNO. Ignore for now. We will later need to work out how to choose the PERMNO (if necessary to choose).

Different PERMNO and name in call_desc does not match co_name. We need to work out first how to flag these, then how to fix them (it may be as simple as manually matching names).

Different PERMNO, name in call_desc does not matches (approximately) co_name. Can we deduce how we matched these? Are these found in the Google Sheets document as manual matches?

Oops. Item 3 should be " Different PERMNO, name in call_desc ~~does not~~ matches (approximately) co_name." Clearly, if the company name matches, some other basis was used to replace the initially matched permno.

Regarding item 2, I wonder how many "false positives" the code you're testing would produce. How many cases have the same PERMNO, but same_co_name==FALSE? If there are many, would it be feasible to go through and identify ones that do not require further investigation? (For example, a Google Sheets document could be created with co_name and call_co_name and an investigate column. If there are thousands of such cases, I think that would be too much work, but a few hundred might be pretty easy to go through. It would be necessary to have a way of keeping track of cases (file_name or co_name values) that have been evaluated in this way so these task is not repeated for the same calls (and we might consider the matches in the current crsp_link to have evaluated in this way).

rudyardrichter commented 8 years ago

To answer your last question—there are 8541 cases with same_permno but not same_co_name:

`permco_co_name_summary` (9b2c190)

same_permno	same_co_name	count
TRUE	TRUE	66485
TRUE	FALSE	8541
FALSE	TRUE	469
FALSE	FALSE	137

…so a manual identification for these seems infeasible.

iangow commented 8 years ago

I think you can reduce that number dramatically by grouping by co_name and call_co_name. I made a Google Sheets document here that I think would be pretty easy to go through. I did about a half of it, so if you could take a look at the rest ...

I just pushed the code that I used to make the underlying data.

Because I did not condition on the existence of a match, if we fill out the spreadsheet, we could evaluate whether doing so will flag mis-matches based on what we have in streetevents.crsp_link. And because I retained the matching file_name values, it is easy to keep track of what has been checked.

rudyardrichter commented 8 years ago

To clarify, under what conditions should an entry be marked investiage=TRUE? If co_name and call_co_name do not refer to the same company?

iangow commented 8 years ago

If based on the name there is a real risk that they refer to different companies. We will need to work out how to "investigate" but that should be fairly easy.

rudyardrichter commented 8 years ago

What should be done for the cases in which a company changed its name from one to the other? investigate=FALSE?

And also cases where:

One name refers to a parent company, and the other to a child company
Company merged with or acquired by another

iangow commented 8 years ago

Ultimately, we are trying to map each call (file_name) to permno. This step is a piece of that puzzle.

If a company changes name, it doesn't change its permno. So if we know that this is what happened (e.g., Kraft Foods become Mondelez), then set investigate to FALSE.

If the call is by a subsidiary, but the permno is for the parent, then I think we don't want these to match. Set investigate to TRUE and, if confirmed, we can either manually "match" the file_name to no permno (if none exists) or to the subsidiary's permno (if one exists).

If a company is merged with or acquired by another, then set investigate to TRUE. We will need to investigate and set the permno correctly.

I will need to follow up with more details on the "investigation" process; for now, we can just flag cases for future investigation.

rudyardrichter commented 8 years ago

Thanks for the guidance. I've been going through my section of name_matches.csv again to mark potential disparities with comments detailing the problem.

For the sake of thorough documentation (x and y refer to co_name or call_co_name interchangeably), these are the comments I'm using:

Comment	Investigate	Details
name change	`FALSE`	Company changed its name from one to the other
bad call_co_name parse	`FALSE`	call_co_name was not parsed correctly, but the names refer to the same company
illegible call_co_name parse	`TRUE`	call_co_name was not parsed correctly and the correct name cannot be inferred
`x` is subsidiary	`TRUE`	`x` is a subsidiary of `y`
`x` formed from merger of `y`	`TRUE`	`y` merged with a different company to form `x`
`x` acquired `y`	`TRUE`	Company in `x` acquired company in `y`
missing `x`	`TRUE`	The field for `x` is empty

Multiple comments for one entry are separated with semicolons.

iangow commented 8 years ago

@rudyardrichter

Thanks for the update. I think that the number of cases to investigate is quite manageable. I pushed some code (see link to commit above this comment).

I think the database table crsp.stocknames could be a good resource for resolving issues.

Let me know if you have any questions.

-Ian

rudyardrichter commented 8 years ago

I am a little confused about the type 1 cases in investigation_cases.R:

# Type 1: Observations on crsp_link already.
# Here we need to work out which PERMNO is correct and add to
# manual_permno_matches spreadsheet

If the rows of name_matches.csv already have the same permno, as per match_compare_plus in match_analysis.Rmd, then what does checking the correct permno amount to here? Verifying that the shared permno is correct?

Regarding the type 2 cases: how should the permno from company_link be verified? Is this where stocknames comes in?

iangow commented 8 years ago

Type 1:

As for Type 2 cases:

# If the PERMNO is confirmed correct, then it may be easiest to change
# "investigate" to FALSE in the "investigation" Google Sheets document, perhaps
# noting there that the case has been investigated.

But it may make sense to check some cases with the same PERMNO in both tables. I'm not sure how good the earlier code was.

Type 2:

I think stocknames will help a lot. Many name changes will be documented there. If the names match to the same PERMNO, then we are probably safe to confirm the match. It's really only when we need to match to a different PERMNO that a manual match is needed.

rudyardrichter commented 8 years ago

How should the permno in company_link be checked against stocknames---given that stocknames does not have an obvious choice to join by? If the permno occurs in the same row as the ticker (or company name) corresponding to that company, does that confirm the match (in which case, I suppose that would be the appropriate variable to join by)?

iangow commented 8 years ago

How should the permno in company_link be checked against stocknames---given that stocknames does not have an obvious choice to join by? If the permno occurs in the same row as the ticker (or company name) corresponding to that company, does that confirm the match (in which case, I suppose that would be the appropriate variable to join by)?

This will not be an automatic join, but confirming an addition to the manual-match spreadsheet.

It might be easier to address your question if you have an example of a match you are considering. But in general, if co_name or call_co_name (or variants thereon) are associated with the same permno and the same ticker, then I think that's a valid match. This can be implemented by either adding the match to the manual-match Google Sheets document or removing the investigate flag from the investigation sheet.

rudyardrichter commented 8 years ago

This will not be an automatic join, but confirming an addition to the manual-match spreadsheet.

Right, I just meant in terms of being able to compare type_1_cases or type_2_cases against stocknames without having to filter for each entry by hand.

I was wondering about the instruction in Type 2:

Here we need to see if the PERMNO on company_link is correct

For an example, take this entry in type_2_cases:

co_name	call_co_name	num_calls	investigate	comment	file_name	permno
Hologic Inc	CYTYC	1	TRUE	call_co_name is subsidiary	648370_T	83251

Searching in stocknames for CYTYC yields this entry:

permno	permco	namedt	nameenddt	cusip	ncusip	ticker	comnam	hexcd	exchcd	siccd	shrcd	shrcls	st_date	end_date	namedum
83251	14414	1996-03-08	2007-10-22	23294610	23294610	CYTC	CYTYC CORP	3	3	3820	11	NA	1996-03-29	2007-10-31	2

If the permno for Hologic and the permno for CYTYC are the same (83251), does that mean that we are safe to confirm the match?

iangow commented 8 years ago

We don't need to worry about this one, as it appears to be already a manual match on crsp_link.

> company_link %>% filter(file_name=='648370_T')
Source:   query [?? x 2]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

  file_name permno
      <chr>  <int>
1  648370_T  83251
> crsp_link %>% filter(file_name=='648370_T')
Source:   query [?? x 4]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

  file_name permno match_type   match_type_desc
      <chr>  <int>      <int>             <chr>
1  648370_T  83251          0 0. Manual matches

The issues are really when we have differences between crsp_link and company_link. To recap, company_link is the "gold standard" of sorts here (though we should be open to the possibility of errors therein), but the process for creating it is lost to time, etc., and we have no way of extending it.

Thus we want to use the process of "investigation" outlined above to get crsp_link to reflect the information in company_link and the process for creating it. Then we can extend the table to cover file_name values not covered by company_link (this table was based on StreetEvents a couple of years ago).

Once we've sorted this out, hopefully we'll have a process for keeping crsp_link up to date (and we can document accordingly). And after that, we will need to rope in another data source to link individual speakers on StreetEvents to a data set of executives (Equilar).

iangow commented 8 years ago

So a better example might be one where crsp_link and company_link disagree.

Building on code in investigation_cases.R, here's what I see:

> crsp_link %>% 
+     inner_join(company_link, by="file_name") %>%
+     filter(permno.x != permno.y) %>% 
+     collect() %>% 
+     left_join(investigation) %>%
+     count(investigate)
Joining, by = "file_name"
# A tibble: 3 x 2
  investigate     n
        <lgl> <int>
1       FALSE   131
2        TRUE     6
3          NA   471

Focusing on the investigate cases:

> crsp_link %>% 
+     inner_join(company_link, by="file_name") %>%
+     filter(permno.x != permno.y) %>% 
+     collect() %>% 
+     left_join(investigation) %>%
+     filter(investigate)
Joining, by = "file_name"
# A tibble: 6 x 11
  file_name permno.x match_type                                match_type_desc permno.y               co_name        call_co_name num_calls
      <chr>    <int>      <int>                                          <chr>    <int>                 <chr>               <chr>     <int>
1 1742698_T    75993          2 2. Roll matches back & forward in StreetEvents    90984     Web.com Group Inc  Website Pros, Inc.         3
2 1826108_T    75993          2 2. Roll matches back & forward in StreetEvents    90984     Web.com Group Inc  Website Pros, Inc.         3
3 1902561_T    75993          2 2. Roll matches back & forward in StreetEvents    90984     Web.com Group Inc  Website Pros, Inc.         3
4 4004472_T    84342          2 2. Roll matches back & forward in StreetEvents    88031 Monster Beverage Corp Hansen Natural Corp         3
5 4166998_T    84342          2 2. Roll matches back & forward in StreetEvents    88031 Monster Beverage Corp Hansen Natural Corp         3
6 4230684_T    84342          2 2. Roll matches back & forward in StreetEvents    88031 Monster Beverage Corp Hansen Natural Corp         3
# ... with 3 more variables: investigate <lgl>, investigated <chr>, comment <chr>

Digging deeper into the first case, it seems there was some kind of merger that occurred after the call and StreetEvents (unhelpfully) updated the ticker. So we need to fix this by manually matching file_name=1742698_T to permno=90984.

> stocknames <- tbl_pg("crsp.stocknames")
> stocknames %>% filter(permno==75993L)
Source:   query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

  permno permco     namedt  nameenddt    cusip   ncusip ticker                 comnam hexcd exchcd siccd shrcd shrcls    st_date   end_date
   <int>  <int>     <date>     <date>    <chr>    <chr>  <chr>                  <chr> <dbl>  <dbl> <dbl> <dbl>  <chr>     <date>     <date>
1  75993  10311 1989-09-25 1995-04-09 94732Q10 98943210   ZEOS ZEOS INTERNATIONAL LTD     3      3  3570    11   <NA> 1989-09-29 2007-09-28
2  75993  10311 1995-04-10 2001-08-06 94732Q10 59510010   MUEI MICRON ELECTRONICS INC     3      3  3570    11   <NA> 1989-09-29 2007-09-28
3  75993  10311 2001-08-07 2003-08-03 94732Q10 45872710   INLD      INTERLAND INC NEW     3      3  3570    11   <NA> 1989-09-29 2007-09-28
4  75993  10311 2003-08-04 2006-03-19 94732Q10 45872720   INLD      INTERLAND INC NEW     3      3  3570    11   <NA> 1989-09-29 2007-09-28
5  75993  10311 2006-03-20 2007-09-28 94732Q10 94732Q10   WWWW            WEB COM INC     3      3  3570    11   <NA> 1989-09-29 2007-09-28
# ... with 1 more variables: namedum <dbl>
> stocknames %>% filter(permno==90984L)
Source:   query [?? x 16]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

  permno permco     namedt  nameenddt    cusip   ncusip ticker            comnam hexcd exchcd siccd shrcd shrcls    st_date   end_date namedum
   <int>  <int>     <date>     <date>    <chr>    <chr>  <chr>             <chr> <dbl>  <dbl> <dbl> <dbl>  <chr>     <date>     <date>   <dbl>
1  90984  48782 2005-11-02 2008-06-08 94733A10 94769V10   WSPI  WEBSITE PROS INC     3      3  7372    11   <NA> 2005-11-30 2014-06-30       2
2  90984  48782 2008-06-09 2008-10-26 94733A10 94769V10   WWWW  WEBSITE PROS INC     3      3  7372    11   <NA> 2005-11-30 2014-06-30       2
3  90984  48782 2008-10-27 2014-06-30 94733A10 94733A10   WWWW WEB COM GROUP INC     3      3  7372    11   <NA> 2005-11-30 2014-06-30       2
> calls <- tbl(pg, sql("SELECT * FROM streetevents.calls"))
> calls %>% filter(file_name=='1742698_T')
Source:   query [?? x 9]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

                                                             file_path file_name ticker           co_name
                                                                 <chr>     <chr>  <chr>             <chr>
1 StreetEvents_historical_backfill_through_May2013/dir_1/1742698_T.xml 1742698_T   WWWW Web.com Group Inc
# ... with 5 more variables: call_desc <chr>, call_date <time>, city <chr>, call_type <int>, last_update <time>
> calls %>% filter(file_name=='1742698_T') %>% select(call_desc, call_date, ticker)
Source:   query [?? x 3]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

                                            call_desc           call_date ticker
                                                <chr>              <time>  <chr>
1 Q4 2007 Website Pros, Inc. Earnings Conference Call 2008-02-12 22:00:00   WWWW

An interesting aspect of this case is that there appear to be two permno values associated with this call in crsp_link:

> crsp_link %>% filter(file_name=='1742698_T')
Source:   query [?? x 4]
Database: postgres 9.4.2 [igow@aaz.chicagobooth.edu:5432/postgres]

  file_name permno match_type                                match_type_desc
      <chr>  <int>      <int>                                          <chr>
1 1742698_T  90984          2 2. Roll matches back & forward in StreetEvents
2 1742698_T  75993          2 2. Roll matches back & forward in StreetEvents

iangow commented 8 years ago

So I think I addressed the six cases above by adding row to the manual match sheet.

> source('db_matches/import_manual_permno_matches.R')
Sheet successfully identified: "streetevents.manual_permno_matches"
Accessing worksheet titled 'manual_permno_matches'.
Downloading: 29 kB      
No encoding supplied: defaulting to UTF-8.
> system("psql -f db_matches/crsp_link.sql")
DROP TABLE
SELECT 275863
ALTER TABLE
CREATE INDEX
> crsp_link %>% 
+     inner_join(company_link, by="file_name") %>%
+     filter(permno.x != permno.y) %>% 
+     collect() %>% 
+     left_join(investigation) %>%
+     filter(investigate) %>%
+     select(file_name)
Joining, by = "file_name"
# A tibble: 0 x 1
# ... with 1 variables: file_name <chr>

iangow / streetevents_private

Identify issues in StreetEvents-CRSP links. #1

Matching StreetEvents to CRSP

StreetEvents data

Comparing `streetevents.crsp_link` and `streetevents.company_link`

`diff_permco_co_names_summary`

`diff_permco_distinct_co_names`

`same_permco_match_type_summary`

`diff_permco_match_type_summary`

`match_type_permno_permco_summary`

`permco_co_name_summary` (9b2c190)

Type 1:

Type 2:

iangow / streetevents_private

Identify issues in StreetEvents-CRSP links. #1

Matching StreetEvents to CRSP

StreetEvents data

Comparing streetevents.crsp_link and streetevents.company_link

diff_permco_co_names_summary

diff_permco_distinct_co_names

same_permco_match_type_summary

diff_permco_match_type_summary

match_type_permno_permco_summary

permco_co_name_summary (9b2c190)

Type 1:

Type 2:

Comparing `streetevents.crsp_link` and `streetevents.company_link`

`diff_permco_co_names_summary`

`diff_permco_distinct_co_names`

`same_permco_match_type_summary`

`diff_permco_match_type_summary`

`match_type_permno_permco_summary`

`permco_co_name_summary` (9b2c190)