Handle filings with multiple CUSIPs

mccgr / edgar

Code to manage data related to SEC EDGAR

31 stars 15 forks source link

Handle filings with multiple CUSIPs #77

Closed iangow closed 4 years ago

iangow commented 4 years ago

I believe that there will only be one valid CUSIP in any given filing.

In this case, one CUSIP is valid and the other isn't. Perhaps the right one is found earlier, but this may be difficult to implement with the current approach. Perhaps one found using more methods or found more times is the one to keep.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik_test")

cusip_cik %>%
  filter(cusip=="929903102") %>% 
  count(cik) %>% 
  arrange(desc(n))
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#>        cik n      
#>      <int> <int64>
#>  1 1209466 324    
#>  2 1100663 184    
#>  3 1064642 114    
#>  4 1350487  88    
#>  5   52848  84    
#>  6  930667  50    
#>  7 1064641  46    
#>  8 1378872  42    
#>  9 1125816  38    
#> 10  893818  34    
#> # … with more rows

cusip_cik %>% 
  filter(file_name == "edgar/data/1071342/0001074683-03-000146.txt") %>%
  select(-file_name)
#> # Source:   lazy query [?? x 5]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   cusip     check_digit     cik company_name formats
#>   <chr>           <int>   <int> <chr>        <chr>  
#> 1 929903102           2 1071342 NETSOLVE INC D      
#> 2 64115J106           6 1071342 NETSOLVE INC ABC

^{Created on 2020-06-19 by the reprex package (v0.3.0)}

bdcallen commented 4 years ago

@iangow Yes, I've been aware of there being multiple cusips in some filings for a while

I believe that there will only be one valid CUSIP in any given filing.

I think this is strictly not true. There are other cases, such as this one, where both cusips are displayed in the title section, and there are two names in the security name section for each one, which seems to imply that both are legitimate. I would agree, however, that there should only be one valid cusip for each security mentioned in a filing.

Then again, there are the cases like the one you mentioned above, where a filer incomprehensively states an old or invalid cusip for a security, along with the correct one, which is rather frustrating. My hunch is that this problem could be difficult to solve, for the simple reason that some firms are probably not that consistent with where they put the correct cusip. I mean on what basis do we choose where the valid cusips are. On the title page? On the cover pages (the sections with questions 1 to 14 on 13d, 1 to 12 for 13g)? In the item section (such as Item 2(e), which appears in 13G)? And what if a firm consistently quotes the wrong cusip number in the same filing? Then the frequency would not be so helpful.

bdcallen commented 4 years ago

@iangow Have a look at this filing. The erroneous cusip number is in fact a valid cusip, but for the filer of the filing you quoted above, Wachovia Corp, not the subject. So Wachovia Corp have erroneously thought that items 2(d) and 2(e) apply to them, so they have quoted the cusip for their common stock.

bdcallen commented 4 years ago

@iangow I just had a closer look at the example I gave with two cusips. Both the cusips are legitimate. The first cusip appears in stocknames

> stocknames %>% filter(ncusip == "68029310")
# Source:   lazy query [?? x 16]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
  permno permco namedt     nameenddt  cusip    ncusip   ticker comnam         hexcd exchcd siccd shrcd shrcls st_date    end_date   namedum
   <int>  <int> <date>     <date>     <chr>    <chr>    <chr>  <chr>          <dbl>  <dbl> <dbl> <dbl> <chr>  <date>     <date>       <dbl>
1  59409   3416 1974-12-02 1993-01-28 68029310 68029310 OSTN   OLD STONE CORP     3      3  6711    11 NA     1974-12-31 1993-01-29       2

whilst the second, while not appearing in stocknames, does appear in cusipm.issue

crsp=# SELECT issuer_num, issue_num, issue_check, issue_desc_1, issue_desc_2
FROM cusipm.issue WHERE cusip8 = '68029330';
 issuer_num | issue_num | issue_check |     issue_desc_1     | issue_desc_2
------------+-----------+-------------+----------------------+--------------
 680293     | 30        | 5           | PFD SER B CONV $2.40 |
(1 row)

crsp=# SELECT *
FROM cusipm.issuer
crsp-# WHERE issuer_num = '680293';

 issuer_num | issuer_check | issuer_name_1  | issuer_name_2 | issuer_name_3 | issuer_adl_1 | issuer_adl_2 | issuer_adl_3 | issuer_adl_4 |        issuer_sort_key         | issuer_type | issuer_status | issuer_del_date | issuer_transaction | issuer_state_code | issuer_update_date |  cabre_id  | cabre_status | lei_cici |   legal_entity_name   | previous_name | entry_date | cp_institution_type
------------+--------------+----------------+---------------+---------------+--------------+--------------+--------------+--------------+--------------------------------+-------------+---------------+-----------------+--------------------+-------------------+--------------------+------------+--------------+----------+-----------------------+---------------+------------+---------------------
 680293     | 8            | OLD STONE CORP |               |               |              |              |              |              | OLD STONE CORP           00500 | C           | A             |                 |                    |                   |                    | US06802930 | A            |          | Old Stone Corporation |               | 1993-10-31 |
(1 row)

iangow commented 4 years ago

@bdcallen Great. I think we're making progress here.

So if you could look through 5-6 sets of CIKs that match to multiple CUSIPs in a individual filings to see whether there are any valid matches that we'd lose by excluding them, that would be great.

Obviously, ignore cases where there are obvious errors in the matched CUSIPs. See below for more discussion and some code (note that you can use mutate(cusip=unnest(cusps)) to turn an array of CUSIPs back into individual rows).

So we have (so far) identified are two cases with multiple CUSIPs in a filing:

Bad CUSIP-CIK matches (e.g., Wachovia above).
Cases where there are multiple good CUSIP-CIK matches (e.g., Old Stone Corp above).

I think the challenge would be in distinguishing these two cases. My sense is that this will be difficult and, unless we have evidence that using these provides incremental matches, I think we should just ditch all cases with multiple valid CUSIPs.

Perhaps it makes sense to look through a few cases. From the data below, I think multiple CUSIPs are often due to errors. Occasional errors are fine for the following reasons:

We only use these data to identify valid CUSIP-CIK matches.
Only if a match appears in a critical number of filings (I think 10 is about the right number), do we use it.
Errors would tend to be made by the odd firm here and there and often just once or twice. (The matches with CUSIPs of 65373J20,6573J20 [one of these must be bad] get close to being repeated enough to cause problems.)
If we lose a valid match in a given filing, odds are we find it in some other filing.
Matches to bad CUSIPs will not create problems. If our code matches as CIK of 74273 to a CUSIP of BENCALLEN, then we'd have no problem because no database is going to have an entry for a CUSIP of BENCALLEN. Matching to valid CUSIP that is the wrong CUSIP could create problems. Though to be honest, it's not hard to investigate a case where one CIK matches to two different valid CUSIPs.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik_test")

cusip_cik %>%
  mutate(cusip8 = substr(cusip, 1, 8)) %>%
  filter(cusip8 == "68029310" | cik == 74273L) %>%
  count(cusip, cik)
#> # Source:   lazy query [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Groups:   cusip
#>   cusip         cik n      
#>   <chr>       <int> <int64>
#> 1 337400105   74273  2     
#> 2 680293107   74273 46     
#> 3 680293107 1005730  2     
#> 4 680293305   74273 44     
#> 5 <NA>        74273  4

mult_cusips <-
  cusip_cik %>%
  distinct(file_name, cusip) %>%
  group_by(file_name) %>%
  summarize(num_cusips = n(),
            cusips = array_agg(cusip)) %>%
  filter(num_cusips > 1) %>%
  ungroup() %>%
  compute()

mult_cusips %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 20993

mult_cusips
#> # Source:   table<dbplyr_005> [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>    file_name                                   num_cusips cusips               
#>    <chr>                                       <int64>    <chr>                
#>  1 edgar/data/1000015/0000912057-01-004618.txt 2          {591002100,M81867109}
#>  2 edgar/data/1000015/0000912057-01-004619.txt 2          {591002100,M81867109}
#>  3 edgar/data/1000015/0000943152-97-000348.txt 2          {00431L10,59100210}  
#>  4 edgar/data/1000045/0000897069-01-500505.txt 2          {65373J20,6573J20}   
#>  5 edgar/data/1000045/0000897069-01-500506.txt 2          {65373J20,6573J20}   
#>  6 edgar/data/1000045/0000897069-01-500507.txt 2          {65373J20,6573J20}   
#>  7 edgar/data/1000045/0000897069-02-000937.txt 2          {65373J20,6573J20}   
#>  8 edgar/data/1000095/0000950135-98-002266.txt 2          {265591016,826559106}
#>  9 edgar/data/1000095/0001058217-98-000105.txt 2          {097189104,296443104}
#> 10 edgar/data/1000097/0000919574-96-000814.txt 2          {9302947,C573105}    
#> # … with more rows

^{Created on 2020-06-19 by the reprex package (v0.3.0)}

bdcallen commented 4 years ago

@iangow How do you think we should deal with an example like this? On this example, I did

i = 3000
get_cusip_cik(cases_df.loc[i, 'file_name'])

and the result is

file_name	cusip	check_digit	cik	company_name	formats
edgar/data/315032/0000315032-94-000005.txt	000886101	1	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	039483102	2	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	018490102	2	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	009158106	6	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	097383103	3	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	090613100	0	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	059685107	7	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	066704107	7	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	05564E106	6	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	139859102	2	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	252165105	5	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	292845104	4	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	390568103	3	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	219350105	5	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	361448103	3	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	422893107	7	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	438092108	8	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	423452101	1	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	459506101	1	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	481088102	2	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	482620101	1	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	530175108	8	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	579780206	6	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	602720104	4	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	651426108	8	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	670346105	5	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	740189105	5	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	730448107	7	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	709903108	8	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	781088109	9	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	775133101	1	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	754603108	8	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	783759103	3	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	826552101	1	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	790161103	3	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	817338106	6	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A
edgar/data/315032/0000315032-94-000005.txt	929160109	9	315032	STATE FARM MUTUAL AUTOMOBILE INSURANCE CO	A

If you look at the actual filing, it seems to be a bunch of 13G's stuck together, but the filer has lazily put the subject cik and name to that of itself.

For the record, in cusip_cik_old, the entry for this filing is

crsp=# SELECT * FROM edgar.cusip_cik_old
crsp-# WHERE file_name = 'edgar/data/315032/0000315032-94-000005.txt';

                 file_name                  |   cusip   |  cik   |               company_name                | format
--------------------------------------------+-----------+--------+-------------------------------------------+--------
 edgar/data/315032/0000315032-94-000005.txt | 000886101 | 315032 | STATE FARM MUTUAL AUTOMOBILE INSURANCE CO | A
(1 row)

iangow commented 4 years ago

@bdcallen I think you're asking: "How do you think we should deal with filings that combine a number of 13Gs and where the filer has put the its own CIK and name as the subject CIK and name?"

I think it depends how these show up. If they yield cases where one file_name has multiple CIK-CUSIP combinations, then we end up throwing those out. If these produce CUSIPs that map to multiple CIKs, then I think we can flag those later and sort them out in a different process. Either way, the issues should be evident from the data in edgar.cusip_cik and we should make an issue to sort out the problem this creates. What I don't think make sense is to try to parse out information that really can't be parsed; in this case "fixing" the CIKs can't be done.

bdcallen commented 4 years ago

@iangow

What I don't think make sense is to try to parse out information that really can't be parsed; in this case "fixing" the CIKs can't be done.

We could in theory extract the issuer names in the title sections of each part, and try to extract the cik by searching for it by name, but I don't think this is worth the effort for this table, since we likely have the same cik-cusip matches from other forms, and that matching records by company name can be messy.

iangow commented 4 years ago

Yes, this should just be extracting CIKs and CUSIPs. Matching names is a bridge too far for sure.

bdcallen commented 4 years ago

@iangow

I just made these definitions in RStudio

cusip_cik <- tbl(pg, sql("SELECT * FROM edgar.cusip_cik"))
multi_cusips <- cusip_cik %>% group_by(file_name) %>% summarise(num_cusips = n()) %>% filter(num_cusips > 1) %>% inner_join(cusip_cik) 
single_cusips <- cusip_cik %>% group_by(file_name) %>% summarise(num_cusips = n()) %>% filter(num_cusips == 1) %>% inner_join(cusip_cik) 
distinct_multi <- multi_cusips %>% distinct(cik, cusip)
distinct_single <- single_cusips %>% distinct(cik, cusip)

and then did the following computations

> cusip_cik %>% distinct(cik) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1 26898
> distinct_single %>% distinct(cik) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1 26684
> distinct_multi %>% anti_join(distinct_single, by = 'cik') %>% distinct(cik) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1   214
> distinct_multi %>% distinct(cik) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1  7017

So only 214 out of the 26898 filings in cusip_cik cannot be found amongst filings which produce a single cusip-cik match, and can only be found amongst filings in which there is more than one cusip found.

Doing the analogous for cusips

> cusip_cik %>% distinct(cusip) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1 47924
> distinct_single %>% distinct(cusip) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1 42954
> distinct_multi %>% anti_join(distinct_single, by = 'cusip') %>% distinct(cusip) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1  4970

it seems around 10% of cusips can only be found in filings producing multiple cusips.

Finally, doing the same for (cik, cusip) pairs, I got

> cusip_cik %>% distinct(cik, cusip) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1 68135
> distinct_single %>% distinct(cik, cusip) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1 57719
> distinct_multi %>% anti_join(distinct_single, by = c('cik', 'cusip')) %>% distinct(cik, cusip) %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1 10416

So around 14% of pairs can only be found amongst the filings producing multiple cusips.

Then I did

> cik_only_mult <- distinct_multi %>% anti_join(distinct_single, by = 'cik') %>% distinct(cik) 
> cusip_only_mult <- distinct_multi %>% anti_join(distinct_single, by = 'cusip') %>% distinct(cusip) 
> distinct_multi %>% anti_join(distinct_single, by = c('cik', 'cusip')) %>% distinct(cik, cusip) %>% anti_join(cik_only_mult) %>% anti_join(cusip_only_mult) %>% count()
Joining, by = "cik"
Joining, by = "cusip"
# Source:   lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1  5022

So around half of the 10000 so cases only found in filings with multiple cusips are those between ciks that appear in distinct_single and cusips that also appear in distinct_single. It would be interesting to see exactly how many of these are mistakes.

bdcallen commented 4 years ago

@iangow I then did this

multi_cusips <- multi_cusips %>% mutate(cusip6 = substr(cusip, 1, 6))
num_cusip6_df <- multi_cusips %>% distinct(file_name, cusip6) %>% group_by(file_name) %>% summarise(num_cusip6 = n())
multi_cusips_df <- multi_cusips %>% inner_join(num_cusip6_df, by = 'file_name') %>% collect()

and this

> multi_cusips_df %>% filter(num_cusip6 > 1) %>% distinct(file_name) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 28433
> multi_cusips_df %>% filter(num_cusip6 == 1) %>% distinct(file_name) %>% count()
# A tibble: 1 x 1
      n
  <int>
1  8958
> multi_cusips_df %>% distinct(file_name) %>% count()
# A tibble: 1 x 1
      n
  <int>
1 37391

> multi_cusips_df %>% filter(num_cusip6 == 1) %>% group_by(file_name) %>% summarise(diff = n() - sum(num_cusip_char == 9)) %>% filter(diff > 0) %>% count()
# A tibble: 1 x 1
      n
  <int>
1  5568

I had already mentioned that I think there should one cusip per security mentioned in each 13D and 13G. But I would further add to this that, assuming there is only one subject company per filing, there should only be one cusip6 per filing, which should be common to all securities mentioned in the filing. The cusip6, the first 6 characters of a cusip, identifies the issuer, hence there should only be one.

In the analysis above, I have shown that around 70% of cases with multiple cusips have more than one cusip6. Further, amongst the 30 percent that had one cusip6, around 60% had cusips with less than 9 characters. I think the next thing to look at amongst the remaining cases is how many are due to a mispelling.

bdcallen commented 4 years ago

@iangow I guess the main question I have so far in the analysis above, is are we willing to cut our losses in getting rid of cases with multiple cusips. Only for a minutely small number of ciks (214, to be exact, out of 26898) is it necessary to process the filings with multiple cusips to make a mapping to a cusip at all. From thereon, I'm attempting to prove out of the remaining cases where there is precisely one cusip6, that these too are mostly mistakes (such has having typos, missing characters and so on).

bdcallen commented 4 years ago

@iangow I am perhaps wrong about the assumption that each cik should be one to one with a cusip6

crsp=# SELECT * FROM cusipm.issuer
WHERE issuer_num IN ('06738C', '06739F');

 issuer_num | issuer_check |  issuer_name_1  | issuer_name_2 | issuer_name_3 |       issuer_adl_1        | issuer_adl_2 |        issuer_adl_3        | issuer_adl_4 |        issuer_sort_key         | issuer_type | issuer_status | issuer_del_date | issuer_transaction | issuer_state_code | issuer_update_date | cabre_id | cabre_status | lei_cici | legal_entity_name | previous_name | entry_date | cp_institution_type
------------+--------------+-----------------+---------------+---------------+---------------------------+--------------+----------------------------+--------------+--------------------------------+-------------+---------------+-----------------+--------------------+-------------------+--------------------+----------+--------------+----------+-------------------+---------------+------------+---------------------
 06738C     | 0            | BARCLAYS BK PLC |               |               | FOR NEW EQUITY ISSUES SEE | 06739F       | FOR DEBT ISSUES SEE 06739G |              | BARCLAYS BANK PLC        00500 | C           | A             |                 |                    |                   |                    |          |              |          |                   |               | 1993-10-31 |
 06739F     | 2            | BARCLAYS BK PLC |               |               |                           |              |                            |              | BARCLAYS BANK P L C      00500 | C           | A             |                 |                    |                   |                    |          |              |          |                   |               | 2005-08-24 |
(2 rows)

Still, it should hold for most cases amongst 13D and 13G filings.

bdcallen commented 4 years ago

@iangow

My sense is that this will be difficult and, unless we have evidence that using these provides incremental matches, I think we should just ditch all cases with multiple valid CUSIPs.

I'm also leaning towards this now as well, for the purpose of this table. I think the original motivation for you in using Schedule 13D and 13G was that usually there is a single issuer, and usually a single Cusip number to go along with it. For other filings where cusips appear, this is not the case, such as 13F, in which a single filer lists cusips for sometimes hundreds of securities, without also listing the issuer ciks, making the mapping we're interested in impossible.

I have shown in the above that less than 1% of ciks in the cusip_cik are unmapped, if we focus on those with single cusips. I think this is a small cost, especially when put beside the alternate cost of matching ciks to the wrong cusip.

I will rewrite the code to omit these cases. I think the work that remains is to check that we don't also have wrong (as opposed to just bad) cusips mapped to ciks in the remaining filings.

bdcallen commented 4 years ago

@iangow This fix seems to be simple enough. This is the function which currently extracts the cusips given the text of a filings

def extract_cusips(text):

    cusip_hdr = r'CUSIP\s+(?:NO\.|#|NUMBER)[:]?'
    cusip_fmt = '((?:[0-9A-Z]{1}[ -]{0,3}){6,12})'

    regex_dict = {'A': cusip_fmt + r'[\s\r\t\n]*[_\.-]?\s*(?:[_\.-]{9,})?[\s\r\t\n]*' +  \
    r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)\s*\n',
                  'B': cusip_fmt + '[\s\t\r]*[\n]?' + r'[\s\t\r]*' +  \
    r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)\s*\n',
                  'C': '[\s_]+' + cusip_hdr + '[ _]{0,50}' + cusip_fmt + '\s+',
                  'D': '[\s_]+' + cusip_hdr + '(?:\n[\s_]{0,50}){1,2}' + cusip_fmt + '\s+'
                 }

    df_list = []

    for key, regex in regex_dict.items():

        matches = re.findall(regex, text.upper())

        cusips = [re.sub('[^0-9A-Z]', '', match) for match in matches if len(match) > 0]
        check_digits = [calculate_cusip_check_digit(cusip) for cusip in cusips]

        if(len(cusips)):
            df = pd.DataFrame({'cusip': cusips, 'check_digit': check_digits})
            df['format'] = key
            df = df[["cusip", "check_digit", "format"]]

        else:
            df = pd.DataFrame({"cusip": [], "check_digit": [], "format": []})

        df_list.append(df)

    full_df = pd.concat(df_list)

    if(full_df.shape[0]):

        formats = full_df.groupby('cusip').apply(lambda x: ''.join(x['format'].unique().tolist()))
        full_df['formats'] = full_df['cusip'].apply(lambda x: formats[x])
        full_df = full_df[["cusip", "check_digit", "formats"]]
        full_df = full_df.drop_duplicates().reset_index(drop = True)

        return(full_df)

    else:

        full_df = pd.DataFrame({"cusip": [None], "check_digit": [None], "formats": [None]})

    return(full_df)

Seems to me we just have to change the condition if(full_df.shape[0]) to if(full_df.shape[0] == 1). Though perhaps you might prefer a slightly different output for the cases with more than one cusip to those that have none?

bdcallen commented 4 years ago

@iangow Actually, the above solution overcorrects. if(full_df.shape[0] and len(full_df['cusip'].unique()) == 1) seems to be the correct condition.

iangow commented 4 years ago

@bdcallen:

It's important to evaluate what valid CUSIP-CIK matches we'd lose with various approaches to ditching data from "multiple CUSIP" filings.

I think for now, we should stop futzing around with the Python code (important note: I assume that edgar.cusip_cik has the multiple-CUSIP data in it for now) and evaluate different rules using the PostgreSQL data, identify the ones that look best then implement them in the Python code. In terms of what "looks best", I think that if there is one check-digit-valid nine-digit CUSIP in a file_name then we might want the code to keep that and ditch the rest. Also (looking at the very bottom), if there are no valid CUSIPs beginning PAGE, then we should consider ditching those (evaluate effects using PostgreSQL, then ditch in the Python code).

Details

Many of the cases with "multiple CUSIPs" are just different versions of the same CUSIP (e.g., a six-digit partial match, and a nine-digit complete match). In these cases, we should keep the eight- or nine-digit match for a filing and then ditch the six-digit one.

I think if we have multiple distinct CUSIPs (evaluated as distinct if substr(x, 1, 6) is different), then perhaps we should either throw away the results for that file_name or keep the results and flag them in some way (e.g., add a column such as mult_cusips with boolean type) so we can throw them away later. I think the latter option would be easiest for now, as we could easily add the column for existing rows using some SQL and then just use Python to populate that for filings processed in the future.

Also, if we find a single nine-digit CUSIP, then that might be the one to keep, ditching all the rest.

I think it might make sense to evaluate rules using the complete data we have in PostgreSQL. For example, we can easily evaluate what we'd lose if we deleted all CUSIPs except the valid nine-digit ones from filings that have those.

(Note that you write if full_df.shape[0] and len(full_df['cusip'].unique()) == 1: in Python; no brackets.)

library(dplyr, warn.conflicts = FALSE)
library(DBI)

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik_test")
filings <- tbl(pg, "filings")
stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))

multi_cusips <- 
    cusip_cik %>% 
    mutate(cusip6 = substr(cusip, 1, 6)) %>%
    distinct(file_name, cik, cusip6) %>%
    group_by(file_name) %>% 
    filter(n() > 1) %>%
    ungroup() %>%
    compute()

single_cusips <- 
    cusip_cik %>% 
    mutate(cusip6 = substr(cusip, 1, 6)) %>%
    distinct(file_name, cik, cusip6) %>%
    group_by(file_name) %>% 
    filter(n() == 1) %>%
    ungroup() %>%
    compute()

distinct_multi <- 
    multi_cusips %>% 
    group_by(cik, cusip6) %>%
    summarize(count = n()) %>%
    filter(count > 10) %>%
    ungroup() %>%
    select(-count) %>%
    compute()

distinct_single <- 
    single_cusips %>% 
    group_by(cik, cusip6) %>%
    summarize(count = n()) %>%
    filter(count > 10) %>%
    ungroup() %>%
    select(-count) %>%
    compute()

distinct_single %>% 
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 21338

distinct_multi %>% 
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 418

distinct_multi %>% 
    anti_join(distinct_single) %>%
    arrange(cik)
#> Joining, by = c("cik", "cusip6")
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: cik
#>      cik cusip6
#>    <int> <chr> 
#>  1 30697 929903
#>  2 37643 929903
#>  3 39911 375175
#>  4 41296 745031
#>  5 43196 002535
#>  6 49600 004239
#>  7 52848 769667
#>  8 52848 929903
#>  9 60064 125891
#> 10 63330 577346
#> # … with more rows

distinct_multi %>% 
    anti_join(distinct_single) %>%
    count()
#> Joining, by = c("cik", "cusip6")
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 202

# Clearly invalid CUSIPs: SEECOVER PAGE2OF4 0PARVALUE
# What about some of the others?
filings %>% 
    filter(cik == 3545) %>% 
    count(company_name)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   company_name n      
#>   <chr>        <int64>
#> 1 ALICO INC    1494   
#> 2 ALICO, INC.    32

stocknames %>% 
    filter(substr(ncusip, 1, 6) == "016301") %>%
    select(permno, comnam)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # … with 2 variables: permno <int>, comnam <chr>

filings %>% 
    filter(cik == 6715L) %>% 
    count(company_name)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   company_name   n      
#>   <chr>          <int64>
#> 1 ANTHONY C R CO 22

stocknames %>% 
    filter(substr(ncusip, 1, 6) == "036776") %>%
    select(permno, comnam)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   permno comnam        
#>    <int> <chr>         
#> 1  83854 C R ANTHONY CO

filings %>%
    filter(cik ==  16496L) %>%
    count(company_name)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   company_name n      
#>   <chr>        <int64>
#> 1 CALPROP CORP 171

stocknames %>% 
    filter(substr(ncusip, 1, 6) == "131521") %>%
    select(permno, comnam)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # … with 2 variables: permno <int>, comnam <chr>

cusip_cik %>% 
    filter(cik == 6715L) %>% 
    group_by(file_name) %>% 
    summarize(cusips = sql("array_agg(DISTINCT cusip ORDER BY cusip)")) %>% 
    print(n=Inf)
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>    file_name                                  cusips           
#>    <chr>                                      <pq__text>       
#>  1 edgar/data/6715/0000890566-97-000407.txt   {03677610}       
#>  2 edgar/data/6715/0000890587-97-000026.txt   {036776102}      
#>  3 edgar/data/6715/0000912057-97-008283.txt   {036776102}      
#>  4 edgar/data/6715/0000950109-96-004342.txt   {036776,PAGE1OF} 
#>  5 edgar/data/6715/0000950131-96-003403.txt   {036776,PAGE2OF4}
#>  6 edgar/data/6715/0000950131-96-006325.txt   {036776,PAGE2OF4}
#>  7 edgar/data/6715/0000950131-96-006482.txt   {036776102}      
#>  8 edgar/data/6715/0000950131-97-001890.txt   {036776102}      
#>  9 edgar/data/6715/0000950131-97-002663.txt   {036776102}      
#> 10 edgar/data/6715/0000950131-97-003246.txt   {036776102}      
#> 11 edgar/data/6885/0000890566-97-000407.txt   {03677610}       
#> 12 edgar/data/884099/0000950109-96-004342.txt {036776,PAGE1OF} 
#> 13 edgar/data/884099/0000950131-96-003403.txt {036776,PAGE2OF4}
#> 14 edgar/data/884099/0000950131-96-006325.txt {036776,PAGE2OF4}
#> 15 edgar/data/884099/0000950131-96-006482.txt {036776102}      
#> 16 edgar/data/884099/0000950131-97-001890.txt {036776102}      
#> 17 edgar/data/884099/0000950131-97-002663.txt {036776102}      
#> 18 edgar/data/884099/0000950131-97-003246.txt {036776102}      
#> 19 edgar/data/903954/0000912057-97-008283.txt {036776102}      
#> 20 edgar/data/904197/0000890587-97-000026.txt {036776102}

^{Created on 2020-06-19 by the reprex package (v0.3.0)}

bdcallen commented 4 years ago

@iangow

Many of the cases with "multiple CUSIPs" are just different versions of the same CUSIP (e.g., a six-digit partial match, and a nine-digit complete match). In these cases, we should keep the eight- or nine-digit match for a filing and then ditch the six-digit one

I have been looking at the cases lately with the following redefinitions

only_mult_cik_df <- distinct_multi %>% anti_join(distinct_single, by = 'cik') %>% distinct(cik) %>% collect()

single_cusip6_df <- multi_cusips_df %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
    distinct(file_name, cusip6) %>% group_by(file_name) %>%
    summarise(num_cusip6 = n()) %>% filter(num_cusip6 == 1) %>%
    inner_join(multi_cusips_df) %>% collect()

multi_cusips6_df <- multi_cusips_df %>% mutate(cusip6 = substr(cusip, 1, 6)) %>%
    distinct(file_name, cusip6) %>% group_by(file_name) %>%
    summarise(num_cusip6 = n()) %>% filter(num_cusip6 > 1) %>%
    inner_join(multi_cusips_df) %>% mutate(nchar = nchar(cusip)) %>% collect()

multi_cusips6_df <- multi_cusips6_df %>% mutate(cusip6 = substr(cusip, 1, 6))

single_cusip6_df contains the cases with just those with a single cusip6, and it's these cases that should correspond to those mentioned in that comment. If we keep these cases, we get matches for 52 of the ciks in the set only found amongst filings with multiple cusips.

> single_cusip6_df %>% inner_join(only_mult_cik_df, by = 'cik') %>% distinct(cik) %>% count()
# A tibble: 1 x 1
      n
  <int>
1    52

I don't think it would be hard to amend the code to handle these cases. Deciphering those in multi_cusips6_df, on the other hand, would probably be rather difficult. I think we should perhaps flag these cases, as you have mentioned elsewhere. There are the obviously troublesome cases where the number of cusip6's is equal to the number of cusips

> multi_cusips6_df %>% filter(num_cusip6 == num_cusips)
# A tibble: 58,663 x 10
   file_name                                   num_cusip6 num_cusips cusip     check_digit     cik company_name           formats nchar cusip6
   <chr>                                            <int>      <dbl> <chr>           <int>   <int> <chr>                  <chr>   <int> <chr> 
 1 edgar/data/1000015/0000912057-01-004618.txt          3          3 591002100           0 1000015 META GROUP INC         AC          9 591002
 2 edgar/data/1000015/0000912057-01-004618.txt          3          3 M81876710           2 1000015 META GROUP INC         C           9 M81876
 3 edgar/data/1000015/0000912057-01-004618.txt          3          3 M81867109           9 1000015 META GROUP INC         C           9 M81867
 4 edgar/data/1000015/0000912057-01-004619.txt          2          2 591002100           0 1000015 META GROUP INC         AC          9 591002
 5 edgar/data/1000015/0000912057-01-004619.txt          2          2 M81867109           9 1000015 META GROUP INC         C           9 M81867
 6 edgar/data/1000015/0000943152-97-000348.txt          2          2 59100210            0 1000015 META GROUP INC         ABC         8 591002
 7 edgar/data/1000015/0000943152-97-000348.txt          2          2 00431L10            2 1000015 META GROUP INC         C           8 00431L
 8 edgar/data/1000045/0000897069-01-500505.txt          2          2 65373J20            9 1000045 NICHOLAS FINANCIAL INC ABD         8 65373J
 9 edgar/data/1000045/0000897069-01-500505.txt          2          2 6573J20             6 1000045 NICHOLAS FINANCIAL INC C           7 6573J2
10 edgar/data/1000045/0000897069-01-500506.txt          2          2 65373J20            9 1000045 NICHOLAS FINANCIAL INC ABD         8 65373J

but even cases where the number of cusip6's is less seem to be messy, at least in the first filing mentioned

> multi_cusips6_df %>% filter(num_cusip6 < num_cusips)
# A tibble: 893 x 10
   file_name                                   num_cusip6 num_cusips cusip     check_digit     cik company_name                       formats nchar cusip6
   <chr>                                            <int>      <dbl> <chr>           <int>   <int> <chr>                              <chr>   <int> <chr> 
 1 edgar/data/1001606/0000950144-99-011323.txt          3          5 09580105            6 1001606 "BLOUNT INTERNATIONAL INC"         C           8 095801
 2 edgar/data/1001606/0000950144-99-011323.txt          3          5 095177101           1 1001606 "BLOUNT INTERNATIONAL INC"         C           9 095177
 3 edgar/data/1001606/0000950144-99-011323.txt          3          5 095177200           0 1001606 "BLOUNT INTERNATIONAL INC"         C           9 095177
 4 edgar/data/1001606/0000950144-99-011323.txt          3          5 095180105           5 1001606 "BLOUNT INTERNATIONAL INC"         C           9 095180
 5 edgar/data/1001606/0000950144-99-011323.txt          3          5 095177100           1 1001606 "BLOUNT INTERNATIONAL INC"         C           9 095177
 6 edgar/data/1006195/0000950153-96-000980.txt          2          3 402803118           8 1006195 "GUMTECH INTERNATIONAL INC \\UT\\" AC          9 402803
 7 edgar/data/1006195/0000950153-96-000980.txt          2          3 402803100           0 1006195 "GUMTECH INTERNATIONAL INC \\UT\\" A           9 402803
 8 edgar/data/1006195/0000950153-96-000980.txt          2          3 402603100           4 1006195 "GUMTECH INTERNATIONAL INC \\UT\\" C           9 402603
 9 edgar/data/1011632/0001042046-97-000004.txt          2          3 714006103           3 1011632 "PERITUS SOFTWARE SERVICES INC"    A           9 714006
10 edgar/data/1011632/0001042046-97-000004.txt          2          3 469858401           1 1011632 "PERITUS SOFTWARE SERVICES INC"    C           9 469858
# … with 883 more rows

iangow commented 4 years ago

@bdcallen

Many of the cases with "multiple CUSIPs" are just different versions of the same CUSIP (e.g., a six-digit partial match, and a nine-digit complete match). In these cases, we should keep the eight- or nine-digit match for a filing and then ditch the six-digit one

I have been looking at the cases lately with the following redefinitions

So I updated the code examples above so that they refer to cusip_cik_test which (for now) just implements the resolution of #82 (see #93 for more details). This reduces the number of issues quite a bit.

I will hold this issue open for now, but I think it should be resolved after #86, etc.