Closed iangow closed 4 years ago
@bdcallen
So I looked into the first one. It seems that the first eight characters provide a valid CIK-CUSIP match that should be on the new table.
After digging into this one, I changed the code above to focus on common CIK-CUSIP pairs not found in the new table. The one below is not there.
(Note that the earlier version of the code, which you can get by looking at history in the comment above, probably does identify problems, as these valid CUSIP-CIK pairs that we're missing for certain filings. But for now, I think we should focus on cases where we're losing pairs.)
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
filings <- tbl(pg, "filings")
stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
stocknames %>%
filter(ncusip == "48273010") %>%
select(comnam)
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> comnam
#> <chr>
#> 1 K TRON INTL INC
filings %>%
filter(cik==20) %>%
count(company_name)
#> # Source: lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> company_name n
#> <chr> <int64>
#> 1 K TRON INTERNATIONAL INC 401
Created on 2020-04-12 by the reprex package (v0.3.0)
Looking at the list in the first column, it seems that the first two are valid except for the issue of case (the CUSIPS should be 29874M103
and 81631Y102
respectively). Perhaps the new code picks these up and fixes the case? If so, fine.
The next three appear to be duds. But the one after that seems to be a valid CUSIP-CIK combination. Not sure why the new code isn't picking it up (though, since it's there in nine-digit form, perhaps this isn't too great a concern).
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
filings <- tbl(pg, "filings")
stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))
cusip_cik <- tbl(pg, "cusip_cik")
stocknames %>%
filter(ncusip == "00490710") %>%
select(comnam)
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> comnam
#> <chr>
#> 1 ACORN VENTURE CAPITAL CORP
filings %>%
filter(cik==737243) %>%
count(company_name)
#> # Source: lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> company_name n
#> <chr> <int64>
#> 1 ACORN HOLDING CORP 87
#> 2 ACORN VENTURE CAPITAL CORP 58
#> 3 Valentec Systems, Inc 38
cusip_cik %>%
filter(cik==737243) %>%
count(cusip)
#> # Source: lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> cusip n
#> <chr> <int64>
#> 1 004853107 10
#> 2 <NA> 64
Created on 2020-04-12 by the reprex package (v0.3.0)
@bdcallen Note that ncusip
on crsp.stocknames
is only the first eight digits of the CUSIP (i.e., no check digit).
@iangow
> case_matches <- old_matches_lost %>% mutate(cusip = toupper(cusip)) %>% inner_join(cusip_cik) %>% collect()
> case_matches %>% distinct(cik, cusip)
# A tibble: 57 x 2
cik cusip
<int> <chr>
1 1015483 26784F103
2 1095330 87155S108
3 919794 92672P108
4 914478 79604V105
5 1054836 M47095100
6 920990 59163F105
7 1016838 M81865103
8 913293 01877H100
9 1057083 69325Q105
10 906551 74726X105
# … with 47 more rows
So 57 of the 415 cases are cases where the new code has missed the lower case letters but retained the upper case letters. I seem to have found the offending line
cusip_fmt = '((?:[0-9A-Z]{1}[ -]{0,3}){6,9})'
So my program only looked for cases where the letters where in upper case. I could fix this by either looking for lower case letters as well, or just converting the text searched to upper case.
@iangow It seems that some old filings may be missing. I just did this for one of the cik, cusip pairs on the list
crsp=# SELECT * FROM edgar.cusip_cik_old
crsp-# WHERE cik = 317438;
file_name | cusip | cik | company_name | format
---------------------------------------------+-----------+--------+----------------+--------
edgar/data/1078649/0001050502-02-000601.txt | | 317438 | CLX ENERGY INC |
edgar/data/1079197/0001076636-99-000032.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/1079199/0001076636-99-000027.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000019.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000020.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000021.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000022.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000023.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000024.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000026.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000027.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000032.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001076636-99-000033.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/1079188/0001076636-99-000026.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001050502-02-000601.txt | | 317438 | CLX ENERGY INC |
edgar/data/1079186/0001076636-99-000024.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/1079198/0001076636-99-000020.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/1079190/0001076636-99-000023.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/1079193/0001076636-99-000033.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/1079196/0001076636-99-000019.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/1079195/0001076636-99-000021.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/1079187/0001076636-99-000022.txt | 125649202 | 317438 | CLX ENERGY INC | C
edgar/data/317438/0001050502-02-000601.txt | | 317438 | CLX ENERGY INC | C
edgar/data/1078649/0001050502-02-000601.txt | | 317438 | CLX ENERGY INC | C
(24 rows)
crsp=# ;
crsp=# SELECT * FROM edgar.cusip_cik
crsp-# WHERE file_name = 'edgar/data/317438/0001076636-99-000021.txt';
file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)
crsp=# SELECT * FROM edgar.filings
WHERE file_name = 'edgar/data/317438/0001076636-99-000021.txt';
company_name | form_type | cik | date_filed | file_name
--------------+-----------+-----+------------+-----------
(0 rows)
So where have the filings gone, if their no longer in the bulk extract?
@iangow
> old_filings_lost <- cusip_cik_old %>% anti_join(cusip_cik, by = 'file_name')
> old_filings_lost %>% distinct(cik, cusip) %>% inner_join(old_matches_lost)
Joining, by = c("cik", "cusip")
# Source: lazy query [?? x 3]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
cik cusip n
<int> <chr> <dbl>
1 20639 023164710 18
2 25890 228 16
3 57201 22406P10 24
4 59593 5351711 42
5 81061 None 22
6 89140 817732321 34
7 105936 95751D310 12
8 108516 9818111 21
9 278128 NONE 14
10 313499 NONE 12
# … with more rows
> old_filings_lost %>% distinct(cik, cusip) %>% inner_join(old_matches_lost) %>% count()
Joining, by = c("cik", "cusip")
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 93
> filings <- tbl(pg, sql("SELECT * FROM edgar.filings"))
> old_filings_lost %>% inner_join(filings, by = 'file_name')
# Source: lazy query [?? x 9]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
file_name cusip cik.x company_name.x format company_name.y form_type cik.y date_filed
<chr> <chr> <int> <chr> <chr> <chr> <chr> <int> <date>
1 edgar/data/1332905/0001172661-17-000448.txt 56064Y100 1094742 MAINSTREET BANKSHARES INC C RMB Capital Management, LLC SC 13G 1332905 2017-02-13
2 edgar/data/1094742/0001172661-17-000448.txt 56064Y100 1094742 MAINSTREET BANKSHARES INC C MAINSTREET BANKSHARES INC SC 13G 1094742 2017-02-13
3 edgar/data/1332905/0001172661-18-000468.txt 56064Y100 1094742 MAINSTREET BANKSHARES INC C RMB Capital Management, LLC SC 13G/A 1332905 2018-02-13
4 edgar/data/1094742/0001172661-18-000468.txt 56064Y100 1094742 MAINSTREET BANKSHARES INC C MAINSTREET BANKSHARES INC SC 13G/A 1094742 2018-02-13
> old_filings_lost %>% inner_join(filings, by = 'file_name') %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 4
> old_filings_lost %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 39984
So the vast majority of filings in cusip_cik_old
which are missing in cusip_cik
are in fact no longer in filings
either.
@iangow So the only four filings still in filings account for one cusip, cik pair (cik.x is the cik in cusip_cik_old). So 92 of the 415 pairs are from missing filings.
@iangow
> old_matches_lost %>% filter(nchar(cusip) < 6 | grepl('(not|none|com|december|common)', tolower(cusip))) %>% anti_join(old_filings_lost %>% distinct(cik, cusip)) %>% count()
Joining, by = c("cik", "cusip")
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 82
There seem to be a substantion number of cases with stopwords, or useless cusips (minimum number of characters should be 6, because that is the minimum needed from a cusip to specify an issuer).
@iangow matches by case adds another 42
> case_matches <- old_matches_lost %>% mutate(cusip = toupper(cusip)) %>% inner_join(cusip_cik)
Joining, by = c("cik", "cusip")
> case_matches %>% anti_join(old_filings_lost %>% distinct(cik, cusip)) %>% filter(!(nchar(cusip) < 6 | grepl('(not|none|com|december|common)', tolower(cusip)))) %>% count()
Joining, by = c("cik", "cusip")
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 3143
> case_matches %>% anti_join(old_filings_lost %>% distinct(cik, cusip)) %>% filter(!(nchar(cusip) < 6 | grepl('(not|none|com|december|common)', tolower(cusip)))) %>% distinct(cik, cusip) %>% count()
Joining, by = c("cik", "cusip")
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 42
So the problematic cases come to around 93 + 42 + 82 = 217, which is around half the cases in old_matches_lost
@iangow
> outstanding_df <- old_matches_lost %>% anti_join(case_matches %>% distinct(cik, cusip)) %>% anti_join(old_filings_lost %>% distinct(cik, cusip)) %>% filter(!(nchar(cusip) < 6 | grepl('(not|none|com|december|common|schedule|inapplicable)', tolower(cusip)))) %>% distinct(cik, cusip)
> outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.y %~% cusip.x) %>% distinct(cik, cusip.x) %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 69
So for an additional 69 cases, the cusip in the old table is contained inside a cusip of the new table.
@iangow
> outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.x %~% cusip.y) %>% distinct(cik, cusip.x) %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 24
> outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.x %~% cusip.y & cusip.y %~% cusip.x) %>% distinct(cik, cusip.x) %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 0
and an additional 24 in which the new cusip is containd in the old one. I should check these
> outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.x %~% cusip.y) %>% distinct(cik, cusip.x, cusip.y)
# Source: lazy query [?? x 3]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
cik cusip.x cusip.y
<int> <chr> <chr>
1 65297 05873J101 05873J10
2 808015 59484E1 59484E
3 823546 088861L1 088861L
4 831532 450707101 45070710
5 852952 9196105 919610
6 1005757 126341091 12634109
7 1009356 795435100 79543510
8 1010418 896411610 89641161
9 1012477 03384881 338488
10 1015172 05462R10 05462R1
# … with more rows
> new_has_old <- outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.y %~% cusip.x) %>% distinct(cik, cusip.x)
> old_has_new <- outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.x %~% cusip.y) %>% distinct(cik, cusip.x)
> old_has_new %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 24
> old_has_new %>% inner_join(new_has_old)
Joining, by = c("cik", "cusip.x")
# Source: lazy query [?? x 2]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
cik cusip.x
<int> <chr>
1 808015 59484E1
2 1012477 03384881
3 1015172 05462R10
4 1034669 3377562
5 1110650 G393001
6 1119744 M975401
7 1282977 3952591
8 1436351 74623L1
> old_has_new %>% inner_join(new_has_old) %>% count()
Joining, by = c("cik", "cusip.x")
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 8
So a third of the old_has_new
cases match other cusips in new_has_old
So, I expanded the definition for new_has_old
and old_has_new
, I noticed I didn't account for the cases where the the old cusip had letters in lowercase. Thus I did the following
> new_has_old <- outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(cusip.y %~% toupper(cusip.x)) %>% distinct(cik, cusip.x)
> old_has_new <- outstanding_df %>% inner_join(cusip_cik, by = 'cik') %>% filter(toupper(cusip.x) %~% cusip.y) %>% distinct(cik, cusip.x)
> old_has_new %>% anti_join(new_has_old)
Joining, by = c("cik", "cusip.x")
# Source: lazy query [?? x 2]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
cik cusip.x
<int> <chr>
1 65297 05873J101
2 823546 088861L1
3 831532 450707101
4 852952 9196105
5 1005757 126341091
6 1009356 795435100
7 1010418 896411610
8 1038186 09255C403
9 1089590 645378H10
10 1113679 873942A10
# … with more rows
> new_has_old %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 124
> old_has_new %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 77
> outstanding_df %>% count()
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 237
> old_has_new %>% inner_join(new_has_old) %>% count()
Joining, by = c("cik", "cusip.x")
# Source: lazy query [?? x 1]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
n
<dbl>
1 63
> old_has_new %>% inner_join(new_has_old)
Joining, by = c("cik", "cusip.x")
# Source: lazy query [?? x 2]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
cik cusip.x
<int> <chr>
1 724606 69840w108
2 745287 45881k104
3 749647 15117n305
4 770944 05873k108
5 808015 59484E1
6 811640 20602t106
7 812890 58445p105
8 816761 88076w103
9 826326 03878k207
10 860510 29874m103
# … with more rows
> old_has_new %>% inner_join(new_has_old) %>% collect()
Joining, by = c("cik", "cusip.x")
# A tibble: 63 x 2
cik cusip.x
* <int> <chr>
1 724606 69840w108
2 745287 45881k104
3 749647 15117n305
4 770944 05873k108
5 808015 59484E1
6 811640 20602t106
7 812890 58445p105
8 816761 88076w103
9 826326 03878k207
10 860510 29874m103
# … with 53 more rows
> old_has_new %>% anti_join(new_has_old) %>% collect()
Joining, by = c("cik", "cusip.x")
# A tibble: 14 x 2
cik cusip.x
* <int> <chr>
1 65297 05873J101
2 823546 088861L1
3 831532 450707101
4 852952 9196105
5 1005757 126341091
6 1009356 795435100
7 1010418 896411610
8 1038186 09255C403
9 1089590 645378H10
10 1113679 873942A10
11 1125259 143658300
12 1348259 BMG9319H1
13 1488775 1846922
14 1488775 1846923
I then looked for the cusip BMG9319H1
as an example
crsp=# SELECT * FROM edgar.cusip_cik
WHERE cusip = 'BMG9319H1';
file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)
crsp=# SELECT * FROM edgar.cusip_cik_old
WHERE cusip = 'BMG9319H1';
file_name | cusip | cik | company_name | format
---------------------------------------------+-----------+---------+----------------------+--------
edgar/data/70858/0000903423-09-000136.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/1348259/0000947871-08-000163.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/1348259/0001297402-08-000008.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/65100/0000947871-08-000163.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/1348259/0000898286-08-000043.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/1348259/0000903423-09-000136.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/1348259/0000903423-10-000083.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/1348259/0000903423-11-000086.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/1348259/0000903423-12-000092.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/70858/0000903423-10-000083.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/70858/0000903423-11-000086.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/70858/0000903423-12-000092.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/898286/0001297402-08-000008.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
edgar/data/898286/0000898286-08-000043.txt | BMG9319H1 | 1348259 | VALIDUS HOLDINGS LTD | C
(14 rows)
crsp=# SELECT * FROM edgar.filings
crsp-# WHERE file_name = '
crsp'# crsp=#
crsp'# ;
crsp'# ';
company_name | form_type | cik | date_filed | file_name
--------------+-----------+-----+------------+-----------
(0 rows)
crsp=# SELECT * FROM edgar.filings
WHERE file_name = 'edgar/data/1348259/0000903423-09-000136.txt';
company_name | form_type | cik | date_filed | file_name
----------------------+-----------+---------+------------+---------------------------------------------
VALIDUS HOLDINGS LTD | SC 13G/A | 1348259 | 2009-02-18 | edgar/data/1348259/0000903423-09-000136.txt
(1 row)
crsp=# SELECT * FROM edgar.cusip_cik
WHERE file_name = 'edgar/data/1348259/0000903423-09-000136.txt';
file_name | cusip | check_digit | cik | company_name | formats
---------------------------------------------+-----------+-------------+---------+----------------------+---------
edgar/data/1348259/0000903423-09-000136.txt | 9319H1025 | 1 | 1348259 | VALIDUS HOLDINGS LTD | AB
(1 row)
Here's the filing at the end . So this is an interesting case where a filer has put in a cusip with 12 digits! The old table seems to have caught the first 9 digts, and the new table the last 9.
@iangow Here's an example for the first row of the 14 cases from old_has_new which are not in new_has_old, for the pair cik = 65297, cusip = 05873J101.
crsp=# SELECT * FROM edgar.cusip_cik
crsp-# WHERE cusip = '05873J101';
file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)
crsp=# SELECT * FROM edgar.cusip_cik_old
WHERE cusip = '05873J101';
file_name | cusip | cik | company_name | format
--------------------------------------------+-----------+-------+-----------------------+--------
edgar/data/65297/0000315066-94-001914.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/315066/0000315066-94-001509.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/65297/0000315066-94-001509.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/315066/0000315066-94-001914.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/65297/0000315066-95-002519.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/65297/0000315066-95-001152.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/65297/0000315066-95-002396.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/65297/0000315066-95-002399.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/315066/0000315066-95-002399.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/315066/0000315066-95-001152.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/315066/0000315066-95-002396.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/315066/0000315066-96-000094.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/65297/0000315066-96-000094.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/65297/0000315066-96-001295.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/65297/0000315066-96-001305.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/315066/0000315066-95-002519.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/315066/0000315066-96-001295.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
edgar/data/315066/0000315066-96-001305.txt | 05873J101 | 65297 | BALLYS GRAND INC /DE/ | C
(18 rows)
crsp=# SELECT * FROM edgar.cusip_cik
WHERE file_name = 'edgar/data/65297/0000315066-95-002399.txt';
file_name | cusip | check_digit | cik | company_name | formats
-------------------------------------------+-------+-------------+-------+-----------------------+---------
edgar/data/65297/0000315066-95-002399.txt | | | 65297 | BALLYS GRAND INC /DE/ |
(1 row)
For this case, my python code seems to have missed it because the relevant line has Cusip
instead of CUSIP
, since cusip_hdr
in my code is
cusip_hdr = r'CUSIP\s+(?:No\.|NO\.|#|Number|NUMBER):?'
However, cusip_hdr
is not that different in the original Perl code
# Regular expressions
$cusip_hdr = 'CUSIP\s+(?:No\.|#|Number):?';
Is string matching via regular expressions case insensitive in Perl? Or have I missed a line where the text was converted to upper or lower case?
@iangow Here's another example, this time for cik = 831532, cusip = 450707101.
crsp=# SELECT * FROM edgar.cusip_cik
WHERE cusip = '450707101';
file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)
crsp=# SELECT * FROM edgar.cusip_cik
WHERE cusip ~ '450707101';
file_name | cusip | check_digit | cik | company_name | formats
-----------+-------+-------------+-----+--------------+---------
(0 rows)
crsp=# SELECT * FROM edgar.cusip_cik_old
WHERE cusip = '450707101';
file_name | cusip | cik | company_name | format
---------------------------------------------+-----------+--------+-------------------------+--------
edgar/data/831532/0000950124-97-005515.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/831532/0000950124-97-006027.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/831532/0000950124-97-006224.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/831532/0000950124-97-006313.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-97-004799.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/831532/0000950124-97-004799.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/831532/0000950124-99-004018.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/831532/0000950124-97-002109.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-97-002109.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-97-005515.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-97-006027.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-97-006224.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-97-006313.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/831532/0000950124-98-007145.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/831532/0000950124-98-002563.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/831532/0000950124-98-003097.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-98-002563.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-98-003097.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-99-004018.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
edgar/data/1037148/0000950124-98-007145.txt | 450707101 | 831532 | ATRIX INTERNATIONAL INC | C
(20 rows)
crsp=# SELECT * FROM edgar.cusip_cik
WHERE file_name = 'edgar/data/1037148/0000950124-97-006313.txt';
file_name | cusip | check_digit | cik | company_name | formats
---------------------------------------------+-----------+-------------+--------+-------------------------+---------
edgar/data/1037148/0000950124-97-006313.txt | 04962P102 | 2 | 831532 | ATRIX INTERNATIONAL INC | A
edgar/data/1037148/0000950124-97-006313.txt | 45070710 | 4 | 831532 | ATRIX INTERNATIONAL INC | C
(2 rows)
So this example seems to have two cusips, with the latter one being the one of interest with regards to the matching. The latter cusip appears at the start of the cover page and it seems to be badly formatted, with the issue number (the 7th and 8th digit, 10) being duplicated. Hence, the '1' at the end of the old cusip. The new table contains the first 8 digits of the latter cusip, and with the check_digit column, it can be inferred that the proper candidate is 450707104
.
Both the cusips found for this filing appear in stocknames
, so as an aside it's another example of multiple cusips with one of them not being the proper cusip for the stock in question
> stocknames %>% filter(ncusip == "45070710")
# Source: lazy query [?? x 16]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<int> <int> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 79706 12460 1993-10-06 1998-06-16 68272W10 45070710 IVIP I V I PUBLISHING INC 3 3 7370 11 NA 1993-10-29 2000-09-29 2
> stocknames %>% filter(ncusip == "04962P10")
# Source: lazy query [?? x 16]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<int> <int> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 76883 10884 1991-04-10 1997-12-15 04962P30 04962P10 ATXI ATRIX INTERNATIONAL INC 3 3 3420 11 NA 1991-04-30 1999-06-30 2
I then looked for the cusip BMG9319H1 as an example
This one appears to have come from an ISIN
Futhermore, the ISIN is derived from the cusip, with the first two digits standing for the country, the next 9 being the actual cusip, then the last being an additional check digit. Hence I did
> stocknames %>% filter(cusip == "G9319H10")
# Source: lazy query [?? x 16]
# Database: postgres 11.0.5 [bdcallen@/var/run/postgresql:5432/crsp]
permno permco namedt nameenddt cusip ncusip ticker comnam hexcd exchcd siccd shrcd shrcls st_date end_date namedum
<int> <int> <date> <date> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <date> <date> <dbl>
1 92191 52537 2007-07-25 2018-07-17 G9319H10 G9319H10 VR VALIDUS HOLDINGS LTD 1 1 6331 12 NA 2007-07-31 2018-07-31 2
So it appears the correct cusip for this case is in fact G9319H102
. In fact this cusip appears under the appropriate cik (1348259) in cusip_cik
crsp=# SELECT COUNT(*) FROM edgar.cusip_cik
WHERE cusip = 'G9319H102'
crsp-# AND cik = 1348259;
count
-------
74
(1 row)
OK. Here's a summary of what I think you have discovered. Let me know if I missed anything. It may make sense to make a new issue for each of the following and to implement them in some order (e.g., "easy to do and important to do" would come first); see below for some more ideas.
Cusip
instead of CUSIP
. I would just add Cusip
back as an explicit alternative. Perl regular expressions are (of course) case-sensitive, but there may be a point in the code where I turned this off. (In fact, I think the i
flag at the end of ($lines =~ /($cusip_fmt)\s+(?:[_-]{9,})?\s*\(CUSIP Number\)/si)
has this effect. It's been a while, so I don't remember. We may want to be more judicious than simply eliminating case-sensitivity, though there's no doubt that CuSiP NUmbER
can only refer to a CUSIP!)[a-z]
. I think we should do what you suggest and look for those, but I think we should convert them to upper case before storing them if that's what CUSIPs should be.edgar.filings
. These surprise me, as I thought entries were not deleted from the underlying data. In this case, we have nothing in edgar.filings
even though the filing is still there. I think there's not much to be done about these for this issue. (It may warrant separate investigation just to understand the scale of the issue, however.)I think it may be helpful to have a table that contains a sample of filings for testing. This need not be a random sample (e.g., you could oversample cases where we seem to have issues), but it might be good to have some randomly selected filings in there. I think 1000 or 10000 fillings would probably be enough. The idea would be that you could change the code, then run it on the test sample and compare with the results of the prior run of code before committing the code.
My assumption is that when we change the code we really need to go run it on the whole set of filings again so that we have some degree of consistency. My assumption is that doing so is expensive, so we want to accumulate several changes to the code before we go do that.
@bdcallen If we have captured all the issues raised here elsewhere (even if just by making a new issue), then we should close this one. If I have correctly summarized the above here and we have issues for each item I raised there, then I think this can be closed. (I'll close this assume that this is true.)
@bdcallen As discussed in #63, we really should've nailed down a rough replication of the existing technology before going into other issues.
My concern is that we have code that doesn't work as well as the old code along certain dimensions and that code might be costly (in terms of time) to fix as it's got add bells and whistles now.
Below I have put some data on CUSIP-CIK matches found in the old table, but lost in the new table. I think it would be good to see if you can quickly fix the new code to recover these (assuming they are good). You should be able to focus on a handful of them and then zero in on the specific filings where the old code recovered them, etc.
Please put this at a high priority, as I have a number of things that rely on this CUSIP-CIK code and things are kind of in limbo with that right now.
Created on 2020-04-22 by the reprex package (v0.3.0)