Closed iangow closed 4 years ago
@iangow If you look at the actual filing, the cusip is literally given as 5021600-10-4
, which is 10 digits long. Comparing with the correct cusip 502160104
, someone has added an extra, erroneous 0 to the end of the cusip6 part. Having looked at many SC 13D's and SC 13G's, typos like this are quite common.
Seems that
check_digit
might help flag these issues.
Yes, that's indeed why I included this column, to find erroneous 9 digit cusips, as well as provide candidates for cusips with only 8 characters.
@bdcallen
The way we should've handled the task of converting the CUSIP-CIK code from Perl to Python is as follows:
I think instead we sort of started (1), then jumped to (6). And even here we have incomplete code (e.g., we have no documentation, we still have the old code, and it's not obvious how to run the new code).
We keep opening loops without closing old ones and turning small tasks (1-5 above) into big ones. I have started trying to organize the issues associated with this larger task into a project. I think we should come up with a plan to finish this project.
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
cusip_cik <- tbl(pg, "cusip_cik")
cusip_cik_old <- tbl(pg, "cusip_cik_old")
old_matches_lost <-
cusip_cik_old %>%
filter(nchar(cusip) %in% c(8L, 9L)) %>%
count(cik, cusip) %>%
filter(n > 10) %>%
ungroup() %>%
anti_join(cusip_cik, by = c("cusip", "cik")) %>%
compute()
old_matches_lost %>%
arrange(desc(n)) %>%
print(n=30)
#> # Source: table<dbplyr_004> [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#> cik cusip n
#> <int> <chr> <int64>
#> 1 860510 29874m103 84
#> 2 1031235 81631y102 78
#> 3 1001258 046222010 66
#> 4 14693 01156371 61
#> 5 737243 004907101 60
#> 6 789933 629579200 60
#> 7 889430 09247k109 60
#> 8 914478 79604v105 56
#> 9 1267482 81663N20 55
#> 10 913756 068750V10 48
#> 11 880460 71376c100 48
#> 12 60128 54141010 46
#> 13 1172494 000002255 45
#> 14 1026734 261384101 44
#> 15 933955 87305U10 40
#> 16 800083 83568g104 39
#> 17 1572236 58436r201 34
#> 18 60714 50216001 34
#> 19 89140 817732321 34
#> 20 728478 22051J308 34
#> 21 917126 747456P1 32
#> 22 749254 876263610 30
#> 23 874761 001300H10 30
#> 24 912240 36127110 30
#> 25 1334933 000133493 28
#> 26 1008597 000727025 28
#> 27 1063167 000106316 28
#> 28 869570 78462k102 28
#> 29 740663 December3 28
#> 30 1057083 69325q105 27
#> # … with more rows
old_matches_lost %>%
count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 256
new_matches_gained <-
cusip_cik %>%
filter(nchar(cusip) %in% c(8L, 9L)) %>%
count(cik, cusip) %>%
filter(n > 10) %>%
ungroup() %>%
anti_join(cusip_cik_old, by = c("cusip", "cik")) %>%
compute()
new_matches_gained %>%
arrange(desc(n)) %>%
print(n=30)
#> # Source: table<dbplyr_008> [?? x 3]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#> cik cusip n
#> <int> <chr> <int64>
#> 1 1209466 929903102 324
#> 2 1100663 929903102 184
#> 3 1064642 929903102 114
#> 4 789933 62957920 90
#> 5 768710 75952R100 88
#> 6 1350487 929903102 88
#> 7 789933 295792002 88
#> 8 811419 02355926 86
#> 9 52848 929903102 84
#> 10 770034 074223065 78
#> 11 1001258 462220109 64
#> 12 1034592 68212U104 62
#> 13 1475115 29975E109 60
#> 14 1439397 66702Q203 60
#> 15 1085601 94856P102 59
#> 16 790272 63935Q100 58
#> 17 879354 29425910G 52
#> 18 749254 762636105 52
#> 19 930667 929903102 50
#> 20 1172494 000022551 48
#> 21 723574 45244M102 48
#> 22 908837 62874M104 47
#> 23 1064641 929903102 46
#> 24 815097 14365830 46
#> 25 74273 680293305 44
#> 26 1125259 14365830 44
#> 27 1378872 929903102 42
#> 28 1585608 47010C300 42
#> 29 908837 45865E108 42
#> 30 1520358 57660106 40
#> # … with more rows
new_matches_gained %>%
count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 603
Created on 2020-04-12 by the reprex package (v0.3.0)
Closing in favour of #76.
Issue below is caused by CUSIP formatted as
CUSIP NO. 5021600-10-4
.Not sure how common this is, but bad CUSIPs for this CIK seem common. Correct CUSIP is
502160104
. Seems thatcheck_digit
might help flag these issues. One solution might be to check for this format explicitly.Created on 2019-12-05 by the reprex package (v0.3.0)