mccgr / edgar

Code to manage data related to SEC EDGAR
31 stars 15 forks source link

Compare cusip_cik new and old #63

Closed iangow closed 4 years ago

iangow commented 4 years ago

Issue below is caused by CUSIP formatted as CUSIP NO. 5021600-10-4.

Not sure how common this is, but bad CUSIPs for this CIK seem common. Correct CUSIP is 502160104. Seems that check_digit might help flag these issues. One solution might be to check for this format explicitly.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik")
cusip_cik_old <- tbl(pg, "cusip_cik_old")

cusip_cik_old %>%
  filter(!is.na(cusip)) %>%
  anti_join(cusip_cik, by = c("file_name", "cusip", "cik")) %>%
  filter(file_name=="edgar/data/60714/0000060714-95-000017.txt")
#> # Source:   lazy query [?? x 5]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   file_name                          cusip      cik company_name     format
#>   <chr>                              <chr>    <int> <chr>            <chr> 
#> 1 edgar/data/60714/0000060714-95-00… 50216001 60714 LSB INDUSTRIES … C

cusip_cik %>%
  filter(file_name=="edgar/data/60714/0000060714-95-000017.txt")
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   file_name                cusip   check_digit   cik company_name   formats
#>   <chr>                    <chr>         <int> <int> <chr>          <chr>  
#> 1 edgar/data/60714/000006… 021600…           1 60714 LSB INDUSTRIE… A

stocknames <- tbl(pg, sql("SELECT * FROM crsp.stocknames"))

stocknames %>% 
  filter(ncusip == '50216010') %>% 
  select(ncusip, comnam) %>% 
  distinct()
#> # Source:   lazy query [?? x 2]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   ncusip   comnam              
#>   <chr>    <chr>               
#> 1 50216010 L S B INDUSTRIES INC

Created on 2019-12-05 by the reprex package (v0.3.0)

bdcallen commented 4 years ago

@iangow If you look at the actual filing, the cusip is literally given as 5021600-10-4, which is 10 digits long. Comparing with the correct cusip 502160104, someone has added an extra, erroneous 0 to the end of the cusip6 part. Having looked at many SC 13D's and SC 13G's, typos like this are quite common.

bdcallen commented 4 years ago

Seems that check_digit might help flag these issues.

Yes, that's indeed why I included this column, to find erroneous 9 digit cusips, as well as provide candidates for cusips with only 8 characters.

iangow commented 4 years ago

@bdcallen

The way we should've handled the task of converting the CUSIP-CIK code from Perl to Python is as follows:

  1. To the extent possible (and easy), replicate the Perl code in Python.
  2. Run the code and compare results (i.e., CUSIPs in one but not the other, etc.).
  3. Fix easy-to-fix issues in Python code, but not Perl code. (No action needed for issues in Perl code, but not Python code.)
  4. Identify and quantify difficult-to-fix issues in Python code, but not Perl code.
  5. Given above arrive at an acceptable baseline level of performance in Python code
  6. Start improving the Python code.

I think instead we sort of started (1), then jumped to (6). And even here we have incomplete code (e.g., we have no documentation, we still have the old code, and it's not obvious how to run the new code).

We keep opening loops without closing old ones and turning small tasks (1-5 above) into big ones. I have started trying to organize the issues associated with this larger task into a project. I think we should come up with a plan to finish this project.

iangow commented 4 years ago
library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())

rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik")
cusip_cik_old <- tbl(pg, "cusip_cik_old")

old_matches_lost <-
  cusip_cik_old %>%
  filter(nchar(cusip) %in% c(8L, 9L)) %>%
  count(cik, cusip) %>%
  filter(n > 10) %>%
  ungroup() %>%
  anti_join(cusip_cik, by = c("cusip", "cik")) %>%
  compute()

old_matches_lost %>%
  arrange(desc(n)) %>%
  print(n=30)
#> # Source:     table<dbplyr_004> [?? x 3]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#>        cik cusip     n      
#>      <int> <chr>     <int64>
#>  1  860510 29874m103 84     
#>  2 1031235 81631y102 78     
#>  3 1001258 046222010 66     
#>  4   14693 01156371  61     
#>  5  737243 004907101 60     
#>  6  789933 629579200 60     
#>  7  889430 09247k109 60     
#>  8  914478 79604v105 56     
#>  9 1267482 81663N20  55     
#> 10  913756 068750V10 48     
#> 11  880460 71376c100 48     
#> 12   60128 54141010  46     
#> 13 1172494 000002255 45     
#> 14 1026734 261384101 44     
#> 15  933955 87305U10  40     
#> 16  800083 83568g104 39     
#> 17 1572236 58436r201 34     
#> 18   60714 50216001  34     
#> 19   89140 817732321 34     
#> 20  728478 22051J308 34     
#> 21  917126 747456P1  32     
#> 22  749254 876263610 30     
#> 23  874761 001300H10 30     
#> 24  912240 36127110  30     
#> 25 1334933 000133493 28     
#> 26 1008597 000727025 28     
#> 27 1063167 000106316 28     
#> 28  869570 78462k102 28     
#> 29  740663 December3 28     
#> 30 1057083 69325q105 27     
#> # … with more rows

old_matches_lost %>%
  count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 256

new_matches_gained <-
  cusip_cik %>%
  filter(nchar(cusip) %in% c(8L, 9L)) %>%
  count(cik, cusip) %>%
  filter(n > 10) %>%
  ungroup() %>%
  anti_join(cusip_cik_old, by = c("cusip", "cik")) %>%
  compute()

new_matches_gained %>%
  arrange(desc(n)) %>%
  print(n=30)
#> # Source:     table<dbplyr_008> [?? x 3]
#> # Database:   postgres [igow@10.101.13.99:5432/crsp]
#> # Ordered by: desc(n)
#>        cik cusip     n      
#>      <int> <chr>     <int64>
#>  1 1209466 929903102 324    
#>  2 1100663 929903102 184    
#>  3 1064642 929903102 114    
#>  4  789933 62957920   90    
#>  5  768710 75952R100  88    
#>  6 1350487 929903102  88    
#>  7  789933 295792002  88    
#>  8  811419 02355926   86    
#>  9   52848 929903102  84    
#> 10  770034 074223065  78    
#> 11 1001258 462220109  64    
#> 12 1034592 68212U104  62    
#> 13 1475115 29975E109  60    
#> 14 1439397 66702Q203  60    
#> 15 1085601 94856P102  59    
#> 16  790272 63935Q100  58    
#> 17  879354 29425910G  52    
#> 18  749254 762636105  52    
#> 19  930667 929903102  50    
#> 20 1172494 000022551  48    
#> 21  723574 45244M102  48    
#> 22  908837 62874M104  47    
#> 23 1064641 929903102  46    
#> 24  815097 14365830   46    
#> 25   74273 680293305  44    
#> 26 1125259 14365830   44    
#> 27 1378872 929903102  42    
#> 28 1585608 47010C300  42    
#> 29  908837 45865E108  42    
#> 30 1520358 57660106   40    
#> # … with more rows

new_matches_gained %>%
  count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 603

Created on 2020-04-12 by the reprex package (v0.3.0)

iangow commented 4 years ago

Closing in favour of #76.