Closed bdcallen closed 4 years ago
@iangow I've just found a separate issue out of this one. Have a look at this filing. One of the relevant parts of the text to find the cusip is this
"0048l6-l0-4 \n ______________________________\n (CUSIP Number)\n\n\n"
What most would have read as a 1 is actually a lower-case L (both occurrences)! After changing get_cusip_cik to
def get_cusip_cik(file_name):
    try:
        url = get_filing_txt_url(file_name)
        page = requests.get(url)
        # Following three lines omit source code for added files, pdfs, gifs, etc...
        page_end = re.search(b'</DOCUMENT>', page.content).end()
        content = page.content[:page_end] + b'\n</SEC-DOCUMENT>'
        soup = BeautifulSoup(content, 'html.parser')
        if(exceeded_sec_request_limit(soup)):
            raise ConnectionRefusedError("Hit SEC's too many requests page. Abort")
        cik, company_name = get_subject_cik_company_name(file_name, soup)
        text = soup.getText()
        cusip_hdr = r'CUSIP\s+(?:|NO\.|#|NUMBER)[:]?'
        cusip_fmt = '((?:[0-9A-Z]{1}[ -]{0,3}){6,9})'
        regex_dict = {'A': cusip_fmt + r'[\s\r\t\n]*[_\.-]?\s*(?:[_\.-]{9,})?[\s\r\t\n]*' + \
                           r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)',
                      'B': cusip_fmt + '[\s\t\r]*[\n]?' + r'[\s\t\r]*' + \
                           r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)',
                      'C': '[\s_]+' + cusip_hdr + '[ _]{0,50}' + cusip_fmt + '\s+',
                      'D': '[\s_]+' + cusip_hdr + '(?:\n[\s_]{0,50}){1,2}' + cusip_fmt + '\s+'
                      }
        df_list = []
        for key, regex in regex_dict.items():
            matches = re.findall(regex, text.upper())
            cusips = [re.sub('[^0-9A-Z]', '', match) for match in matches if len(match) > 0]
            check_digits = [calculate_cusip_check_digit(cusip) for cusip in cusips]
            if(len(cusips)):
                df = pd.DataFrame({'cusip': cusips, 'check_digit': check_digits})
                df['format'] = key
                df['file_name'] = file_name
                df['cik'] = cik
                df['company_name'] = company_name
                df = df[["file_name", "cusip", "cik", "check_digit", "company_name", "format"]]
            else:
                df = pd.DataFrame({"file_name": [], "cusip": [], "cik": [], "check_digit": [], \
                                   "company_name": [], "format": []})
            df_list.append(df)
        full_df = pd.concat(df_list)
        if(full_df.shape[0]):
            formats = full_df.groupby('cusip').apply(lambda x: ''.join(x['format'].unique().tolist()))
            full_df['formats'] = full_df['cusip'].apply(lambda x: formats[x])
            full_df = full_df[['file_name', 'cusip', 'check_digit', 'cik', 'company_name', 'formats']]
            full_df = full_df.drop_duplicates().reset_index(drop = True)
            full_df['cik'] = full_df['cik'].astype(np.int64)
            full_df['check_digit'] = full_df['check_digit'].astype(np.int64)
            return(full_df)
        else:
            full_df = pd.DataFrame({"file_name": [file_name], "cusip": [None], "check_digit": [None], \
                                    "cik": cik, "company_name": company_name, "formats": [None]})
            return(full_df)
    except ConnectionRefusedError:
        raise
    except:
        return(None)
this was confirmed when I ran get_cusip_cik on the filing:
file_name | cusip | check_digit | cik | company_name | formats
-- | -- | -- | -- | -- | --
edgar/data/1014360/0001014360-96-000002.txt | 0048L6L04 | 0 | 2098 | ACME UNITED CORP | AC
However, a check against stocknames confirms that the L should actually be the number 1:
crsp=# SELECT * FROM crsp.stocknames
WHERE ncusip = '00481610'
;
permno | permco | namedt | nameenddt | cusip | ncusip | ticker | comnam | hexcd | exchcd | siccd | shrcd | shrcls | st_date | end_date | namedum
--------+--------+------------+------------+----------+----------+--------+------------------+-------+--------+-------+-------+--------+------------+------------+---------
60038 | 370 | 1972-12-14 | 1977-10-02 | 00481610 | 00481610 | | ACME UNITED CORP | 2 | 3 | 0 | 11 | | 1972-12-29 | 2019-12-31 | 2
60038 | 370 | 1977-10-03 | 2019-12-31 | 00481610 | 00481610 | ACU | ACME UNITED CORP | 2 | 2 | 3421 | 11 | | 1972-12-29 | 2019-12-31 | 2
(2 rows)
crsp=# SELECT * FROM crsp.stocknames
WHERE ncusip = '0048L6L0'
;
permno | permco | namedt | nameenddt | cusip | ncusip | ticker | comnam | hexcd | exchcd | siccd | shrcd | shrcls | st_date | end_date | namedum
--------+--------+--------+-----------+-------+--------+--------+--------+-------+--------+-------+-------+--------+---------+----------+---------
(0 rows)
crsp=# SELECT * FROM crsp.stocknames
WHERE ncusip = '0048L610'
;
permno | permco | namedt | nameenddt | cusip | ncusip | ticker | comnam | hexcd | exchcd | siccd | shrcd | shrcls | st_date | end_date | namedum
--------+--------+--------+-----------+-------+--------+--------+--------+-------+--------+-------+-------+--------+---------+----------+---------
(0 rows)
crsp=# SELECT * FROM crsp.stocknames
WHERE ncusip = '004816L0'
;
permno | permco | namedt | nameenddt | cusip | ncusip | ticker | comnam | hexcd | exchcd | siccd | shrcd | shrcls | st_date | end_date | namedum
--------+--------+--------+-----------+-------+--------+--------+--------+-------+--------+-------+-------+--------+---------+----------+---------
(0 rows)
So we have a problem here where there is a lower case L in place of the number 1 in the cusip (I have found this to be a common problem elsewhere when making sc13dg_indexes).
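The check-digit mismatch in the table above (computed 0, but the ninth character is 4) is what gives the bad CUSIP away. As a minimal sketch of that check, assuming the standard CUSIP modulus-10 "double-add-double" scheme: the names `cusip_check_digit` and `is_valid_cusip` are hypothetical stand-ins for illustration and are not the repository's `calculate_cusip_check_digit`.

```python
def cusip_check_digit(cusip8):
    """Return the check digit for the first eight characters of a CUSIP.

    Hypothetical helper: assumes the standard modulus-10 scheme in which
    letters map to 10..35 and every second character is doubled.
    """
    total = 0
    for position, char in enumerate(cusip8.upper()[:8], start=1):
        if char.isdigit():
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        elif char in "*@#":
            value = "*@#".index(char) + 36
        else:
            raise ValueError(f"invalid CUSIP character: {char!r}")
        if position % 2 == 0:
            value *= 2
        # Add the digits of the (possibly doubled) value.
        total += value // 10 + value % 10
    return (10 - total % 10) % 10


def is_valid_cusip(cusip9):
    """True if a nine-character CUSIP ends in its own check digit."""
    return (
        len(cusip9) == 9
        and cusip9[8].isdigit()
        and int(cusip9[8]) == cusip_check_digit(cusip9[:8])
    )


print(cusip_check_digit("00481610"))  # 4, matching the ninth character of 004816104
print(cusip_check_digit("0048L6L0"))  # 0, so 0048L6L04 fails validation
```

Under this scheme the scraped value 0048L6L04 fails validation, while 004816104 passes.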
@iangow I am going to make the above change to get_cusip_cik permanent, and have the searched text converted to upper case. This resolves the problem of having lower cases in the cusips in the database. Also, from my experience spending countless hours trying to scrape these forms other ways, I believe there is very little cost in looking for cusips from the text converted to upper case, as the cusips are almost always in specific parts of these forms (the word CUSIP very rarely appears in a slab of text in the item section or in a footnote, for instance).
@bdcallen Try to put the punchline at the top of the comment. I think the punchline of this comment is:
We have cases where a lower-case L is used where the correct CUSIP contains a 1.
The rest is detail (e.g., examples, code).
For this particular issue, I don't think we should worry about it unless we have evidence that we would not be getting the correct CUSIP-CIK matches from other filings. I think correcting errors in filings is a "bridge too far" for what we're trying to do here.
> @iangow I am going to make the above change to get_cusip_cik permanent, and have the searched text converted to upper case. This resolves the problem of having lower cases in the cusips in the database. Also, from my experience spending countless hours trying to scrape these forms other ways, I believe there is very little cost in looking for cusips from the text converted to upper case, as the cusips are almost always in specific parts of these forms (the word CUSIP very rarely appears in a slab of text in the item section or in a footnote, for instance).
OK. I think your code is too linear and procedural. I think there should be a function that takes text as an argument and returns a list of matched CUSIPs. (Perl does not support functions anywhere near as nicely as Python, so I wouldn't use the Perl code as a model here.) Smaller, more-focused functions will make for easier-to-read and easier-to-maintain code.
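A rough sketch of what such a function might look like, assuming a single simplified pattern modelled on format 'A' from the code above. The names `extract_cusips`, `CUSIP_FMT`, `CUSIP_LABEL`, and `PATTERN_A` are hypothetical, and the real code would need the other formats as well.

```python
import re

# Six to nine alphanumeric characters, each optionally followed by
# spaces or hyphens (taken from cusip_fmt in the code above).
CUSIP_FMT = r'((?:[0-9A-Z][ -]{0,3}){6,9})'
# The "(CUSIP Number)" label that follows the CUSIP in format 'A'.
CUSIP_LABEL = r'\(CUSIP\s+(?:NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES|NUMBER)\)'
# Simplified stand-in for format 'A': CUSIP, optional rule line, label.
PATTERN_A = CUSIP_FMT + r'\s*[_.-]*\s*' + CUSIP_LABEL


def extract_cusips(text, pattern=PATTERN_A):
    """Return the distinct CUSIPs matched in text, in order of appearance."""
    cusips = []
    for match in re.findall(pattern, text.upper()):
        cusip = re.sub(r'[^0-9A-Z]', '', match)  # drop spaces and hyphens
        if cusip and cusip not in cusips:
            cusips.append(cusip)
    return cusips


sample = "0048l6-l0-4 \n ______________________________\n (CUSIP Number)\n\n\n"
print(extract_cusips(sample))  # ['0048L6L04']
```

Keeping the text-to-CUSIPs step free of requests and BeautifulSoup means it can be tested on saved snippets of filings without hitting the SEC site.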
I wonder if it doesn't make sense to use a different regular expression to do a match for lower-case CUSIPs just so we keep track of cases where we are fixing the CUSIPs (by converting to upper-case). Using [a-z] in place of [A-Z] would work so long as there aren't mixed-case CUSIPs. (While I said 'I think correcting errors in filings is a "bridge too far" for what we're trying to do here', I think fixing the case is sufficiently trivial that we can do it.)
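One possible way to keep track of the fixed cases, sketched here as an alternative to a separate [a-z] pattern: match case-insensitively on the original text and record whether the raw match contained any lower-case letters. This also covers mixed-case CUSIPs. The name `extract_cusips_with_case_flag` and the simplified pattern are assumptions for illustration only.

```python
import re

# Simplified, case-insensitive stand-in for format 'A' above.
PATTERN = re.compile(
    r'((?:[0-9A-Z][ -]{0,3}){6,9})\s*[_.-]*\s*\(CUSIP\s+NUMBER\)',
    re.IGNORECASE,
)


def extract_cusips_with_case_flag(text):
    """Return (cusip, case_fixed) pairs for each match in text.

    case_fixed is True when the raw match contained lower-case letters,
    i.e. when upper-casing changed the CUSIP.
    """
    results = []
    for raw in PATTERN.findall(text):
        cleaned = re.sub(r'[^0-9A-Za-z]', '', raw)  # drop spaces and hyphens
        results.append((cleaned.upper(), cleaned != cleaned.upper()))
    return results


print(extract_cusips_with_case_flag("0048l6-l0-4 \n ___\n (CUSIP Number)"))
# [('0048L6L04', True)]
print(extract_cusips_with_case_flag("004816-10-4\n(CUSIP Number)"))
# [('004816104', False)]
```

The flag could be stored as an extra column so that case-corrected CUSIPs can be audited later.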
For example, I tweaked your code above here to pull the bs4 part out of the main function. I also made the code more Pythonesque (return something, not return(something), which is R).
We have no CUSIPs in the table with lower-case letters in them. Have we already converted and added these? (If so, we should close this issue.) Or do we need to run the code again to collect these?
library(dplyr, warn.conflicts = FALSE)
library(DBI)
Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
cusip_cik_old <- tbl(pg, "cusip_cik_old")
cusip_cik <- tbl(pg, "cusip_cik")
incremental_matches <-
  cusip_cik_old %>%
  filter(cusip %~% '[a-z]', nchar(cusip) == 9L) %>%
  select(file_name, cusip, cik) %>%
  left_join(cusip_cik, by=c("file_name", "cik"))
incremental_matches
#> # Source: lazy query [?? x 7]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> file_name cusip.x cik cusip.y check_digit company_name formats
#> <chr> <chr> <int> <chr> <int> <chr> <chr>
#> 1 edgar/data/13… m81865… 1.02e6 818651… 6 RADCOM LTD A
#> 2 edgar/data/10… Decemb… 1.06e6 <NA> NA HERSHA HOSPIT… <NA>
#> 3 edgar/data/10… 25470b… 1.31e6 <NA> NA Discovery Ban… <NA>
#> 4 edgar/data/70… Decemb… 7.04e5 <NA> NA HUDSON UNITED… <NA>
#> 5 edgar/data/91… 79604v… 9.14e5 <NA> NA SAMSONITE COR… <NA>
#> 6 edgar/data/88… 78387p… 8.80e5 <NA> NA SBS TECHNOLOG… <NA>
#> 7 edgar/data/10… 01877h… 9.13e5 <NA> NA ALLIANCE SEMI… <NA>
#> 8 edgar/data/85… Decemb… 8.60e5 811904… 1 SEACOR HOLDIN… CD
#> 9 edgar/data/80… Decemb… 8.02e5 <NA> NA SILICON GRAPH… <NA>
#> 10 edgar/data/11… 21872p… 1.13e6 21872P… 1 CORGENTECH INC AB
#> # … with more rows
incremental_matches %>%
  count()
#> # Source: lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#> n
#> <int64>
#> 1 5932
Created on 2020-04-22 by the reprex package (v0.3.0)
@iangow
> We have no CUSIPs in the table with lower-case letters in them

The cases with cusips with lower-case letters in them appeared in the old table cusip_cik_old, not the table I made, which is the current cusip_cik. I have not rerun the code to collect these in a new rendition of cusip_cik yet (I could do this over the weekend).
Let's do things we can check without running the Python code first.
@iangow I think we can close this one, if you're satisfied with my last comment in #83.
@iangow
This issue is in reference to these points in #76