Address case sensitivity of Cusip matches for edgar.cusip_cik

bdcallen commented 4 years ago

@iangow

We've lost cases with Cusip instead of CUSIP. I would just add Cusip back as an explicit alternative. Perl regular expressions are (of course) case-sensitive, but there may be a point in the code where I turned this off. (In fact, I think the i flag at the end of ($lines =~ /($cusip_fmt)\s+(?:[_-]{9,})?\s*$CUSIP Number$/si) has this effect. It's been a while, so I don't remember. We may want to be more judicious than simply eliminating case-sensitivity, though there's no doubt that CuSiP NUmbER can only refer to a CUSIP!)

We've lost cases where the CUSIP included characters in [a-z]. I think we should do what you suggest and look for those, but I think we should convert them to upper case before storing them if that's what CUSIPs should be.

This issue is in reference to these points in #76

bdcallen commented 4 years ago

@iangow I've just found a separate issue out of this one. Have a look at this filing. One of the relevant parts of the text to find the cusip is this

"0048l6-l0-4           \n                 ______________________________\n                         (CUSIP Number)\n\n\n"

What most would have read as a 1 is actually a lower case L (both occurances)! After changing get_cusip_cik to

def get_cusip_cik(file_name):

    try:

        url = get_filing_txt_url(file_name)
        page = requests.get(url)
        # Following three lines omit source code for added files, pdfs, gifs, etc...
        page_end = re.search(b'</DOCUMENT>', page.content).end() 
        content = page.content[:page_end] + b'\n</SEC-DOCUMENT>'
        soup = BeautifulSoup(content, 'html.parser')

        if(exceeded_sec_request_limit(soup)):

            raise ConnectionRefusedError("Hit SEC's too many requests page. Abort")

        cik, company_name = get_subject_cik_company_name(file_name, soup)

        text = soup.getText()

        cusip_hdr = r'CUSIP\s+(?:|NO\.|#|NUMBER)[:]?'
        cusip_fmt = '((?:[0-9A-Z]{1}[ -]{0,3}){6,9})'

        regex_dict = {'A': cusip_fmt + r'[\s\r\t\n]*[_\.-]?\s*(?:[_\.-]{9,})?[\s\r\t\n]*' +  \
        r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)',
                      'B': cusip_fmt + '[\s\t\r]*[\n]?' + r'[\s\t\r]*' +  \
        r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)',
                      'C': '[\s_]+' + cusip_hdr + '[ _]{0,50}' + cusip_fmt + '\s+',
                      'D': '[\s_]+' + cusip_hdr + '(?:\n[\s_]{0,50}){1,2}' + cusip_fmt + '\s+'
                     }

        df_list = []

        for key, regex in regex_dict.items():

            matches = re.findall(regex, text.upper())

            cusips = [re.sub('[^0-9A-Z]', '', match) for match in matches if len(match) > 0]
            check_digits = [calculate_cusip_check_digit(cusip) for cusip in cusips]

            if(len(cusips)):
                df = pd.DataFrame({'cusip': cusips, 'check_digit': check_digits})
                df['format'] = key
                df['file_name'] = file_name
                df['cik'] = cik
                df['company_name'] = company_name
                df = df[["file_name", "cusip", "cik", "check_digit", "company_name", "format"]]

            else:
                df = pd.DataFrame({"file_name": [], "cusip": [], "cik": [], "check_digit": [], \
                                   "company_name": [], "format": []})

            df_list.append(df)

        full_df = pd.concat(df_list)

        if(full_df.shape[0]):

            formats = full_df.groupby('cusip').apply(lambda x: ''.join(x['format'].unique().tolist()))

            full_df['formats'] = full_df['cusip'].apply(lambda x: formats[x])

            full_df = full_df[['file_name', 'cusip', 'check_digit', 'cik', 'company_name', 'formats']]

            full_df = full_df.drop_duplicates().reset_index(drop = True)

            full_df['cik'] = full_df['cik'].astype(np.int64)
            full_df['check_digit'] = full_df['check_digit'].astype(np.int64)

            return(full_df)

        else:

            full_df = pd.DataFrame({"file_name": [file_name], "cusip": [None], "check_digit": [None], \
                                    "cik": cik, "company_name": company_name, "formats": [None]})

        return(full_df)

    except ConnectionRefusedError:

        raise

    except:

        return(None)

this was confirmed when I computed get_cusip_cik on the filing

file_name | cusip | check_digit | cik | company_name | formats
-- | -- | -- | -- | -- | --
edgar/data/1014360/0001014360-96-000002.txt | 0048L6L04 | 0 | 2098 | ACME UNITED CORP | AC

However, a check from stocknames confirms that the L should actually be the number 1

crsp=# SELECT * FROM crsp.stocknames
WHERE ncusip = '00481610'
;
 permno | permco |   namedt   | nameenddt  |  cusip   |  ncusip  | ticker |      comnam      | hexcd | exchcd | siccd | shrcd | shrcls |  st_date   |  end_date  | namedum
--------+--------+------------+------------+----------+----------+--------+------------------+-------+--------+-------+-------+--------+------------+------------+---------
  60038 |    370 | 1972-12-14 | 1977-10-02 | 00481610 | 00481610 |        | ACME UNITED CORP |     2 |      3 |     0 |    11 |        | 1972-12-29 | 2019-12-31 |       2
  60038 |    370 | 1977-10-03 | 2019-12-31 | 00481610 | 00481610 | ACU    | ACME UNITED CORP |     2 |      2 |  3421 |    11 |        | 1972-12-29 | 2019-12-31 |       2
(2 rows)
crsp=# SELECT * FROM crsp.stocknames
WHERE ncusip = '0048L6L0'
;
 permno | permco | namedt | nameenddt | cusip | ncusip | ticker | comnam | hexcd | exchcd | siccd | shrcd | shrcls | st_date | end_date | namedum
--------+--------+--------+-----------+-------+--------+--------+--------+-------+--------+-------+-------+--------+---------+----------+---------
(0 rows)

crsp=# SELECT * FROM crsp.stocknames
WHERE ncusip = '0048L610'
;
 permno | permco | namedt | nameenddt | cusip | ncusip | ticker | comnam | hexcd | exchcd | siccd | shrcd | shrcls | st_date | end_date | namedum
--------+--------+--------+-----------+-------+--------+--------+--------+-------+--------+-------+-------+--------+---------+----------+---------
(0 rows)

crsp=# SELECT * FROM crsp.stocknames
WHERE ncusip = '004816L0'
;
 permno | permco | namedt | nameenddt | cusip | ncusip | ticker | comnam | hexcd | exchcd | siccd | shrcd | shrcls | st_date | end_date | namedum
--------+--------+--------+-----------+-------+--------+--------+--------+-------+--------+-------+-------+--------+---------+----------+---------
(0 rows)

So we have a problem here where there is a lower case L in place of the number 1 in the cusip (I have found this to be a common problem elsewhere when making sc13dg_indexes).

bdcallen commented 4 years ago

@iangow I am going to make the above change to get_cusip_cik permanent, and have the searched text converted to upper case. This resolves the problem of having lower cases in the cusips in the database. Also, from my experience spending countless hours trying to scrape these forms other ways, I believe there is very little cost in looking for cusips from the text converted to upper case, as the cusips are almost always in specific parts of these forms (the word CUSIP very rarely appears in a slab of text in the item section or in a footnote, for instance).

iangow commented 4 years ago

@bdcallen Try to put the punchline at the top of the comment. I think the punchline of this comment is:

We have cases where a lower case L is used where the correct CUSIP contains a 1.

The rest is detail (e.g., examples, code).

For this particular issue, I don't think we should worry about it unless we have evidence that we would not be getting the correct CUSIP-CIK matches from other filings. I think correcting errors in filings is a "bridge too far" for what we're trying to do here.

iangow commented 4 years ago

@iangow I am going to make the above change to get_cusip_cik permanent, and have the searched text converted to upper case. This resolves the problem of having lower cases in the cusips in the database. Also, from my experience spending countless hours trying to scrape these forms other ways, I believe there is very little cost in looking for cusips from the text converted to upper case, as the cusips are almost always in specific parts of these forms (the word CUSIP very rarely appears in a slab of text in the item section or in a footnote, for instance).

OK. I think your code it too linear and procedural. I think there should be a function that takes text as an argument and returns a list of matched CUSIPs. (Perl does not support functions anywhere near as nicely as Python, so I wouldn't use the Perl code as a model here.) Smaller, more-focused functions will make for easier-to-read and easier-to-maintain code.

I wonder if it doesn't make sense to use a different regular expression to do a match for lower-case CUSIPs just so we keep track of cases where we are fixing the CUSIPs (by converting to upper-case). Using [a-z] in place of [A-Z] would work so long as there aren't mixed-case CUSIPs. (While I said 'I think correcting errors in filings is a "bridge too far" for what we're trying to do here', I think fixing the case is sufficiently trivial that we can do it.)

iangow commented 4 years ago

For example I tweaked your code above here to pull the bs4 part out of the main function. I also made the code more Pythonesque (return something not return(something), which is R).

iangow commented 4 years ago

We have no CUSIPs in the table with lower-case letters in them. Have we already converted and added these? (If so, we should close this issue.) Or do we need to run the code again to collect these?

library(dplyr, warn.conflicts = FALSE)
library(DBI)

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik_old <- tbl(pg, "cusip_cik_old")
cusip_cik <- tbl(pg, "cusip_cik")

incremental_matches <-
    cusip_cik_old %>% 
    filter(cusip %~% '[a-z]', nchar(cusip) == 9L) %>%
    select(file_name, cusip, cik) %>%
    left_join(cusip_cik, by=c("file_name", "cik")) 

incremental_matches
#> # Source:   lazy query [?? x 7]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>    file_name      cusip.x    cik cusip.y check_digit company_name   formats
#>    <chr>          <chr>    <int> <chr>         <int> <chr>          <chr>  
#>  1 edgar/data/13… m81865… 1.02e6 818651…           6 RADCOM LTD     A      
#>  2 edgar/data/10… Decemb… 1.06e6 <NA>             NA HERSHA HOSPIT… <NA>   
#>  3 edgar/data/10… 25470b… 1.31e6 <NA>             NA Discovery Ban… <NA>   
#>  4 edgar/data/70… Decemb… 7.04e5 <NA>             NA HUDSON UNITED… <NA>   
#>  5 edgar/data/91… 79604v… 9.14e5 <NA>             NA SAMSONITE COR… <NA>   
#>  6 edgar/data/88… 78387p… 8.80e5 <NA>             NA SBS TECHNOLOG… <NA>   
#>  7 edgar/data/10… 01877h… 9.13e5 <NA>             NA ALLIANCE SEMI… <NA>   
#>  8 edgar/data/85… Decemb… 8.60e5 811904…           1 SEACOR HOLDIN… CD     
#>  9 edgar/data/80… Decemb… 8.02e5 <NA>             NA SILICON GRAPH… <NA>   
#> 10 edgar/data/11… 21872p… 1.13e6 21872P…           1 CORGENTECH INC AB     
#> # … with more rows

incremental_matches %>%
    count()
#> # Source:   lazy query [?? x 1]
#> # Database: postgres [igow@10.101.13.99:5432/crsp]
#>   n      
#>   <int64>
#> 1 5932

^{Created on 2020-04-22 by the reprex package (v0.3.0)}

bdcallen commented 4 years ago

@iangow

We have no CUSIPs in the table with lower-case letters in them

The cases with cusips with lower-case letters in them appeared in the old table cusip_cik_old, not the table I made, which is the current cusip_cik. I have not rerun the code to collect these in a new rendition of cusip_cik yet (I could do this over the weekend).

iangow commented 4 years ago

Let's do things we can check without running the Python code first.

bdcallen commented 4 years ago

@iangow I think we can close this one, if you're satisfied with my last comment in #83.

mccgr / edgar

Address case sensitivity of Cusip matches for edgar.cusip_cik #80