mccgr / edgar

Code to manage data related to SEC EDGAR

Capture CUSIPs preceded by Cusip #83

Closed: iangow closed this issue 3 years ago

iangow commented 4 years ago

Discussed as item 1 in the list here.

  1. We've lost cases with Cusip instead of CUSIP. I would just add Cusip back as an explicit alternative. Perl regular expressions are (of course) case-sensitive by default, but there may be a point in the code where I turned this off. (In fact, I think the i flag at the end of ($lines =~ /($cusip_fmt)\s+(?:[_-]{9,})?\s*\(CUSIP Number\)/si) has this effect. It's been a while, so I don't remember for certain. We may want to be more judicious than simply eliminating case-sensitivity, though there's no doubt that CuSiP NUmbER can only refer to a CUSIP!)
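To make the trade-off concrete, here is a minimal Python sketch comparing the two options: adding Cusip as an explicit alternative versus switching off case-sensitivity (Python's `re.IGNORECASE` plays the role of Perl's `i` flag). The pattern and sample strings are simplified stand-ins invented for illustration, not the project's actual regex or real filing text.

```python
import re

# Simplified stand-in for the CUSIP body: exactly 9 alphanumerics (assumption).
CUSIP_FMT = r"([0-9A-Z]{9})"

# Option 1: stay case-sensitive and list "Cusip" as an explicit alternative.
explicit = re.compile(CUSIP_FMT + r"\s+\((?:CUSIP|Cusip) Number\)")

# Option 2: drop case-sensitivity for the whole pattern.
insensitive = re.compile(CUSIP_FMT + r"\s+\(CUSIP Number\)", re.IGNORECASE)

# Made-up sample lines covering three capitalisations of the label.
samples = [
    "037833100 (CUSIP Number)",
    "037833100 (Cusip Number)",
    "037833100 (CuSiP NUmbER)",
]

explicit_hits = [bool(explicit.search(s)) for s in samples]
insensitive_hits = [bool(insensitive.search(s)) for s in samples]

print(explicit_hits)     # only the listed spellings match
print(insensitive_hits)  # every capitalisation matches
```

Note that in option 2 the flag applies to the whole pattern, so the character class for the CUSIP body would also accept lower-case letters; that is the kind of side effect the "more judicious" remark is about.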
bdcallen commented 3 years ago

@iangow Here's the current code for the function extract_cusips:

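import re
import pandas as pd

# Note: calculate_cusip_check_digit is defined elsewhere in the repository.

def extract_cusips(text):

    # Header preceding a CUSIP, e.g. "CUSIP NO.", "CUSIP #", "CUSIP NUMBER:"
    cusip_hdr = r'CUSIP\s+(?:NO\.|#|NUMBER)[:]?'
    # 6 to 12 alphanumeric characters, each optionally followed by spaces or hyphens
    cusip_fmt = r'((?:[0-9A-Z]{1}[ -]{0,3}){6,12})'

    # Raw strings throughout, so that \s is not treated as an invalid string escape
    regex_dict = {'A': cusip_fmt + r'[\s\r\t\n]*[_\.-]?\s*(?:[_\.-]{9,})?[\s\r\t\n]*' +  \
    r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)\s*\n',
                  'B': cusip_fmt + r'[\s\t\r]*[\n]?' + r'[\s\t\r]*' +  \
    r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)\s*\n',
                  'C': r'[\s_]+' + cusip_hdr + r'[ _]{0,50}' + cusip_fmt + r'\s+',
                  'D': r'[\s_]+' + cusip_hdr + r'(?:\n[\s_]{0,50}){1,2}' + cusip_fmt + r'\s+'
                 }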

    df_list = []

    for key, regex in regex_dict.items():

        matches = re.findall(regex, text.upper())

        cusips = [re.sub('[^0-9A-Z]', '', match) for match in matches if len(match) > 0]
        check_digits = [calculate_cusip_check_digit(cusip) for cusip in cusips]

        if(len(cusips)):
            df = pd.DataFrame({'cusip': cusips, 'check_digit': check_digits})
            df['format'] = key
            df = df[["cusip", "check_digit", "format"]]

        else:
            df = pd.DataFrame({"cusip": [], "check_digit": [], "format": []})

        df_list.append(df)

    full_df = pd.concat(df_list)

    if(full_df.shape[0]):

        formats = full_df.groupby('cusip').apply(lambda x: ''.join(x['format'].unique().tolist()))
        full_df['formats'] = full_df['cusip'].apply(lambda x: formats[x])
        full_df = full_df[["cusip", "check_digit", "formats"]]
        full_df = full_df.drop_duplicates().reset_index(drop = True)

        return(full_df)

    else:

        full_df = pd.DataFrame({"cusip": [None], "check_digit": [None], "formats": [None]})

    return(full_df)
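
As a quick check of how the text.upper() call in the loop interacts with the patterns, the sketch below applies pattern C from the function to a made-up fragment with a mixed-case header. The fragment is invented for illustration; the patterns are copied from the function, written as raw strings.

```python
import re

# Header and body patterns as in extract_cusips (pattern 'C').
cusip_hdr = r'CUSIP\s+(?:NO\.|#|NUMBER)[:]?'
cusip_fmt = r'((?:[0-9A-Z]{1}[ -]{0,3}){6,12})'
pattern_c = r'[\s_]+' + cusip_hdr + r'[ _]{0,50}' + cusip_fmt + r'\s+'

# Made-up fragment with a "Cusip" header (illustrative, not real filing text).
text = "\n Cusip No. 037833 10 0 \n"

# Without upper-casing, the literal "CUSIP" in the header does not match "Cusip".
raw_matches = re.findall(pattern_c, text)

# With upper-casing, as in the loop above, the header matches.
upper_matches = re.findall(pattern_c, text.upper())

# Same clean-up as in the function: strip everything except digits and capitals.
cleaned = [re.sub('[^0-9A-Z]', '', m) for m in upper_matches]

print(raw_matches)
print(cleaned)
```

Upper-casing the whole text has the same effect on the header as a case-insensitive flag, with the side effect that any lower-case letters in the candidate CUSIP body are also upper-cased before matching.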

The commit that updated this function can be found here. It was made on April 16th, after we first discussed #76, so I have not yet run the code with this change. As you can see above, it converts the text to upper case before matching, which makes the regex search effectively case-insensitive. I would argue that this is sufficient to address this issue, as well as #80, for these reasons:

(1) If we extract some bad cusips as a result, we have ways to eliminate them (i.e., keeping only valid 9-character cusips and dropping cusips that end in a letter), which we used in making cusip_cik_test.

(2) Cusip numbers tend to appear in certain specific parts of SC 13D and SC 13G. They typically appear in:

From what I've seen, it is rare for cusips to appear in a paragraph of text, for instance. Even if we pick up such cases and they yield bad cusips, we can still eliminate them using the filters from (1).
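
The clean-up filters mentioned in (1) can be sketched roughly as follows. The repository's own calculate_cusip_check_digit is not shown in this thread, so cusip_check_digit and is_plausible_cusip below are hypothetical stand-ins. They assume the standard CUSIP check-digit scheme (letters map to 10 to 35, every second character is doubled, digits of each value are summed, and the check digit is the tens complement of the total) and handle only digits and capital letters, which is all that extract_cusips retains.

```python
ALNUM = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def cusip_check_digit(first8):
    """Hypothetical stand-in: check digit for the first 8 CUSIP characters."""
    total = 0
    for i, ch in enumerate(first8):
        value = ALNUM.index(ch)      # '0'-'9' -> 0-9, 'A'-'Z' -> 10-35
        if i % 2 == 1:               # 2nd, 4th, 6th, 8th characters are doubled
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10


def is_plausible_cusip(cusip):
    """Keep 9-character candidates ending in a digit equal to the check digit."""
    if len(cusip) != 9 or any(ch not in ALNUM for ch in cusip):
        return False
    if cusip[-1] not in "0123456789":  # drop cusips that end in a letter
        return False
    return int(cusip[-1]) == cusip_check_digit(cusip[:8])


# Illustrative candidates: two well-formed values, a wrong check digit,
# a trailing letter, and a truncated 7-character string.
candidates = ["037833100", "594918104", "037833101", "03783310A", "0378331"]
kept = [c for c in candidates if is_plausible_cusip(c)]
print(kept)
```

A filter like this runs after extraction, so a looser, case-insensitive search can afford to over-match: spurious candidates are unlikely to have both the right length and a matching check digit.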

If you're happy with my reasoning here, I think we can close this, and #80. Also, if you're happy for me to make a new edgar.cusip_cik using the current code, I can do so (though perhaps see what I'm about to write for #93 first).