Capture CUSIPs preceded by Cusip

@iangow Here's the current code for the function extract_cusips:

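import re
import pandas as pd

# calculate_cusip_check_digit is assumed to be defined elsewhere in the repo

def extract_cusips(text):

    # Header such as "CUSIP NO.", "CUSIP #" or "CUSIP NUMBER", with an optional colon
    cusip_hdr = r'CUSIP\s+(?:NO\.|#|NUMBER)[:]?'
    # 6 to 12 alphanumeric characters, each followed by up to 3 spaces or hyphens
    cusip_fmt = r'((?:[0-9A-Z]{1}[ -]{0,3}){6,12})'

    # A, B: CUSIP followed by a "(CUSIP NUMBER ...)" caption
    # C, D: CUSIP preceded by a header, on the same line (C) or one to two lines above (D)
    regex_dict = {'A': cusip_fmt + r'[\s\r\t\n]*[_\.-]?\s*(?:[_\.-]{9,})?[\s\r\t\n]*' +
                       r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)\s*\n',
                  'B': cusip_fmt + r'[\s\t\r]*[\n]?' + r'[\s\t\r]*' +
                       r'\(CUSIP\s+(?:NUMBER|NUMBER\s+OF\s+CLASS\s+OF\s+SECURITIES)\)\s*\n',
                  'C': r'[\s_]+' + cusip_hdr + r'[ _]{0,50}' + cusip_fmt + r'\s+',
                  'D': r'[\s_]+' + cusip_hdr + r'(?:\n[\s_]{0,50}){1,2}' + cusip_fmt + r'\s+'
                 }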

    df_list = []

    # Upper-case the text once so that every pattern matches regardless of case
    text_upper = text.upper()

    for key, regex in regex_dict.items():

        matches = re.findall(regex, text_upper)

        # Strip spaces and hyphens so that only the alphanumeric characters remain
        cusips = [re.sub('[^0-9A-Z]', '', match) for match in matches if len(match) > 0]
        check_digits = [calculate_cusip_check_digit(cusip) for cusip in cusips]

        if cusips:
            df = pd.DataFrame({'cusip': cusips, 'check_digit': check_digits})
            df['format'] = key
            df = df[["cusip", "check_digit", "format"]]

        else:
            df = pd.DataFrame({"cusip": [], "check_digit": [], "format": []})

        df_list.append(df)

    full_df = pd.concat(df_list)

    if full_df.shape[0]:

        # Record every format (A to D) under which each CUSIP was matched
        formats = full_df.groupby('cusip').apply(lambda x: ''.join(x['format'].unique().tolist()))
        full_df['formats'] = full_df['cusip'].apply(lambda x: formats[x])
        full_df = full_df[["cusip", "check_digit", "formats"]]
        full_df = full_df.drop_duplicates().reset_index(drop = True)

    else:

        # No matches at all: return a single row of missing values
        full_df = pd.DataFrame({"cusip": [None], "check_digit": [None], "formats": [None]})

    return full_df
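
As a quick check of the upper-casing step in the function above, here is a minimal sketch that runs format C against a mixed-case header. The sample text and the helper name find_cusips are made up for illustration; only the patterns are taken from extract_cusips.

```python
import re

# Patterns copied from format 'C' of extract_cusips, written as raw strings
cusip_hdr = r'CUSIP\s+(?:NO\.|#|NUMBER)[:]?'
cusip_fmt = r'((?:[0-9A-Z]{1}[ -]{0,3}){6,12})'
regex_c = r'[\s_]+' + cusip_hdr + r'[ _]{0,50}' + cusip_fmt + r'\s+'

# Hypothetical cover-page fragment with a mixed-case "Cusip No." header
sample = "Common Stock\n  Cusip No. 037833100  \n"

def find_cusips(text):
    """Return the cleaned CUSIP candidates that format C finds in text."""
    return [re.sub('[^0-9A-Z]', '', m) for m in re.findall(regex_c, text)]

print(find_cusips(sample))          # [] because the pattern expects "CUSIP" in upper case
print(find_cusips(sample.upper()))  # ['037833100']
```

Upper-casing the input, rather than passing re.IGNORECASE, keeps the character class [0-9A-Z] and the clean-up step consistent with each other.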

The commit that updated this function can be found here. It was made on April 16th, after we first discussed #76, so I have not yet run the code with this change. As you can see above, it converts the text to upper case before matching, which makes the regex search effectively case-insensitive. I would argue that this is sufficient to address this issue, as well as #80, for these reasons:

(1) If we extract some bad CUSIPs as a result, we have ways to eliminate them (i.e., keeping only valid 9-character CUSIPs and dropping CUSIPs that end in a letter), which we used in making cusip_cik_test.

(2) CUSIP numbers tend to appear in a few specific parts of SC 13D and SC 13G filings. They typically appear:

in the title section
at the head of the cover pages
in certain items of the item section (e.g., Item 2(e), I think, of SC 13D)

From what I've seen, it is rare for CUSIPs to appear in a paragraph of running text, for instance. Even if we pick up such cases and they yield bad CUSIPs, we can still eliminate them with the filters from (1).
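
The filters mentioned in (1) can be sketched roughly as follows. This is only an illustration, not the code used for cusip_cik_test: the function names cusip_check_digit and is_valid_cusip are hypothetical, and the check digit uses the standard CUSIP modulus-10 scheme, which I am assuming matches what calculate_cusip_check_digit does.

```python
import re

def cusip_check_digit(cusip8):
    """Check digit for the first 8 characters of a CUSIP (standard mod-10 scheme)."""
    total = 0
    for i, ch in enumerate(cusip8[:8]):
        if '0' <= ch <= '9':
            value = int(ch)
        else:
            value = ord(ch) - ord('A') + 10   # A=10, B=11, ..., Z=35
        if i % 2 == 1:                        # every second character is doubled
            value *= 2
        total += value // 10 + value % 10     # add the digits of the value
    return (10 - total % 10) % 10

def is_valid_cusip(cusip):
    """True for a 9-character CUSIP that ends in the correct check digit."""
    # Eight alphanumeric characters followed by a digit (so no trailing letter)
    if not re.fullmatch(r'[0-9A-Z]{8}[0-9]', cusip):
        return False
    return cusip_check_digit(cusip) == int(cusip[8])

# Hypothetical candidates: two real CUSIPs, one wrong check digit,
# one ending in a letter, one too short
candidates = ['037833100', '594918104', '037833101', '03783310A', '0378331']
print([c for c in candidates if is_valid_cusip(c)])   # ['037833100', '594918104']
```

A stray match from running text would have to be nine characters long, end in a digit, and carry the right check digit to survive this filter.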

If you're happy with my reasoning here, I think we can close this and #80. Also, if you're happy for me to build a new edgar.cusip_cik using the current code, I can do so (though perhaps see what I'm about to write for #93 first).
