CUSIPs with less than 9 characters + check digits

bdcallen commented 4 years ago

@iangow

This issue is for deciding how to handle CUSIPs that have fewer than 9 characters, and how to check the check digits. I have already made some progress on the latter with a function, calculate_cusip_check_digit, shown below.

def calculate_cusip_check_digit(cusip):

    # Character values: digits map to themselves, letters A-Z to 10-35,
    # and the special characters '*', '@', '#' to 36, 37, 38.
    values = {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9,
              'A': 10, 'B': 11, 'C': 12, 'D': 13, 'E': 14, 'F': 15, 'G': 16, 'H': 17, 'I': 18, 'J': 19,
              'K': 20, 'L': 21, 'M': 22, 'N': 23, 'O': 24, 'P': 25, 'Q': 26, 'R': 27, 'S': 28, 'T': 29,
              'U': 30, 'V': 31, 'W': 32, 'X': 33, 'Y': 34, 'Z': 35, '*': 36, '@': 37, '#': 38}

    if len(cusip) >= 8:
        # 8 or 9 characters: standard calculation over the first 8.
        chars = cusip[:8]
    elif len(cusip) >= 6:
        # 6 or 7 characters: rough guess using only the first 6.
        chars = cusip[:6]
    elif len(cusip) >= 3:
        # 3 to 5 characters: rough guess, left-padding with zeros to 9 characters.
        chars = ('0' * (9 - len(cusip)) + cusip)[:8]
    else:
        return None

    digit_str = ''
    for i, char in enumerate(chars):
        # Characters at odd (0-based) positions have their value doubled.
        weight = 2 if i % 2 else 1
        digit_str += str(weight * values[char])

    # Sum the individual digits, then take the complement modulo 10.
    result = sum(int(digit) for digit in digit_str)

    return (10 - result) % 10

This function uses the Luhn algorithm, as described in the manual here, to calculate what the check digit should be. It currently works correctly for an 8- or 9-character string; for anything with fewer characters I have just made a rough guess as to what the formula should be (which I think is currently wrong). I'm also going to add the check digit calculated by this function as a column in cusip_cik, as it gives a good way to check for erroneous entries (of which there are quite a few).
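As a rough illustration of how a calculated check digit could flag erroneous entries, here is a minimal sketch that compares the 9th character of a CUSIP with the digit implied by its first 8. The names CUSIP_VALUES, expected_check_digit and has_valid_check_digit are made up for this example and are not part of the repository; the character values follow the table in the function above.

```python
import string

# Hypothetical helper: '0'-'9' -> 0-9, 'A'-'Z' -> 10-35, '*', '@', '#' -> 36-38.
CUSIP_VALUES = {c: i for i, c in enumerate(string.digits + string.ascii_uppercase + "*@#")}


def expected_check_digit(cusip):
    """Check digit implied by the first 8 characters of a CUSIP."""
    total = 0
    for i, char in enumerate(cusip[:8]):
        value = CUSIP_VALUES[char]
        if i % 2 == 1:  # every second character is doubled
            value *= 2
        total += value // 10 + value % 10  # add the digits of the value
    return (10 - total % 10) % 10


def has_valid_check_digit(cusip):
    """True if a 9-character CUSIP ends in the digit implied by its first 8."""
    if len(cusip) != 9 or cusip[8] not in string.digits:
        return False
    if any(char not in CUSIP_VALUES for char in cusip[:8]):
        return False
    return int(cusip[8]) == expected_check_digit(cusip)


print(has_valid_check_digit("037833100"))  # True
print(has_valid_check_digit("037833101"))  # False
```

A column built this way only says something about 9-character entries; shorter strings have no check digit to compare against.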

iangow commented 4 years ago

@bdcallen How would we use the check digit with an eight-digit CUSIP? Isn't the ninth digit the check digit? If that's missing, what are we checking?

bdcallen commented 4 years ago

@iangow I made this issue assuming that the full 9 digits were of interest. If we just have 8 digits, we can reliably calculate the check digit, and hence derive the 9th digit from the other 8. I was thinking more of cases with even fewer digits (like 6 or 7), where we might decide to make a guess to complete the 9-digit cusip. For instance, if we know that a security is a common stock, we could guess that the issue number (the 7th and 8th digits) is 10, which is by far the most common issue number for common stock, and then calculate the check digit (and hence the 9th character of the cusip) from there.

Then again, having had a closer look at stocknames and cusipm over the last few days, perhaps the most important thing in cusip-cik mapping is mapping the ciks to the cusip6, since the cusip6 is what is associated with the issuer, and we could match to the full cusip of any security once we know the cusip6 (just match over the first 6 characters). Also, guessing involves extrapolation, since it means assuming the issue number (even though 10, 20, 30 and so on are by far the most common).
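To make the "match over the first 6 characters" idea concrete, here is a small sketch. The function name match_by_cusip6 and the dictionary input are assumptions for illustration only; in practice the mapping would come from whatever table holds the cusip6-cik pairs.

```python
def match_by_cusip6(cusip6_to_cik, full_cusips):
    """Attach a CIK to each full CUSIP whose first 6 characters are known.

    cusip6_to_cik is a hypothetical dict from 6-character issuer codes to CIKs.
    """
    matches = {}
    for cusip in full_cusips:
        cik = cusip6_to_cik.get(cusip[:6])
        if cik is not None:
            matches[cusip] = cik
    return matches


# Illustrative inputs only, not taken from the cusip_cik table.
cusip6_to_cik = {"037833": 320193}
full_cusips = ["037833100", "594918104"]

print(match_by_cusip6(cusip6_to_cik, full_cusips))  # {'037833100': 320193}
```

Nothing here guesses an issue number or a check digit; full CUSIPs are only linked to an issuer that is already known.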

I think we have other issues now for dealing with cusips of fewer than 9 characters (particularly 6 and 7). Perhaps this issue should be closed (maybe after I change the function above to return None for cases with fewer than 8 digits, if that's what you would prefer).

iangow commented 4 years ago

> @iangow I made this issue assuming that the full 9 digits were of interest. If we just have 8 digits, we can reliably calculate the check digit, and hence derive the 9th digit from the other 8 digits.

I see. But we can't check those CUSIPs. So I don't see the point. I think we want to evaluate the merits of retaining sub-9-digit CUSIPs by the number of digits:

- 8 digits. How many additional valid 8-digit CUSIPs do we get by adding 8-digit CUSIPs to a table produced from 9-digit CUSIPs? (The "valid" part is what is difficult to check; we should hand-check a sample and, if they seem mostly good, keep them.)
- 7 digits. Let's check, but I assume that these are 8-digit CUSIPs with a leading 0 lopped off. We should evaluate these against the table produced from 8- and 9-digit CUSIPs.
- 6 digits. Having produced a table from the above, how much do we get in terms of incremental valid CUSIP-CIK matches by looking at six-digit CUSIPs?
- 5 or fewer digits. I assume that we don't want to bother with these.

We should make an issue for each bullet point above.
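A rough sketch of the first two checks, assuming the raw extracted strings are available as a plain list. The function names are hypothetical, and for brevity this compares CUSIP strings only, whereas the real comparison would be on CUSIP-CIK pairs.

```python
from collections import Counter


def count_by_length(raw_cusips):
    """Number of extracted strings at each length."""
    return Counter(len(cusip) for cusip in raw_cusips)


def incremental_cusip8(raw_cusips):
    """8-character CUSIPs not already implied by some 9-character CUSIP."""
    from_nine = {cusip[:8] for cusip in raw_cusips if len(cusip) == 9}
    eight = {cusip for cusip in raw_cusips if len(cusip) == 8}
    return eight - from_nine


def seven_digit_candidates(raw_cusips):
    """7-character strings that match a known 8-character CUSIP once a leading 0 is restored."""
    known8 = {cusip[:8] for cusip in raw_cusips if len(cusip) in (8, 9)}
    return {cusip: "0" + cusip
            for cusip in raw_cusips
            if len(cusip) == 7 and "0" + cusip in known8}


# Illustrative strings only, not data from filings.
raw = ["037833100", "03783310", "59491810", "3783310", "594918"]
print(count_by_length(raw))
print(incremental_cusip8(raw))      # {'59491810'}
print(seven_digit_candidates(raw))  # {'3783310': '03783310'}
```

Whether the incremental entries are valid would still need the hand-check of a sample described above.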

iangow commented 4 years ago

> @iangow I made this issue assuming that the full 9 digits were of interest. If we just have 8 digits, we can reliably calculate the check digit, and hence derive the 9th digit from the other 8 digits.

And the goal is for this table to contain raw data on CUSIPs extracted from filings, not to create guesses of nine-digit CUSIPs based on those data.

mccgr / edgar

CUSIPs with less than 9 characters + check digits #61