mccgr / edgar

Code to manage data related to SEC EDGAR

CUSIPs with less than 9 characters + check digits #61

Closed bdcallen closed 4 years ago

bdcallen commented 4 years ago

@iangow

This issue is for deciding how to handle CUSIPs that have fewer than 9 characters, and for verifying check digits. I have already made some progress on the latter by writing a function, calculate_cusip_check_digit, shown below.

def calculate_cusip_check_digit(cusip):
    """Return the Luhn-style check digit for a CUSIP string, or None if it is too short."""

    # Character values: '0'-'9' -> 0-9, 'A'-'Z' -> 10-35, '*' -> 36, '@' -> 37, '#' -> 38
    values = {c: i for i, c in enumerate('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#')}

    if len(cusip) >= 8:
        # 8 or 9 characters: use the first 8 (issuer code + issue number)
        chars = cusip[:8]
    elif len(cusip) >= 6:
        # Rough guess: use only the first 6 characters (issuer code)
        chars = cusip[:6]
    elif len(cusip) >= 3:
        # Rough guess: left-pad with zeros to 9 characters, then use the first 8
        chars = ('0' * (9 - len(cusip)) + cusip)[:8]
    else:
        return None

    result = 0

    for i, ch in enumerate(chars):
        # Double the value at every second position (odd 0-based index)
        v = values[ch] if i % 2 == 0 else 2 * values[ch]
        # Add the decimal digits of v (v is at most 76, so at most two digits)
        result += v // 10 + v % 10

    return (10 - result) % 10

This function uses the Luhn algorithm, as described in the manual here, to calculate what the check digit should be. It currently works correctly for an 8- or 9-character string; for anything with fewer characters I have just made a rough guess as to what the formula should be (which I think is currently wrong). I'm also going to add the check digit calculated by this function as a column in cusip_cik, as it gives a good way to check for erroneous entries (of which there are quite a few).
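To illustrate how the computed check digit could flag erroneous entries, here is a minimal sketch. The names `check_digit` and `is_valid_cusip9` are hypothetical and not part of the repository; the sketch assumes upper-case input and only handles the 9-character case, where the last character can be compared against the computed value.

```python
# Character values: '0'-'9' -> 0-9, 'A'-'Z' -> 10-35, '*' -> 36, '@' -> 37, '#' -> 38
VALUES = {c: i for i, c in enumerate('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#')}


def check_digit(cusip8):
    """Check digit computed from the first 8 characters of a CUSIP."""
    total = 0
    for i, ch in enumerate(cusip8[:8]):
        v = VALUES[ch]
        if i % 2 == 1:  # every second character is doubled
            v *= 2
        total += v // 10 + v % 10  # add the decimal digits of v
    return (10 - total) % 10


def is_valid_cusip9(cusip):
    """Hypothetical helper: True if a 9-character CUSIP ends in its check digit."""
    if len(cusip) != 9 or any(ch not in VALUES for ch in cusip[:8]):
        return False
    return cusip[8] == str(check_digit(cusip))


print(is_valid_cusip9('037833100'))  # True
print(is_valid_cusip9('037833101'))  # False (last digit altered)
```

Strings that are not exactly 9 characters long are reported as invalid here because there is no ninth character to compare against.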

iangow commented 4 years ago

@bdcallen How would we use the check digit with an eight-digit CUSIP? Isn't the ninth digit the check digit? If that's missing, what are we checking?

bdcallen commented 4 years ago

@iangow I made this issue assuming that the full 9 digits were of interest. If we just have 8 digits, we can reliably calculate the check digit, and hence derive the 9th digit from the other 8 digits. I was thinking more of cases with even fewer digits (like 6 or 7), in which we might decide to make a guess to complete the 9-digit CUSIP. For instance, if we know that a security is a common stock, we could guess that the issue number (the 7th and 8th digits) is 10, which is by far the most common issue number for common stock, and then calculate the check digit (and hence the 9th character of the CUSIP) from there. Then again, having had a closer look at stocknames and cusipm the last few days, perhaps the most important thing in CUSIP-CIK mapping is mapping the CIKs to the cusip6, since the cusip6 is what is associated with the issuer, and once we know the cusip6 we can match to the full CUSIP of any security (just match on the first 6 characters). Also, assuming the issue number is an extrapolation and could be wrong (even though 10, 20, 30 and so on are by far the most common).
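The two cases described above (deriving the 9th character from 8, and guessing issue number 10 for a bare cusip6) might be sketched as follows. `complete_cusip` and its `guessed_issue` parameter are hypothetical names used only for illustration, and the function returns a flag so that guessed values stay distinguishable from derived ones.

```python
# Character values: '0'-'9' -> 0-9, 'A'-'Z' -> 10-35, '*' -> 36, '@' -> 37, '#' -> 38
VALUES = {c: i for i, c in enumerate('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#')}


def check_digit(cusip8):
    """Check digit computed from the first 8 characters of a CUSIP."""
    total = 0
    for i, ch in enumerate(cusip8[:8]):
        v = VALUES[ch] * (2 if i % 2 == 1 else 1)  # double every second character
        total += v // 10 + v % 10                  # add the decimal digits of v
    return (10 - total) % 10


def complete_cusip(cusip, guessed_issue='10'):
    """Hypothetical helper returning (nine_char_cusip, is_guess).

    8 characters: the check digit is derived, so nothing is guessed.
    6 characters: the issue number is ASSUMED to be `guessed_issue`.
    Other lengths: (None, False), since no rule is proposed for them here.
    """
    if len(cusip) == 9:
        return cusip, False
    if len(cusip) == 8:
        return cusip + str(check_digit(cusip)), False
    if len(cusip) == 6:
        base = cusip + guessed_issue
        return base + str(check_digit(base)), True
    return None, False


print(complete_cusip('03783310'))  # ('037833100', False)
print(complete_cusip('037833'))    # ('037833100', True)
```

The second result is only correct if the security really has issue number 10, which is the extrapolation risk noted above.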

I think we now have other issues for dealing with CUSIPs of fewer than 9 characters (particularly 6 and 7). Perhaps this issue should be closed (maybe after I change the function above to return None for inputs with fewer than 8 characters, if that's what you would prefer).
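As a rough sketch of the cusip6 matching idea mentioned above: once a cusip6-to-CIK mapping is known, any full CUSIP can be linked to a CIK by its first 6 characters. The function name `match_by_cusip6` and the sample mapping are hypothetical and chosen only for illustration; they do not reflect the actual table layout in this repository.

```python
def match_by_cusip6(cik_by_cusip6, full_cusips):
    """Map each full CUSIP to the CIK of its issuer via its first 6 characters.

    CUSIPs whose cusip6 is not in the mapping are left out of the result.
    """
    return {
        cusip: cik_by_cusip6[cusip[:6]]
        for cusip in full_cusips
        if cusip[:6] in cik_by_cusip6
    }


# Sample values for illustration only
cik_by_cusip6 = {'037833': 320193, '594918': 789019}
full_cusips = ['037833100', '594918104', '17275R102']

print(match_by_cusip6(cik_by_cusip6, full_cusips))
# {'037833100': 320193, '594918104': 789019}
```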

iangow commented 4 years ago

> @iangow I made this issue assuming that the full 9 digits were of interest. If we just have 8 digits, we can reliably calculate the check digit, and hence derive the 9th digit from the other 8 digits.

I see. But we can't check those CUSIPs. So I don't see the point. I think we want to evaluate the merits of retaining sub-9-digit CUSIPs by the number of digits:

We should make an issue for each bullet point above.
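As a starting point for evaluating sub-9-digit CUSIPs by length, a minimal sketch that tallies extracted CUSIP strings by their number of characters. The function name `count_by_length` and the sample list are hypothetical; in practice the input would come from the extracted filings data.

```python
from collections import Counter


def count_by_length(cusips):
    """Count how many CUSIP strings there are of each length."""
    return Counter(len(c) for c in cusips)


# Sample values for illustration only
sample = ['037833100', '03783310', '037833', '594918104']

for length, n in sorted(count_by_length(sample).items()):
    print(length, n)
```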

iangow commented 4 years ago

> @iangow I made this issue assuming that the full 9 digits were of interest. If we just have 8 digits, we can reliably calculate the check digit, and hence derive the 9th digit from the other 8 digits.

And the goal is for this table to contain raw data on CUSIPs extracted from filings, not to create guesses of nine-digit CUSIPs based on those data.