Make program which determines which ciks correspond to companies v individuals

bdcallen commented 4 years ago

@iangow Today, @dingtq asked me if one could make a table determining which cik numbers correspond to actual companies, as opposed to individuals. This information currently can't be derived from the basic tables (ie. filings and so on). We thought we would try the forms345_reporting_owners table, but it turned out that there were many companies that were reporting owners, and that the other features of the table couldn't always distinguish between these and individuals. My current idea is this:

Every entity with a cik that has appeared in filings has filed at least one filing
Every filing has an SGML header file containing information on the entity with the given cik.
Hence, if we can get a dataframe with just one filing per cik, we can analyze the header file and use the information associated with the cik to determine if the identity is a business/company versus an individual.

I have successfully been able to get such a dataframe using the sql command
```
SELECT DISTINCT ON (cik) *
FROM edgar.filings
ORDER BY cik, file_name 
```

I am now working on the second step, using the header file from each filing to extract the required information about each cik.

bdcallen commented 4 years ago

@dingtq Here's a counterexample for just using state of incorporation, as we discussed this afternoon. The cik for this case is 16847, and the company_name is CANADIAN IMPERIAL BANK OF COMMERCE, which is cleanly not an individual. So perhaps we need more variables to test.

dingtq commented 4 years ago

Good catch, @bdcallen. But this case is consistent with our concern that some non-us companies may not provide state of incorporation (some may provide), same for the business address. So for non-us filers, we cannot confidently identify a filer as an individual through non-presence of the state of incorporation or business address. But I am wondering can we identify a foreign filer as a company through the presence of the state of incorporation or business address?

bdcallen commented 4 years ago

@dingtq @iangow Ok, my program get_company_ciks.R has just finished a run. It is abundantly clear that state of incorporation is not enough, as their are many company names among the first 100 entries of edgar.company_ciks, among the many individual names that can be seen later in the rest of the table. So I need to rewrite the is_company function to take into account, such as the presence of a business address, and other variables

bdcallen commented 4 years ago

Also, here's another counterexample, this time the opposite of the one above, which is what appears to be an individual with a State of Incorporation and a Business Address!

dingtq commented 4 years ago

Thank you, Ben, for this. In the second counterexample, I think the individual is the executive/director of the company, so he provides the same State of Incorporation and a Business Address as the company.

bdcallen commented 4 years ago

@dingtq, yes that could be a way to distinguish, at least in those cases. I just found cases like this one, where there is scarce information on one of the ciks in question (in this case, the issuer).

dingtq commented 4 years ago

@bdcallen Yes, it seems that identifying individual/company filers are not so straightforward as we thought. Maybe ultimately, we need to make this judgement based on forms they file or combined with our current approach.

iangow commented 4 years ago

@dingtq What are you trying to achieve here? Do you want to identify individuals? Do you want to identify companies?

dingtq commented 4 years ago

@iangow, @bdcallen, Hi Ian, thank you for asking. I know you are busy, so I tried not to be annoying with too many details. But here they are. Initially, I wanted to identify company filers to assist with downloading pre-processed data from directEdgar. It is related to our idea of extracting directEdgar data and put it on our server. It turns out that directEdgar has extracted/pre-processed text and tables from SEC filings and store them on their Amazon server, which I think could be potentially beneficial for us to have. But to extract this data, one needs to submit a request CSV file to the server through the directEdgar software. The CSV file must have columns of CIK, YEAR, and PF (period focus, which could be FY, Q1, Q2, Q3, etc.) with no more than 39,999 rows. I chatted with Ben, and we thought that we might have to manually submit CSV files and download and then put on our server. (But please let us know if you have a smart and easy way to automate this process). To make this manual process more efficient, it is important to generate the CSV files with all but only company filer CIKs. We thought it should be easy to separate company filer CIKs from individual CIKs, but it is not so far. A side goal for this identification is to get a complete list of company names from the Edgar universe, which can be useful to perform company name matching in some cases. But company name matching could be more complex, and not a priority now. An alternative approach is to generate CSV files for each SEC filing form type, starting from 10K, then 10Q, DEF 14A, etc. If we still cannot figure out a nice way to separate company filers from individual filers within a reasonable time, we will hold it and switch to the alternative approach and proceed. Please let us know your thoughts, or if you want more details.

iangow commented 4 years ago

@dingtq Just do it by filing type (10-K, etc.). Rather than over-engineering a process which may not even work, just create a sample of CIKs, etc., and try that.

dingtq commented 4 years ago

@iangow Got it. Thank you, Ian! @bdcallen Please proceed with what Ian has suggested. You can find more details in the email that I sent you yesterday. Thanks, Ben!

bdcallen commented 4 years ago

@iangow @dingtq Yes I have to agree with Ian here, we should consider doing it by filing type, which will be way quicker. Also, Tony, are we equally interested in all companies? If we're just interested in public companies, for instance, subsetting over 10-K, 8-K, DEF 14A, and so on, should do the job.

dingtq commented 4 years ago

@bdcallen Good, Ben. Let's start with 10-K.

bdcallen commented 4 years ago

@iangow @dingtq Hey Tony, I have run the program as we were discussing last night, and I have just had a look at the resulting table edgar.filing_heading_info. It turns out the relationship between file_name (derived from edgar.filings) and the columns corresponding to directEDGAR (CIK, CNAME, RDATE, CDATE, YEAR, PF and FNAME) are mostly, but not exactly in one-to-one correspondence. In fact, one (CIK, CNAME, RDATE, CDATE, YEAR, PF and FNAME) tuple has a multiplicity of 9

> filing_header_info_df <- tbl(pg, sql("SELECT * FROM edgar.filing_heading_info")) %>% 
+                          rename(CIK = cik, CNAME = cname, RDATE = rdate, CDATE = cdate, YEAR = year, PF = period_focus, FNAME = fname) %>% collect()
> dEDGAR_freq <- filing_header_info_df %>% group_by(CIK, CNAME, RDATE, CDATE, YEAR, PF, FNAME) %>% summarise(freq = n())
> unique(dEDGAR_freq$freq)
[1] 1 2 5 4 9 3
> dEDGAR_freq %>% filter(freq == 9)
# A tibble: 1 x 8
# Groups:   CIK, CNAME, RDATE, CDATE, YEAR, PF [1]
      CIK CNAME                RDATE     CDATE      YEAR PF    FNAME  freq
    <int> <chr>                <chr>     <chr>     <dbl> <chr> <chr> <int>
1 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9
> dEDGAR_freq %>% filter(freq == 9) %>% inner_join(filing_header_info_df)
Joining, by = c("CIK", "CNAME", "RDATE", "CDATE", "YEAR", "PF", "FNAME")
# A tibble: 9 x 10
# Groups:   CIK, CNAME, RDATE, CDATE, YEAR, PF [1]
      CIK CNAME                RDATE     CDATE      YEAR PF    FNAME  freq file_name                                   form_type
    <int> <chr>                <chr>     <chr>     <dbl> <chr> <chr> <int> <chr>                                       <chr>    
1 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9 edgar/data/1477336/0001529359-13-000011.txt 10-K     
2 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9 edgar/data/1477336/0001511861-13-000011.txt 10-K     
3 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9 edgar/data/1477336/0001518904-13-000011.txt 10-K     
4 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9 edgar/data/1477336/0001538940-13-000011.txt 10-K     
5 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9 edgar/data/1477336/0001534318-13-000011.txt 10-K     
6 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9 edgar/data/1477336/0001557473-13-000011.txt 10-K     
7 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9 edgar/data/1477336/0001522559-13-000011.txt 10-K     
8 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9 edgar/data/1477336/0001555611-13-000011.txt 10-K     
9 1477336 Ally Auto Assets LLC R20130329 C20121231  2013 FY    F11       9 edgar/data/1477336/0001507149-13-000011.txt 10-K

I might just have a closer look at cases like these. I suspect it might be duplication of the same underlying filing with different ciks in the accession number (the number that goes 10 digits - 2 digits - 6 digits), given that really a single company with a single cik should only really have one 10-K per financial year. Note that although the 10 digit number is meant to correspond to a cik, this is often not the same cik as that of the filer (whose cik is in the CIK field).

bdcallen commented 4 years ago

@dingtq @iangow Here's the first filing from that set above. Note there are two filers, Ally Auto Assets LLC (CIK: 0001477336) and Ally Auto Receivables Trust 2011-4 (CIK: 0001529359). Looking at rows with the same accession number in filing_heading_info

crsp=# SELECT * FROM edgar.filing_heading_info
crsp-# WHERE file_name ~ '0001529359-13-000011.txt$';

                  file_name                  | form_type |   cik   |        cname         |   rdate   |   cdate   | year | period_focus | fname
---------------------------------------------+-----------+---------+----------------------+-----------+-----------+------+--------------+-------
 edgar/data/1529359/0001529359-13-000011.txt | 10-K      | 1529359 | Ally Auto Assets LLC | R20130329 | C20121231 | 2013 | FY           | F11
 edgar/data/1477336/0001529359-13-000011.txt | 10-K      | 1477336 | Ally Auto Assets LLC | R20130329 | C20121231 | 2013 | FY           | F11
(2 rows)

we see that this filing is caught twice, one for each CIK. And these will have different tuples for directEDGAR due to the different CIKs. So this filing will be caught under at least one CIK. Ideally though, there would be some other variable for directEDGAR which would get rid of the many-to-one cases (many being the many file_names per tuple. Tony, do you know if such a variable exists?

bdcallen commented 4 years ago

Also, having a look at the example above, I need to delete these cases, amend my code, and then redo them. as the cname is erroneously the same. My code as it currently stands selects the cname for the first FILER, not necessarily that corresponding to the cik. This should be an easy fix. I'll commit the full code and then run it over these cases once done.

dingtq commented 4 years ago

@bdcallen Good catch. I think it is important to examine whether these filings are actually the same filing, submitted by different but related filers through different submission. Put it in other words, the accession number is used as a unique identifier for each submission, but not unique for the underlying files. If that is the case, we don't care much when the tuple has a multiplicity of 9 file names, or accession numbers as long as they point to the same file, which can be identified through the same tuple in directEdgar (CIK, CNAME, RDATE, CDATE, YEAR, PF and FNAME).

According to Edgar, the CIK in the Accession Number is not necessarily the underlying company's CIK. "In the example above, "0001193125-15-118890" is the "accession number," a unique identifier assigned automatically to an accepted submission by the EDGAR Filer System. The first set of numbers (0001193125) is the CIK of the entity submitting the filing. This could be the company or a third-party filer agent. Some filer agents without a regulatory requirement to make disclosure filings with the SEC have a CIK but no searchable presence in the public EDGAR database. The next 2 numbers (15) represent the year. The last series of numbers represent a sequential count of submitted filings from that CIK. The count is usually, but not always, reset to 0 at the start of each calendar year."

On the other hand, I think it is Ok if we have the same file in both the filer's directory and the underlying company's directory.

bdcallen commented 4 years ago

@dingtq I just ran this, after correcting some problems after a run last night, and then finishing off the table just now

library(dplyr)

filing_header_info_df <- tbl(pg, sql("SELECT * FROM edgar.filing_heading_info")) %>% 
                         rename(CIK = cik, CNAME = cname, RDATE = rdate, CDATE = cdate, YEAR = year, PF = period_focus, FNAME = fname) %>% collect()

filing_header_info_towrite <- filing_header_info_df %>% select(CIK, CNAME, RDATE, CDATE, YEAR, PF, FNAME) %>% distinct()  %>% arrange(RDATE)

num_filings <- nrow(filing_header_info_towrite)
batch_size <- 39999

num_batches <- floor(num_filings/batch_size) + 1

for (i in 1:num_batches) {

    start <- (i-1) * batch_size + 1

    if(i == num_batches) {

        end <- num_filings

    } else {

        end <- i * batch_size

    }

    batch <- filing_header_info_towrite[start:end, ]

    write.csv(batch, paste0('/home/shared/directEDGAR/filings_10k_', i, '.csv' ), row.names = FALSE)

}

So the code above takes the dataframe with the distinct (CIK, CNAME, RDATE, CDATE, YEAR, PF, FNAME) tuples ordered in ascending order by RDATE, then segments it into batches with a maximum number of observations equal to 39999 (as you requested), then writes each batch to a file in the folder /home/shared/directEDGAR. The files for each batch are labelled by their batch number, ie. filings_10k_1.csv, filings_10k_2,csv and so on.

Thus the full path for the first file is /home/shared/directEDGAR/filings_10k_1.csv, /home/shared/directEDGAR/filings_10k_2.csv for the second file, and so on. Since the dataframe was ordered by RDATE, filings_10k_1.csv contains the earliest filings, while the last batch with the largest integer, in this case filings_10k_6.csv, contains the most recent filings.

I still think we should find some other variable that directEDGAR can use to distinguish the filings which share the same tuples. Because for those examples above, the filings are in fact different. But if there are no problems after using directEDGAR using this data, I think we can close this.

dingtq commented 4 years ago

Great. Thank you, Ben!

From: Benjamin Callen notifications@github.com Sent: Monday, March 30, 2020 5:21:24 PM To: mccgr/edgar edgar@noreply.github.com Cc: Tongqing Ding tongqing.ding@unimelb.edu.au; Mention mention@noreply.github.com Subject: Re: [mccgr/edgar] Make program which determines which ciks correspond to companies v individuals (#73)

Hi Tony,

I don't know if you just saw my GitHub post on the issue, but I have now taken the distinct tuples (CIK, CNAME, RDATE, CDATE, YEAR, PF and FNAME) derived from edgar.filing_heading_info and written them to a bunch of csv files, as you initially requested, despite some of the problems with a small number of filings sharing tuples. In my post, I state the following:

"""

So the code above takes the dataframe with the distinct (CIK, CNAME, RDATE, CDATE, YEAR, PF, FNAME) tuples ordered in ascending order by RDATE, then segments it into batches with a maximum number of observations equal to 39999 (as you requested), then writes each batch to a file in the folder /home/shared/directEDGAR. The files for each batch are labelled by their batch number, ie. filings_10k_1.csv, filings_10k_2,csv and so on.

Thus the full path for the first file is /home/shared/directEDGAR/filings_10k_1.csv, /home/shared/directEDGAR/filings_10k_2.csv for the second file, and so on. Since the dataframe was ordered by RDATE, filings_10k_1.csv contains the earliest filings, while the last batch with the largest integer, in this case filings_10k_6.csv, contains the most recent filings.

"""

So I hope you get really good use out of this. Let me know if you run into an issues. I will commit the code I used now, just so we retain a record.

Cheers, Ben

From: Tony Ding notifications@github.com Sent: Friday, 27 March 2020 10:53 PM To: mccgr/edgar edgar@noreply.github.com Cc: Benjamin Callen ben_callen@hotmail.com; Mention mention@noreply.github.com Subject: Re: [mccgr/edgar] Make program which determines which ciks correspond to companies v individuals (#73)

@bdcallenhttps://github.com/bdcallen Good catch. I think it is important to examine whether these filings are actually the same filing, submitted by different but related filers through different submission. Put it in other words, the accession number is used as a unique identifier for each submission, but not unique for the underlying files. If that is the case, we don't care much when the tuple has a multiplicity of 9 file names, or accession numbers as long as they point to the same file, which can be identified through the same tuple in directEdgar (CIK, CNAME, RDATE, CDATE, YEAR, PF and FNAME).

According to Edgar, the CIK in the Accession Number is not necessarily the underlying company's CIK. "In the example above, "0001193125-15-118890" is the "accession number," a unique identifier assigned automatically to an accepted submission by the EDGAR Filer System. The first set of numbers (0001193125) is the CIK of the entity submitting the filing. This could be the company or a third-party filer agent. Some filer agents without a regulatory requirement to make disclosure filings with the SEC have a CIK but no searchable presence in the public EDGAR database. The next 2 numbers (15) represent the year. The last series of numbers represent a sequential count of submitted filings from that CIK. The count is usually, but not always, reset to 0 at the start of each calendar year."

On the other hand, I think it is Ok if we have the same file in both the filer's directory and the underlying company's directory.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/mccgr/edgar/issues/73#issuecomment-604959333, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AE7IFNGHPSLJ7DNRF75GIPLRJSHUPANCNFSM4LG4RNNA.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/mccgr/edgar/issues/73#issuecomment-605805442, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHJ6VXJY53C6ZFDZZ7I3G23RKA26JANCNFSM4LG4RNNA.

iangow commented 4 years ago

I think we can close this one? If the goal is to create a list of CIKs for which we want to get 10-Ks, then you just need to pick CIKs that have CIKs. Done. No need to filter out individuals, etc. Am I missing something?

dingtq commented 4 years ago

@iangow @bdcallen I agree that we can close this one.

mccgr / edgar

Make program which determines which ciks correspond to companies v individuals #73