Examine cases with shared accession numbers.

iangow commented 7 years ago

From @iangow on April 9, 2017 16:29

shared_ciks <-
    acc_nos %>%
    inner_join(acc_nos, by="accession_no", suffix = c("_x", "_y")) %>%
    filter(cik_x != cik_y) %>%
    compute()

shared_ciks %>%
    group_by(cik_x, cik_y) %>%
    summarize(num_filings = n())

Copied from original issue: iangow/filings#5

iangow commented 6 years ago

More code here:

Sys.setenv(PGHOST="iangow.me", PGDATABASE="crsp")
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL)

pg <- dbConnect(PostgreSQL())

rs <- dbExecute(pg, "SET search_path TO edgar")
rs <- dbExecute(pg, "SET work_mem = '5GB'")

accession_numbers <- tbl(pg, "accession_numbers")
filings <- tbl(pg, "filings")

acc_nos <-
    accession_numbers %>%
    inner_join(filings, by = "file_name") %>%
    select(file_name, cik, accessionnumber) %>%
    # mutate(cik = as.integer(cik)) %>%
    compute(indexes = "cik")

multiple_ciks <-
    acc_nos %>% 
    group_by(accessionnumber) %>% 
    summarize(num_ciks = n_distinct(cik)) %>% 
    filter(num_ciks > 1) %>%
    compute()

multiple_ciks %>% 
    inner_join(acc_nos) %>% 
    arrange(accessionnumber)

rs <- dbDisconnect(pg)

bdcallen commented 5 years ago

@iangow It seems cases, like an 8-K or a 10-K for example, for which an accession number only has one filing, correspond to cases where the filing is only made under the CIK for a single filer.

In contrast, for cases where an accession number appears multiple times, there a multiple things that can happen. In some cases, there seems to be a file_name not only for the filer, but also for a subject or issuer (eg. in SC 13D/G, or Forms 3, 4, 5). An example of this for a SC 13D is this url for the filer and this url for the subject. An example for a Form 4 is this for the issuer and this for the reporting owner. Other cases where an accession number appears multiple times seem to correspond to cases where there are multiple filers, or multiple entities for which the filing has been filed on their behalf. A rather extreme example is the case with accesssion number 0000000000-11-030218

crsp=> SELECT COUNT(DISTINCT(file_name)) FROM edgar.accession_numbers WHERE accessionnumber = '0000000000-11-030218';
 count 
-------
   206
(1 row)

crsp=> SELECT COUNT(*) FROM edgar.accession_numbers WHERE accessionnumber = '0000000000-11-030218';
 count 
-------
   206
(1 row)

for which one of the urls can be found here. There, the large number of multiple agents for whom the filing is being filed for is clear.

bdcallen commented 5 years ago

@iangow This seems to raise a potential issue of duplication. I just did this

sql = """SELECT accessionnumber, COUNT(*) AS num FROM edgar.accession_numbers 
         GROUP BY accessionnumber 
         HAVING COUNT(*) > 1
      """

df = pd.read_sql(sql, engine)

and

df.shape[0]

gave 4898433, while

df['num'].sum()

yielded 10992164. Another useful way of seeing this, which I did, is by doing

crsp=> SELECT COUNT(DISTINCT(accessionnumber)) FROM edgar.accession_numbers;
  count   
----------
 12585414
(1 row)

crsp=> SELECT COUNT(*) FROM edgar.accession_numbers;
  count   
----------
 18679145
(1 row)

So in other words, to produce filing_docs and then download the documents, perhaps we are downloading about 50% more than we actually need to. In the context of EDGAR, that is a lot, especially considering that each filing has multiple filing_docs. And perhaps the identifier for a unique record should not be file_name from filings, but the accessionnumber. Anyways, just food for thought for now.

bdcallen commented 5 years ago

@iangow Perhaps the accessionnumber should be just made a column of filings, then filing_docs constructed from unique accession numbers, if we go further along the path I just suggested.

bdcallen commented 5 years ago

@iangow With the above two posts, I should mention that the reason I made them was that with accession numbers with multiple file names, the corresponding urls seem to lead to the same filing docs.

iangow commented 4 years ago

I will just close this issue.

mccgr / edgar

Examine cases with shared accession numbers. #2