Closed iangow closed 4 years ago
More code here:
Sys.setenv(PGHOST="iangow.me", PGDATABASE="crsp")
library(dplyr, warn.conflicts = FALSE)
library(RPostgreSQL)
pg <- dbConnect(PostgreSQL())
rs <- dbExecute(pg, "SET search_path TO edgar")
rs <- dbExecute(pg, "SET work_mem = '5GB'")
accession_numbers <- tbl(pg, "accession_numbers")
filings <- tbl(pg, "filings")
acc_nos <-
accession_numbers %>%
inner_join(filings, by = "file_name") %>%
select(file_name, cik, accessionnumber) %>%
# mutate(cik = as.integer(cik)) %>%
compute(indexes = "cik")
multiple_ciks <-
acc_nos %>%
group_by(accessionnumber) %>%
summarize(num_ciks = n_distinct(cik)) %>%
filter(num_ciks > 1) %>%
compute()
multiple_ciks %>%
inner_join(acc_nos) %>%
arrange(accessionnumber)
rs <- dbDisconnect(pg)
@iangow It seems cases, like an 8-K or a 10-K for example, for which an accession number only has one filing, correspond to cases where the filing is only made under the CIK for a single filer.
In contrast, for cases where an accession number appears multiple times, there a multiple things that can happen. In some cases, there seems to be a file_name
not only for the filer, but also for a subject or issuer (eg. in SC 13D/G, or Forms 3, 4, 5). An example of this for a SC 13D is this url for the filer and this url for the subject. An example for a Form 4 is this for the issuer and this for the reporting owner. Other cases where an accession number appears multiple times seem to correspond to cases where there are multiple filers, or multiple entities for which the filing has been filed on their behalf. A rather extreme example is the case with accesssion number 0000000000-11-030218
crsp=> SELECT COUNT(DISTINCT(file_name)) FROM edgar.accession_numbers WHERE accessionnumber = '0000000000-11-030218';
count
-------
206
(1 row)
crsp=> SELECT COUNT(*) FROM edgar.accession_numbers WHERE accessionnumber = '0000000000-11-030218';
count
-------
206
(1 row)
for which one of the urls can be found here. There, the large number of multiple agents for whom the filing is being filed for is clear.
@iangow This seems to raise a potential issue of duplication. I just did this
sql = """SELECT accessionnumber, COUNT(*) AS num FROM edgar.accession_numbers
GROUP BY accessionnumber
HAVING COUNT(*) > 1
"""
df = pd.read_sql(sql, engine)
and
df.shape[0]
gave 4898433, while
df['num'].sum()
yielded 10992164. Another useful way of seeing this, which I did, is by doing
crsp=> SELECT COUNT(DISTINCT(accessionnumber)) FROM edgar.accession_numbers;
count
----------
12585414
(1 row)
crsp=> SELECT COUNT(*) FROM edgar.accession_numbers;
count
----------
18679145
(1 row)
So in other words, to produce filing_docs
and then download the documents, perhaps we are downloading about 50% more than we actually need to. In the context of EDGAR, that is a lot, especially considering that each filing has multiple filing_docs. And perhaps the identifier for a unique record should not be file_name
from filings
, but the accessionnumber
. Anyways, just food for thought for now.
@iangow Perhaps the accessionnumber
should be just made a column of filings
, then filing_docs
constructed from unique accession numbers, if we go further along the path I just suggested.
@iangow With the above two posts, I should mention that the reason I made them was that with accession numbers with multiple file names, the corresponding urls seem to lead to the same filing docs.
I will just close this issue.
From @iangow on April 9, 2017 16:29
Copied from original issue: iangow/filings#5