mccgr / edgar

Code to manage data related to SEC EDGAR

download item 8 #34

Closed jamespkav closed 5 years ago

jamespkav commented 5 years ago

Thanks Ben.

bdcallen commented 5 years ago

> filing_docs <- tbl(pg, sql("SELECT * FROM edgar.filing_docs"))
> filing_docs_processed <- tbl(pg, sql("SELECT * FROM edgar.filing_docs_processed"))
> item8 <- tbl(pg, sql("SELECT DISTINCT file_name FROM edgar.item_no WHERE left(item_no, 1) = '8'"))
> filing_docs_to_get <- filing_docs %>% inner_join(item8, by = "file_name") %>% anti_join(filing_docs_processed, by = "file_name")
> filing_docs_to_get %>% filter(document %~*% "htm$") %>% count()
# Source:   lazy query [?? x 1]
# Database: postgres 9.6.10 [bdcallen@/var/run/postgresql:5432/crsp]
      n
  <dbl>
1    0.

Done. I first updated filing_docs with the item 8 filings that were not yet in it, then downloaded the HTML documents, in the same way I did for item 5. The download has just finished.
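
For anyone reading along without database access, here is a minimal local sketch of the same selection logic as the check above. It assumes dplyr is installed. The small tibbles and their values are invented stand-ins for edgar.filing_docs, edgar.item_no and edgar.filing_docs_processed, and `grepl()` replaces the `%~*%` operator, which only works against the database.

```r
library(dplyr)

# Toy stand-ins for the database tables (all values are made up).
filing_docs <- tibble(
  file_name = c("a.txt", "a.txt", "b.txt", "c.txt"),
  document  = c("a1.htm", "a2.pdf", "b1.HTM", "c1.htm")
)
items <- tibble(
  file_name = c("a.txt", "b.txt", "b.txt", "c.txt"),
  item_no   = c("8.01", "8.01", "9.01", "5.02")
)
filing_docs_processed <- tibble(file_name = "b.txt", document = "b1.HTM")

# Filings with at least one item number starting with "8".
item8 <- items %>%
  filter(substr(item_no, 1, 1) == "8") %>%
  distinct(file_name)

# Keep documents from item 8 filings, drop filings already processed,
# then keep only documents whose name ends in "htm" (case-insensitive).
filing_docs_to_get <- filing_docs %>%
  inner_join(item8, by = "file_name") %>%
  anti_join(filing_docs_processed, by = "file_name") %>%
  filter(grepl("htm$", document, ignore.case = TRUE))

print(filing_docs_to_get)
```

In this toy data only a1.htm is left to download. Once every item 8 document has been processed, the result has zero rows, which matches the count of 0 above.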

iangow commented 5 years ago

It might be useful to document the steps taken to make this happen, starting from updating edgar.filings through to downloading the .htm files. The easiest way to do this for future tasks is to relate the commits to the associated issue.