Closed bdcallen closed 6 years ago
After running
all_files <-
filings %>%
filter(form_type %in% form_types) %>%
select(file_name)
files_to_read <-
filings %>%
filter(form_type %in% form_types) %>%
select(file_name) %>%
anti_join(item_no, by = "file_name")
all_files <- data.frame(all_files)
files_to_read <- data.frame(files_to_read)
I checked the number of rows in each of these data frames and got
> nrow(all_files)
[1] 1504607
and
> nrow(files_to_read)
[1] 3709
So it seems the anti_join is doing the right thing, though I will check more thoroughly.
It might be worth checking that the files to process are all new (based on date_filed on edgar.filings).
Also, you should be able to just say files_to_read %>% count() (as.data.frame() forces the data to come into R).
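A minimal sketch of that suggestion, assuming an in-memory SQLite table as a stand-in for the PostgreSQL tables. The toy data and the `memdb_frame()` setup are assumptions for illustration only (they need the RSQLite package) and are not part of the actual script:

```r
library(dplyr)
library(dbplyr)

# Hypothetical toy tables standing in for the remote filings and item_no tables.
filings <- memdb_frame(
  file_name = c("a.txt", "b.txt", "c.txt"),
  form_type = c("8-K", "8-K", "10-K")
)
item_no <- memdb_frame(file_name = "a.txt")

form_types <- c("8-K", "8-K/A")

# Same pipeline as above; everything stays a lazy query in the database.
files_to_read <-
  filings %>%
  filter(form_type %in% form_types) %>%
  select(file_name) %>%
  anti_join(item_no, by = "file_name")

# The database computes the count; only a single row is pulled into R.
n_to_read <- files_to_read %>% count() %>% pull(n)
```

Here only the one-row result of the count crosses into R, whereas wrapping the query in data.frame() first would fetch every matching file_name.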
files_to_read2 <- files_to_read %>% inner_join(filings, by = "file_name")
files_to_read2
produces
# Source: lazy query [?? x 5]
# Database: postgres 9.6.8 [bdcallen@10.101.13.99:5432/crsp]
file_name company_name form_type cik date_filed
<chr> <chr> <chr> <int> <date>
1 edgar/data/750574/0001193125-18-138703.txt AUBURN NATIONAL BANCORPORATION, INC 8-K 750574 2018-04-27
2 edgar/data/750574/0001193125-18-156160.txt AUBURN NATIONAL BANCORPORATION, INC 8-K 750574 2018-05-08
3 edgar/data/750574/0001193125-18-160444.txt AUBURN NATIONAL BANCORPORATION, INC 8-K 750574 2018-05-11
4 edgar/data/1362190/0001144204-18-028363.txt AUDIOEYE INC 8-K 1362190 2018-05-15
5 edgar/data/826253/0001213900-18-005748.txt AURA SYSTEMS INC 8-K 826253 2018-05-09
6 edgar/data/1492091/0001492091-18-000009.txt AUSCRETE Corp 8-K 1492091 2018-05-14
7 edgar/data/769397/0000769397-18-000021.txt AUTODESK INC 8-K 769397 2018-05-24
8 edgar/data/1034670/0001193125-18-135138.txt AUTOLIV INC 8-K 1034670 2018-04-26
9 edgar/data/1034670/0001564590-18-009443.txt AUTOLIV INC 8-K 1034670 2018-04-27
10 edgar/data/1034670/0001193125-18-156134.txt AUTOLIV INC 8-K 1034670 2018-05-08
# ... with more rows
I then ran
files_not_to_read <- all_files %>% anti_join(files_to_read, by = "file_name") %>% inner_join(filings, by = "file_name")
files_not_to_read
which produced
# Source: lazy query [?? x 5]
# Database: postgres 9.6.8 [bdcallen@10.101.13.99:5432/crsp]
file_name company_name form_type cik date_filed
<chr> <chr> <chr> <int> <date>
1 edgar/data/1135185/0001299933-18-000390.txt ATLAS AIR WORLDWIDE HOLDINGS INC 8-K 1135185 2018-05-03
2 edgar/data/731802/0000731802-18-000018.txt ATMOS ENERGY CORP 8-K 731802 2018-05-02
3 edgar/data/879585/0001104659-18-025228.txt ATN International, Inc. 8-K 879585 2018-04-19
4 edgar/data/879585/0001104659-18-026720.txt ATN International, Inc. 8-K 879585 2018-04-26
5 edgar/data/1488039/0001615774-18-002666.txt ATOSSA GENETICS INC 8-K 1488039 2018-04-17
6 edgar/data/1488039/0001615774-18-002808.txt ATOSSA GENETICS INC 8-K 1488039 2018-04-23
7 edgar/data/1488039/0001615774-18-003978.txt ATOSSA GENETICS INC 8-K 1488039 2018-05-17
8 edgar/data/701288/0001171843-18-003753.txt ATRION CORP 8-K 701288 2018-05-09
9 edgar/data/701288/0001144204-18-030576.txt ATRION CORP 8-K 701288 2018-05-23
10 edgar/data/750574/0001193125-18-128951.txt AUBURN NATIONAL BANCORPORATION, INC 8-K 750574 2018-04-24
# ... with more rows
so the dates of the two complementary sets at least overlap. I then ran
file_to_read_dates <- data.frame(files_to_read2)$date_filed
min(file_to_read_dates)
which gave the result
[1] "2018-04-23"
and then
file_not_to_read_dates <- data.frame(files_not_to_read)$date_filed
min(file_not_to_read_dates)
which produced
[1] "1993-10-29"
Out of interest
max(file_not_to_read_dates)
yielded
[1] "2018-05-25"
So all the files whose item numbers still need to be read were filed on or after 2018-04-23, that is, within the last month or so. Since the files that have already been processed run through 2018-05-25, which is later than that, perhaps the files to be read are ones for which the program failed to download the item numbers for some reason. Perhaps the last thing to check is to run the whole program, compute files_to_read again, and check that it is an empty data frame.
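Both checks could be sketched roughly as follows, keeping the work in the database. This assumes hypothetical toy data in an in-memory SQLite table (via `memdb_frame()`, which needs RSQLite), and `get_files_to_read()` is a helper name made up here, not a function from the repository:

```r
library(dplyr)
library(dbplyr)

# Hypothetical toy data; dates are ISO strings because SQLite has no date type.
filings <- memdb_frame(
  file_name  = c("a.txt", "b.txt", "c.txt"),
  form_type  = c("8-K", "8-K", "8-K"),
  date_filed = c("1993-10-29", "2018-04-23", "2018-05-25")
)

# Hypothetical helper mirroring the anti_join used earlier in this thread.
get_files_to_read <- function(filings, item_no) {
  filings %>%
    filter(form_type == "8-K") %>%
    select(file_name) %>%
    anti_join(item_no, by = "file_name")
}

# Before the run: only a.txt has item numbers, so b.txt and c.txt remain.
item_no_before <- memdb_frame(file_name = "a.txt")
date_range <-
  get_files_to_read(filings, item_no_before) %>%
  inner_join(filings, by = "file_name") %>%
  summarise(earliest = min(date_filed, na.rm = TRUE),
            latest   = max(date_filed, na.rm = TRUE)) %>%
  collect()

# After the run: every file has item numbers, so nothing should remain.
item_no_after <- memdb_frame(file_name = c("a.txt", "b.txt", "c.txt"))
n_remaining <-
  get_files_to_read(filings, item_no_after) %>%
  count() %>%
  pull(n)
```

If n_remaining is not zero after a full run, the leftover file names point to filings whose item numbers could not be extracted.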
Awesome. This is clear.
https://github.com/iangow-public/edgar/blob/a7f16db36d97034042fc195efad08e808fb4eb78/get_item_nos.R#L31