Check that this code is not reprocessing files

bdcallen commented 6 years ago

https://github.com/iangow-public/edgar/blob/a7f16db36d97034042fc195efad08e808fb4eb78/get_item_nos.R#L31

bdcallen commented 6 years ago

After running

all_files <-
    filings %>%
    filter(form_type %in% form_types) %>%
    select(file_name) 

files_to_read <-
    filings %>%
    filter(form_type %in% form_types) %>%
    select(file_name) %>%
    anti_join(item_no, by = "file_name")

all_files <- data.frame(all_files)
files_to_read <- data.frame(files_to_read)

I checked the number of rows of each of these dataframes. I got

> nrow(all_files)
[1] 1504607

and

> nrow(files_to_read)
[1] 3709

So it seems the anti_join is doing the right thing, though I will check more thoroughly.
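
For reference, here is a minimal sketch of what the anti_join step is expected to do, using small local tibbles in place of the database tables. The rows, the item_no column, and the form_types value are made up for illustration; only the table names and file_name/form_type come from the code above.

```r
library(dplyr)

# Toy stand-ins for the database tables (hypothetical data)
filings <- tibble(
    file_name = c("a.txt", "b.txt", "c.txt", "d.txt"),
    form_type = c("8-K", "8-K", "10-K", "8-K")
)
item_no <- tibble(
    file_name = c("a.txt", "a.txt", "b.txt"),
    item_no = c("2.02", "9.01", "5.02")
)
form_types <- c("8-K")  # assumed value for this sketch

all_files <-
    filings %>%
    filter(form_type %in% form_types) %>%
    select(file_name)

# Keep only files with no matching rows in item_no, i.e. not yet processed
files_to_read <-
    all_files %>%
    anti_join(item_no, by = "file_name")

files_to_read
```

In this toy case three files match the form type and only d.txt has no rows in item_no, so it is the only one left to read. A file with several item rows (a.txt here) is still excluded just once, because anti_join filters rows of the left table rather than multiplying them.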

iangow commented 6 years ago

It might be worth checking that the files to process are all new (based on date_filed on edgar.filings).
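
One way this check could look, sketched on local tibbles with made-up rows (the real tables live in the database, and the file names and dates below are hypothetical). The idea is to attach date_filed to the files still to be processed and summarise the date range in a single step.

```r
library(dplyr)

# Hypothetical stand-ins for the unprocessed files and edgar.filings
files_to_read <- tibble(file_name = c("d.txt", "e.txt"))
filings <- tibble(
    file_name = c("a.txt", "d.txt", "e.txt"),
    date_filed = as.Date(c("1993-10-29", "2018-04-23", "2018-05-24"))
)

# Attach filing dates and summarise the range of the unprocessed files
date_range <-
    files_to_read %>%
    inner_join(filings, by = "file_name") %>%
    summarise(first_filed = min(date_filed, na.rm = TRUE),
              last_filed = max(date_filed, na.rm = TRUE),
              n = n())

date_range
```

If first_filed is well before the last run of the program, that would suggest old files are being picked up again.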

iangow commented 6 years ago

Also, you should be able to just say files_to_read %>% count(), since as.data.frame() forces the data to come into R.
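
A small sketch of the count() approach, shown on a local tibble with made-up rows. On a database-backed table, dplyr should translate count() into a query that returns a single row, whereas as.data.frame() pulls every row into R first.

```r
library(dplyr)

# Hypothetical rows standing in for the remote table
files_to_read <- tibble(file_name = c("d.txt", "e.txt", "f.txt"))

# One-row result with a single column n
files_to_read %>% count()

# The same count as a plain number
n_to_read <- files_to_read %>% count() %>% pull(n)
n_to_read
```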

bdcallen commented 6 years ago

files_to_read2 <- files_to_read %>% inner_join(filings, by = "file_name")
files_to_read2

produces

# Source:   lazy query [?? x 5]
# Database: postgres 9.6.8 [bdcallen@10.101.13.99:5432/crsp]
   file_name                                   company_name                        form_type     cik date_filed
   <chr>                                       <chr>                               <chr>       <int> <date>    
 1 edgar/data/750574/0001193125-18-138703.txt  AUBURN NATIONAL BANCORPORATION, INC 8-K        750574 2018-04-27
 2 edgar/data/750574/0001193125-18-156160.txt  AUBURN NATIONAL BANCORPORATION, INC 8-K        750574 2018-05-08
 3 edgar/data/750574/0001193125-18-160444.txt  AUBURN NATIONAL BANCORPORATION, INC 8-K        750574 2018-05-11
 4 edgar/data/1362190/0001144204-18-028363.txt AUDIOEYE INC                        8-K       1362190 2018-05-15
 5 edgar/data/826253/0001213900-18-005748.txt  AURA SYSTEMS INC                    8-K        826253 2018-05-09
 6 edgar/data/1492091/0001492091-18-000009.txt AUSCRETE Corp                       8-K       1492091 2018-05-14
 7 edgar/data/769397/0000769397-18-000021.txt  AUTODESK INC                        8-K        769397 2018-05-24
 8 edgar/data/1034670/0001193125-18-135138.txt AUTOLIV INC                         8-K       1034670 2018-04-26
 9 edgar/data/1034670/0001564590-18-009443.txt AUTOLIV INC                         8-K       1034670 2018-04-27
10 edgar/data/1034670/0001193125-18-156134.txt AUTOLIV INC                         8-K       1034670 2018-05-08
# ... with more rows

I then ran

files_not_to_read <- all_files %>% anti_join(files_to_read, by = "file_name") %>% inner_join(filings, by = "file_name")
files_not_to_read

which produced

# Source:   lazy query [?? x 5]
# Database: postgres 9.6.8 [bdcallen@10.101.13.99:5432/crsp]
   file_name                                   company_name                        form_type     cik date_filed
   <chr>                                       <chr>                               <chr>       <int> <date>    
 1 edgar/data/1135185/0001299933-18-000390.txt ATLAS AIR WORLDWIDE HOLDINGS INC    8-K       1135185 2018-05-03
 2 edgar/data/731802/0000731802-18-000018.txt  ATMOS ENERGY CORP                   8-K        731802 2018-05-02
 3 edgar/data/879585/0001104659-18-025228.txt  ATN International, Inc.             8-K        879585 2018-04-19
 4 edgar/data/879585/0001104659-18-026720.txt  ATN International, Inc.             8-K        879585 2018-04-26
 5 edgar/data/1488039/0001615774-18-002666.txt ATOSSA GENETICS INC                 8-K       1488039 2018-04-17
 6 edgar/data/1488039/0001615774-18-002808.txt ATOSSA GENETICS INC                 8-K       1488039 2018-04-23
 7 edgar/data/1488039/0001615774-18-003978.txt ATOSSA GENETICS INC                 8-K       1488039 2018-05-17
 8 edgar/data/701288/0001171843-18-003753.txt  ATRION CORP                         8-K        701288 2018-05-09
 9 edgar/data/701288/0001144204-18-030576.txt  ATRION CORP                         8-K        701288 2018-05-23
10 edgar/data/750574/0001193125-18-128951.txt  AUBURN NATIONAL BANCORPORATION, INC 8-K        750574 2018-04-24
# ... with more rows

so the dates of the two complementary sets at least overlap. I then ran

file_to_read_dates <- data.frame(files_to_read2)$date_filed
min(file_to_read_dates)

which gave the result

[1] "2018-04-23"

and then

file_not_to_read_dates <- data.frame(files_not_to_read)$date_filed
min(file_not_to_read_dates)

which produced

[1] "1993-10-29"

Out of interest

max(file_not_to_read_dates)

yielded

[1] "2018-05-25"

In other words, all the files whose item numbers still need to be read were filed in the last month or so (the earliest on 2018-04-23). Since the files that have already been processed run later than that (up to 2018-05-25), the files still to be read may be ones for which the program failed to download the item numbers for some reason. The last thing to check is probably to run the whole program, compute files_to_read again, and confirm that it is an empty data frame.
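
A rough sketch of that final check, again on local tibbles with hypothetical rows. It assumes the full run has added rows to item_no for every file, which is exactly what the check is meant to confirm.

```r
library(dplyr)

# Hypothetical state after a full run: every file now has rows in item_no
all_files <- tibble(file_name = c("a.txt", "b.txt", "d.txt"))
item_no <- tibble(file_name = c("a.txt", "b.txt", "d.txt"))

files_to_read <- all_files %>% anti_join(item_no, by = "file_name")

# Count the remaining files rather than pulling them all into R
n_remaining <- files_to_read %>% count() %>% pull(n)

if (n_remaining == 0) {
    message("Nothing left to process")
} else {
    message(n_remaining, " files still unprocessed")
}
```

If the count stays above zero after a full run, the remaining files are probably ones that fail every time and would be worth inspecting individually.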

iangow commented 6 years ago

Awesome. This is clear.

mccgr / edgar, issue #19