mccgr / edgar

Code to manage data related to SEC EDGAR

Create new version of download_filings.R. #3

Closed · iangow closed this issue 7 years ago

iangow commented 7 years ago

My sense is that with the edgar.filing_docs table in hand, we only really need this code, suitably modified.

iangow commented 7 years ago

I think the parent directory returned by this line plus document (see below) should give us a direct link to the document on EDGAR.

igow@igow-z640:~$ psql
psql (9.6.5)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

crsp=# SELECT * FROM edgar.filing_docs LIMIT 10;
 seq | description  |   document   |  type   |  size  |                 file_name                  
-----+--------------+--------------+---------+--------+--------------------------------------------
   1 | DEF 14A      | def14a.htm   | DEF 14A | 283947 | edgar/data/866121/0000912057-00-030142.txt
   2 | G1005693.JPG | g1005693.jpg | GRAPHIC |   7852 | edgar/data/866121/0000912057-00-030142.txt
   3 | G247396.JPG  | g247396.jpg  | GRAPHIC |   9197 | edgar/data/866121/0000912057-00-030142.txt
   4 | G814263.JPG  | g814263.jpg  | GRAPHIC |   9471 | edgar/data/866121/0000912057-00-030142.txt
   5 | G763843.JPG  | g763843.jpg  | GRAPHIC |   7673 | edgar/data/866121/0000912057-00-030142.txt
   6 | G972529.JPG  | g972529.jpg  | GRAPHIC |   3903 | edgar/data/866121/0000912057-00-030142.txt
   7 | G321067.JPG  | g321067.jpg  | GRAPHIC |   5449 | edgar/data/866121/0000912057-00-030142.txt
   8 | G836652.JPG  | g836652.jpg  | GRAPHIC |   5557 | edgar/data/866121/0000912057-00-030142.txt
   9 | G977359.JPG  | g977359.jpg  | GRAPHIC |   5809 | edgar/data/866121/0000912057-00-030142.txt
  10 | G440360.JPG  | g440360.jpg  | GRAPHIC |   5903 | edgar/data/866121/0000912057-00-030142.txt
(10 rows)

iangow commented 7 years ago

Note that the code linked to above is used here.

Here's the paradigm:

  1. Get a list of filings to process.
  2. Process filings (i.e., download documents).

In doing step 1, one could create a table (say, processed_documents) that keeps track of filings that have been processed and anti_join against that table. Normally it is pretty quick to check whether a file exists, so this may not be necessary (i.e., we could just check whether the file exists and, if not, download it). But that calculus may change if we're looking at 1.5 million filings (hence many millions of documents). We can defer resolution of this for a bit, as it should be pretty easy to produce a processed_documents table just by scanning the directory for downloaded files.
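
For step 1, a minimal sketch of what the anti_join might look like with dplyr, assuming a PostgreSQL connection (RPostgres here is just one option) and a processed_documents table keyed by file_name and document, both of which are assumptions at this stage:

library(DBI)
library(dplyr)
library(dbplyr)

# Connection details are assumptions; adjust to the actual setup.
pg <- dbConnect(RPostgres::Postgres())

filing_docs <- tbl(pg, in_schema("edgar", "filing_docs"))
processed   <- tbl(pg, in_schema("edgar", "processed_documents"))

# Step 1: documents whose filings have not yet been processed.
to_process <-
    filing_docs %>%
    anti_join(processed, by = c("file_name", "document")) %>%
    collect()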

jamespkav commented 7 years ago

Excellent. Will look in to it.

I believe I have adaptable code from when I rewrote the prior code to access the 8-Ks. I'll look to change this in line with the paradigm.

I think it will be worth constructing a "downloaded" database. Even searching my system for missing files (when I tried to plug gaps) took some time, so we'll want to avoid that for ease going forward.

iangow commented 7 years ago

Even searching my system for missing files (when I tried to plug gaps) took some time, so we'll want to avoid that for ease going forward.

I'd guess mere seconds for 100,000s of files. Perhaps longer on Windows or depending on how you did it.

But making the processed_documents table should be easy to do. The main thing will be to check carefully that a download has been successful before adding a document to the table. Also, it may be necessary to be able to build the table from scratch should there be a need to rebuild the archive, etc.
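
A rough sketch of rebuilding such a table by scanning the download directory, assuming documents are saved under a raw_directory using the same relative paths as on EDGAR (the column names here are assumptions):

library(dplyr)

rebuild_processed <- function(raw_directory = Sys.getenv("EDGAR_DIR")) {
    paths <- list.files(file.path(raw_directory, "edgar"), recursive = TRUE)
    tibble(path = file.path("edgar", paths),
           document = basename(path),
           downloaded = TRUE)
}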

iangow commented 7 years ago

If you put "Relates to #3" in your commits to this repository, then links to those commits will appear here.

iangow commented 7 years ago

Other verbs work (e.g., "Closes #3").

jamespkav commented 7 years ago

Do we want to keep the complete submission text file in the list for download?

iangow commented 7 years ago

In many cases they will be redundant. But in some cases the text file will be the only document, so it's necessary. It may be easier to just grab them all.

jamespkav commented 7 years ago

All good. Will keep them in. The code is essentially done, but I'm having trouble getting it to execute perfectly. I seem to have a "double lapply", or else it doesn't work. Going to sort that out...

iangow commented 7 years ago

Let me know when you push the code. OK if it isn't perfect when you do so.

I would focus first on creating a function that can take a single row from the filing_docs table (say file_name and document) and download a filing. Then work out how to lapply (or Map) that function to multiple rows. Then run it for a decent sample size (1e2, 1e3, 1e4). Then perhaps play with mclapply (or mcMap) to see how it goes with multiple threads. Once things are singing there, estimate how long it will take, how much room will be needed, then fire it off.
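
Here is a minimal sketch of that approach; the helper name, the directory layout, and the URL construction (EDGAR appears to serve attached documents from a directory named after the accession number with the dashes removed) are assumptions rather than the repository's actual code:

library(parallel)

get_filing_doc <- function(file_name, document,
                           raw_directory = Sys.getenv("EDGAR_DIR")) {
    # e.g., edgar/data/866121/0000912057-00-030142.txt + def14a.htm
    #   -> edgar/data/866121/000091205700030142/def14a.htm
    dir_part <- file.path(dirname(file_name),
                          gsub("(-|\\.txt)", "", basename(file_name)))
    url <- file.path("https://www.sec.gov/Archives", dir_part, document)
    target <- file.path(raw_directory, dir_part, document)
    dir.create(dirname(target), recursive = TRUE, showWarnings = FALSE)
    res <- try(download.file(url, target, quiet = TRUE), silent = TRUE)
    !inherits(res, "try-error") && res == 0 && file.exists(target)
}

# Start with a modest sample of rows from filing_docs (some_docs below),
# then scale up and parallelise:
# ok <- unlist(Map(get_filing_doc, some_docs$file_name, some_docs$document))
# ok <- unlist(mcMap(get_filing_doc, some_docs$file_name,
#                    some_docs$document, mc.cores = 4))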

jamespkav commented 7 years ago

Draft code has been pushed. It seems to work well, but is a tad slow, due to poor vectorisation I think. Will re-examine after teaching.

iangow commented 7 years ago

Note that the table edgar.filing_docs_processed will be tied to a specific machine where the files are located. So if you wanted to download the files to your machine, you'd probably need "local" versions of edgar.filing_docs and edgar.filing_docs_processed (I put "local" in quotes, because it's not necessary for these tables to live on the same machine, though that may be the sanest approach).

I'm happy to run the download on my machine, but there's probably some set-up to be done.

First, we should get you set up properly with GitHub on that machine so you can use RStudio projects, etc., more seamlessly.

Second, we'd need to set up a shared file location that we can both write to. You could try the following on my machine and see if you can write to that folder.

> raw_directory
[1] "/home/shared/"
> Sys.setenv(EDGAR_DIR="/home/shared")
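
A quick write test along those lines might look like this (a sketch; the throwaway file name is arbitrary):

test_file <- file.path(Sys.getenv("EDGAR_DIR"), "write_test.txt")
writeLines("test", test_file)
file.exists(test_file)   # TRUE if the write succeeded
unlink(test_file)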

This might be a temporary location because I will run out of space on that 1TB drive fairly quickly, I'd guess. I have a 2TB drive on that machine that I haven't even formatted.

Later on, we might want to work out how we could sync two "local" partial mirrors of EDGAR.

jamespkav commented 7 years ago

I now have GitHub properly set up on my machines. Took a little sorting, but it's all organised. RStudio is integrated properly.

Eventually, I presume we want to have a central location for these files, but I'm happy to run it on my machine for now and we can rsync it when we have a location sorted?

What we could do is relocate the 4TB (or is it 8TB) drive we bought for the MCCGR server (now defunct) from that computer to yours, format it, and run it all on there?

jamespkav commented 7 years ago

Re-worked code looks good Ian. Much more elegant on some of the functions and dplyr work.

iangow commented 7 years ago

I now have GitHub properly set up on my machines.

What about on 10.101.13.99? That's what you'd want for seamless creation of a repository on that machine.

Eventually, I presume we want to have a central location for these files, but I'm happy to run it on my machine for now and we can rsync it when we have a location sorted?

You still want to test it first (as described above). How long will it take? How much room will it need? (My experience is that sampling works pretty well for this, but do it by year, as filings have grown in size.)

What we could do is relocate the 4TB (or is it 8TB) drive we bought for the MCCGR server (now defunct) from that computer to yours, format it, and run it all on there?

Maybe. I could always just get another hard drive (external HDDs are generally very slow, especially cheap ones, but some would have the ability to become internal drives).

jamespkav commented 7 years ago

What about on 10.101.13.99? That's what you'd want for seamless creation of a repository on that machine.

I now have GitHub active on 10.101.13.99.

You still want to test it first (as described above). How long will it take? How much room will it need? (My experience is that sampling works pretty well for this, but do it by year, as filings have grown in size.)

I have timing sorted at the moment on my home machine. Ran a sample last night at around 2.9 files per second - not lightning, but alright. I'll sort through size issues, etc., next, and do some sampling on 10.101.13.99. However, it seems I don't have permission on the _processed table. Could you please grant rights so I can test using it?

jamespkav commented 7 years ago

How much room will it need?

As an indication, I just ran a size test on 1 million files and it came to 250GB. I believe we have 3.5 million files to trawl through at the moment, plus 1.5x more to come (I think), so we'd be looking at something around 1.5-2.5TB, correct?

iangow commented 7 years ago

250 × 3.5/1 × 1.5 ≈ 1.3TB. So I think we can just use the 2TB drive for now. (I don't think 10-K plus DEF 14A is too much and that would be the bulk of the data.) I just have to set that drive up.

iangow commented 7 years ago

I ran this, which should suffice:

ALTER TABLE edgar.filing_docs_processed OWNER TO edgar;

A line to this effect should go in the code where we create this table. For sanity, I like to maintain a one-to-one schema = repository = PostgreSQL role relationship. Sometimes I will have a role like edgar_access if I want to give read-only access to the data. Note that mccgr is a member of edgar and you are a member of mccgr. So we can just add roles to (or remove them from) mccgr to get the permissions right.
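
A minimal sketch of how that line might sit in the table-creation code (pg is assumed to be a DBI connection; the edgar_access grant is just the optional read-only case mentioned above):

library(DBI)

dbExecute(pg, "ALTER TABLE edgar.filing_docs_processed OWNER TO edgar")

# Optional: read-only access for a separate role
# dbExecute(pg, "GRANT SELECT ON edgar.filing_docs_processed TO edgar_access")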

There probably should be a script to run to give mccgr access to all appropriate repositories. But no hurry on this one.

jamespkav commented 7 years ago

250 × 3.5/1 × 1.5 ≈ 1.3TB.

Perhaps I didn't express that right. I think we have an additional 1.5x to add to our starting 1, so it should work out as 250 × 3.5/1 × 2.5 ≈ 2.2TB. We may need something slightly more than the 2TB. How large was that drive?

jamespkav commented 7 years ago

I like to maintain a one-to-one schema = repository = PostgreSQL role relationship... There probably should be a script to run to give mccgr access to all appropriate repositories. But no hurry on this one.

Sounds reasonable, although I'm not 100% sure what that means. I'll talk to you about it when you're in and can set up that code as required.

iangow commented 7 years ago

I think we have 1.5 million total and 0.5 million in filing_docs, so perhaps even more.

If we want to be space efficient, we could skip "complete submission text" files whenever there's more than one file in filing_docs for a given file_name. Also, we could skip images for now (for most purposes, image processing takes long enough that grabbing them from the web is not a constraint). All this is a bit fiddly, but would accelerate the download as well as save space.

You should be able to set up the documents for downloading with these restrictions using a little dplyr magic without too much effort.
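
A sketch of that filtering, assuming filing_docs has been collected into a local data frame with the columns shown earlier in this issue; the test used here for the complete submission text file (a document name matching the filing's own .txt name) is an assumption:

library(dplyr)

to_download <-
    filing_docs %>%
    group_by(file_name) %>%
    mutate(n_docs = n(),
           is_full_text = document == basename(file_name)) %>%
    ungroup() %>%
    # Keep the complete submission text file only when it is the sole document
    filter(!is_full_text | n_docs == 1) %>%
    # Skip images for now
    filter(type != "GRAPHIC")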

iangow commented 7 years ago

I like to maintain a one-to-one schema = repository = PostgreSQL role relationship.

There is a PostgreSQL database schema edgar, a GitHub repository edgar, and a PostgreSQL database role (like user) edgar that owns tables in edgar.

jamespkav commented 7 years ago

There is a PostgreSQL database schema edgar, a GitHub repository edgar, and a PostgreSQL database role (like user) edgar that owns tables in edgar.

Got it!

I think we have 1.5 million total and 0.5 million in filing_docs, so perhaps even more.

Sounds about right. I think it's 1.3m back to 2000, so another 200,000 files back to 1994.

If we want to be space efficient, we could skip "complete submission text" files whenever there's more than one file in filing_docs for a given file_name. Also we could skip images for now (for most purposes, image processing takes long enough that grabbing them from a web is not a constraint). All this is a bit fiddly, but would accelerate download time as well as saving space.

I'll adjust the code for that. I'll keep the remaining code there, but use # to comment it out.

> raw_directory
[1] "/home/shared/"
> Sys.setenv(EDGAR_DIR="/home/shared")

I'll use this to time the download and see.

iangow commented 7 years ago

I'll adjust the code for that. I'll keep the remaining code there, but use # to comment it out.

I think you would be adding code to do this. No need to go crazy with alternative versions of the code. Keep your commits discrete and focused and it's quite possible you could just "revert" a commit to go back to an older version.

jamespkav commented 7 years ago

I've submitted a collection of changes to filter the data set. Had some trouble this time with the Git interaction, so I'm not sure what it has sent through to you. There might be multiples of the same code.

jamespkav commented 7 years ago

Might we not adjust the code to attempt a download when the downloaded result is FALSE or ""? That way we can try to recapture ones where there might have been errors. Or is that unlikely?

iangow commented 7 years ago

Might we not adjust the code to attempt a download when the downloaded result is FALSE or ""? That way we can try to recapture ones where there might have been errors. Or is that unlikely?

I'm not sure exactly what you mean, but we could delete these cases later on and run the code again (note that the "" cases have already been deleted, but I think they were not .htm files, so they will not be downloaded by running the existing code again).

jamespkav commented 7 years ago

I believe we have more or less three categories in the column related to the downloaded files: YES, NO, and "". For efficiency, if we allow a download on YES and NO, then each time we run the code we won't have to delete failed downloads first; it should just try again until it becomes a YES. It becomes a little problematic, I guess, if we use small values for n in the code, as it might just grab these over and over again.

iangow commented 7 years ago

You don't need four backslashes where one will do.

You might delete cases where downloaded is FALSE and try again, but if they still fail, then you want to move along and not try to download in the future. So omitting such cases in future download attempts makes sense. At the very least, some investigation of the reasons for failure would be warranted before trying a third time.
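
A sketch of the one-off retry (assuming pg is a DBI connection and the processed table records success in a boolean downloaded column):

library(DBI)

# Drop the failed cases so the next run's anti_join picks them up again
dbExecute(pg, "DELETE FROM edgar.filing_docs_processed WHERE NOT downloaded")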

jamespkav commented 7 years ago

Ah... didn't realise the one would make it work. Got it now. Still getting used to markdown. True. We'll leave it as is.