nytimes / Fech

Deprecated. Please see https://github.com/dwillis/Fech for a maintained fork.
http://nytimes.github.io/Fech/

How do I grab all the data? #38

Closed saizai closed 11 years ago

saizai commented 12 years ago

I'd like to replicate the entire FEC filings database.

Currently I'm only scraping the committee & candidate master files (http://www.fec.gov/finance/disclosure/ftpdet.shtml).

  1. Is that totally mooted by the filings?
  2. Where can I get a list of all filings issued to date, or at least the first filing number? http://query.nictusa.com/rss/ can keep me up to date once I do have it all, but I need to get there first.
  3. How do you deal with amendments, bad data, etc? For instance, just from the master files, I note that there are lots of committees or candidates that cross-reference each other wrongly; lots of variance in how "no data" is represented (argh why can't people learn to use nil); various exceptions that broke my validations (eg an undocumented committee type "O" and candidate status "Q").

It seems that NYT's CampaignCash API does at least some of this processing; it would be nice if that code could be open sourced as well, since essentially I am replicating it. (Why: I just don't want to have a dependency on NYT's database. I have to store a bunch of it locally anyway, so I'd rather just have the whole damn thing locally and be able to run my own queries.)

dwillis commented 12 years ago

There are a lot of questions in here, so let me see if I can unpack some of them for you.

  1. The committee and candidate master files relate to the filings, but are separate from them. In other words, if you want to replicate the entire FEC filings database, you would need to also have the committee files as well (and probably the candidate files, too, for reference at least). It's not that the filings lack the information in the committee and candidate files; having those files around just makes it easier to answer specific questions about candidates and committees.
  2. A complete collection of the electronic filings is available here: ftp://ftp.fec.gov/FEC/electronic/, where daily zip files store all of the filings made on that date.
  3. Bad data is a very difficult issue. Fech doesn't attempt to deal with it very much at all, since the electronic filings are assumed to be "unofficial" by the FEC and therefore don't contain proper cross-references for other committees, as you've found.
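For anyone scripting the download, here is a minimal Python sketch that builds one URL per day under the FTP directory mentioned in point 2. It assumes the daily archives are named `YYYYMMDD.zip` (the naming that shows up later in this thread); `daily_zip_urls` is a hypothetical helper, not part of Fech, and some dates may have no archive at all.

```python
from datetime import date, timedelta

# Directory quoted in the comment above.
BASE_URL = "ftp://ftp.fec.gov/FEC/electronic/"


def daily_zip_urls(start, end):
    """Yield one candidate URL per calendar day from start to end, inclusive.

    Assumption: archives are named YYYYMMDD.zip. Days without filings may
    simply be missing on the server, so callers should tolerate failures.
    """
    day = start
    while day <= end:
        yield f"{BASE_URL}{day:%Y%m%d}.zip"
        day += timedelta(days=1)


if __name__ == "__main__":
    for url in daily_zip_urls(date(2001, 7, 19), date(2001, 7, 21)):
        print(url)
```

The generator only produces names; fetching and retry logic are left to whatever FTP client you prefer.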

Amendments are less of an issue, since the FEC's policy is that amendments fully replace the original version of a filing. So in the NYT API, for example, filings that subsequently have been amended are marked as such in responses. We do deal with this in our campaign finance app, as you surmise, and I have little objection to open sourcing that code, but right now it's not quite ready for prime time.
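Because an amendment fully replaces the filing it amends, a local copy only needs the newest version in each chain. The sketch below is one way to pick it, in Python. The record shape is an assumption made for illustration: each filing is a dict with a `filing_id` and an `amends` key (the id of the filing it replaces, or `None`), and higher filing ids are treated as newer. It is not how the NYT app does it.

```python
def latest_versions(filings):
    """Map each original filing id to the newest filing in its amendment chain.

    Assumed record shape (hypothetical): {"filing_id": int, "amends": int or None}.
    Assumption: a higher filing_id means a later submission.
    """
    by_id = {f["filing_id"]: f for f in filings}

    def root(filing_id):
        # Follow "amends" links back to the original; guard against cycles.
        seen = set()
        while filing_id not in seen:
            amended = by_id.get(filing_id, {}).get("amends")
            if amended is None:
                break
            seen.add(filing_id)
            filing_id = amended
        return filing_id

    latest = {}
    for filing in filings:
        original = root(filing["filing_id"])
        current = latest.get(original)
        if current is None or filing["filing_id"] > current["filing_id"]:
            latest[original] = filing
    return latest
```

Filings whose original is missing from the input are still grouped under that original's id, so partial downloads do not break the grouping.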

dwillis commented 12 years ago

Also, bear in mind that if you really mean a "complete" collection of filings, that includes filing versions that Fech does not support (prior to version 3).

saizai commented 12 years ago

I don't mind non-primetime code; if anything I would rather help you work on it than make my own. ;-) For instance, it'd definitely help to have some reasonable activerecord migrations and validation / collation logic to work from.

Is ftp://ftp.fec.gov/FEC/electronic actually a full listing of all (electronic) FEC filings created? I thought it was only partial.

How much do the master files overlap with the filing files? One concern for instance is how to get data for committees that don't file electronically; it's not even clear to me whether that's available online in any automation-friendly way.

Something that I want to do is to extend this code to also work for creating and uploading filings to the FEC, as well as to handle non-FEC electronic filings (in both directions) like California's, since it has essentially most of the same kinds of technical requirements aside from some minor differences in its actual formatting. (See http://www.sos.ca.gov/prd/electronic-filing-info/release-letter.htm for tech details.)

I probably wouldn't code a fully general creation/uploading component myself — my needs are for operating a nonconnected mostly-conduit-only PAC — but at least it'd be a start. This is also why I want to have relatively complete records — so that I can be able to list as potential recipients even relatively small PACs/candidates without having to do a whole bunch of manual work to enter and validate them.

saizai commented 12 years ago

Also, from the FEC's data blog, it looks like they just changed the format of their committee/candidate master files. Which means my scraping script is probably broken, blah…

saizai commented 12 years ago

And curious: going through all those zip files, there are some which have the same filename in more than one file. I wonder why that is; I thought they were supposed to be totally atomic.

dwillis commented 12 years ago

AFAIK, the FTP electronic filing files are a full list of those filings starting in 2001; there were a handful of filings submitted before that as part of a trial run by the FEC, but generally the electronic filing era begins in 2001. It is not a full listing of ALL filings, since Senate candidates and the two senatorial party committees (DSCC & NRSC) file on paper. However, all committees, regardless of how they file, are represented in the master files. Same with candidates. And the FEC keys contributions to Senate committees into the FTP version of its itemized data, so you could grab that as part of your process and integrate the two sources of data.

However, I don't have any plans for Fech to support this, as the FTP files are enormous and for other reasons. The same applies for Fech support for state-based systems; you're welcome to put that into your fork, of course, but at this point I'm not inclined to have that as part of the main library. And the same goes for uploading files to the FEC; I'm not interested in becoming an FEC vendor.

The change in formatting for the cmte/candidate master files goes into effect at the end of July, so you still have a few more weeks of working scrapers :-).

I wasn't aware that some of the zip files have filings with the same name - can you give me an example?

saizai commented 12 years ago

Could you point me specifically to what data files are involved in non-electronic itemized data? There are a lot of files on that ftp server. :-P

Would you be interested in adding master file scraping to this gem? That at least seems substantially related.

Why not add FEC file creation, uploading and PDF conversion? They might not be things you want to maintain, but you wouldn't necessarily have to; just label them "as is, please patch". :-P Even if I'm the one writing it, there are benefits to centralization, so that other people would hopefully start using and improving it also — and code wise it's a very substantial overlap, given that it's really just the reverse of what Fech currently does (i.e. turn a hash into a data file). It'd also make my continuing to contribute easier — and note that I just sent you 3 feature and 2 bug pulls, which is hopefully some evidence of good faith. ;-)

Zip collisions (I think all are colliding with the preceding zip; the list is not complete, since I switched to default overwrite):

  20010720.zip: 16852.fec
  20020201.zip: 26355.fec
  20020315.zip: 28579.fec
  20020404.zip: 30957.fec
  20020521.zip: 37249.fec
  20020626.zip: 40089.fec
  20020703.zip: 40479.fec
  20020816.zip: 46951.fec 46951.fec
  20020905.zip: 49143.fec
  20021011.zip: 52870.fec 52871.fec 52872.fec
  20021019.zip: 56632.fec 56632.fec
  20021024.zip: 59021.fec 59022.fec 59023.fec 59024.fec 59025.fec
  20021028.zip: 61086.fec 61086.fec 61086.fec
  20021029.zip: 61469.fec
  20021031.zip: 62284.fec
  20021101.zip: 62653.fec
  20021103.zip: 63305.fec
  20030102.zip: 69823.fec
  20030107.zip: 70190.fec
  20030114.zip: 68934.fec
  20030124.zip: 72047.fec
  20030129.zip: 72966.fec
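A list like this can be rebuilt without extracting anything, which avoids the overwrite problem. The following is a rough Python sketch using only the standard `zipfile` module; `find_collisions` is a hypothetical helper, and it reports a member name whenever it appears more than once, whether in the same archive or in different ones.

```python
import io
import zipfile
from collections import defaultdict


def find_collisions(archives):
    """Return {member_name: [archive labels]} for names seen more than once.

    `archives` is an iterable of (label, source) pairs, where source is a
    path or a binary file-like object accepted by zipfile.ZipFile.
    """
    seen = defaultdict(list)
    for label, source in archives:
        with zipfile.ZipFile(source) as zf:
            # namelist() keeps repeated entries, so in-archive duplicates count too.
            for name in zf.namelist():
                seen[name].append(label)
    return {name: labels for name, labels in seen.items() if len(labels) > 1}
```

Passing pairs such as `("20020816.zip", "/path/to/20020816.zip")` in date order keeps the labels in each result list in the order the archives were read.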

dwillis commented 12 years ago

I'll have more for you later, but the FTP files that contain non-electronic itemized data are the individual contribution and committee contribution files here: http://www.fec.gov/finance/disclosure/ftpdet.shtml. The non-electronic itemizations are mixed with the electronic ones in those files, which are updated weekly.

dwillis commented 12 years ago

And the filing submission/uploading is a non-starter for me. We're committed to maintaining all of the code in Fech, and becoming an approved vendor is not a goal of this project. I can see a limited case for scrapers for some of the master files, though. We should discuss it more.

saizai commented 12 years ago

Hm. How easy is it to at least cross-track the contribution master files with the f3x filings? I haven't tried scraping them yet, just the cand/comm masters (https://github.com/MakeYourLaws/MakeYourLaws/tree/master/app/models/fec - rough but it works; made it before finding out about Fech).

At the least we share a common goal of robust data collection. ;-)

dwillis commented 12 years ago

Both sets of data have two common fields, the filing id and the transaction id, so cross-referencing is pretty easy for F3s. You should check out the data dictionary for the individual and committee FTP files. They call the fields FILE_NUM and TRAN_ID, respectively.
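As a rough illustration of that join, here is a Python sketch that matches rows on the two shared fields. `FILE_NUM` and `TRAN_ID` are the FTP field names given above; the `filing_id` and `transaction_id` keys on the parsed-filing side are placeholders chosen for this example, so substitute whatever your parser actually emits.

```python
def cross_reference(ftp_rows, filing_rows):
    """Pair each parsed filing row with its FTP itemization, when one exists.

    ftp_rows: dicts with "FILE_NUM" and "TRAN_ID" (names from the FTP files).
    filing_rows: dicts with "filing_id" and "transaction_id" (hypothetical names).
    Returns (matched, unmatched): matched is a list of (filing_row, ftp_row).
    """
    # Compare keys as stripped strings so 46951 and "46951 " line up.
    ftp_index = {
        (str(r["FILE_NUM"]).strip(), str(r["TRAN_ID"]).strip()): r
        for r in ftp_rows
    }
    matched, unmatched = [], []
    for row in filing_rows:
        key = (str(row["filing_id"]).strip(), str(row["transaction_id"]).strip())
        if key in ftp_index:
            matched.append((row, ftp_index[key]))
        else:
            unmatched.append(row)
    return matched, unmatched
```

Keeping the unmatched rows separate makes it easy to spot filings that have not yet reached the weekly FTP files.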

dwillis commented 12 years ago

But more to the point of your question about having Fech be able to parse the FTP master files, our position has been that Fech should focus on doing one thing well - parsing electronic filings. Because of the data issues you've pointed out, and various editorial judgments involved in cross-referencing or aggregating data, functionality like that belongs in a separate library. We do not want to impose our conventions on others, which is why Fech doesn't "fix" data or make other decisions.