saizai closed 11 years ago
There are a lot of questions in here, so let me see if I can unpack some of them for you.
Amendments are less of an issue, since the FEC's policy is that amendments fully replace the original version of a filing. So in the NYT API, for example, filings that subsequently have been amended are marked as such in responses. We do deal with this in our campaign finance app, as you surmise, and I have little objection to open sourcing that code, but right now it's not quite ready for prime time.
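As a rough illustration of "amendments fully replace the original", here is a minimal sketch of keeping only the latest version of each filing. The record shape is hypothetical (`:id` for the filing number, `:amends` for the id of the filing being replaced); it is not taken from Fech or the NYT API. It also assumes that a higher filing id means a later submission.

```ruby
# Hypothetical records: :id is the filing number, :amends is the id of the
# filing this one replaces (nil for an original). Illustrative keys only.

# Walk the :amends chain back to the original filing's id.
def root_id(filing, by_id)
  seen = {}
  current = filing
  until current[:amends].nil? || seen[current[:id]]
    seen[current[:id]] = true
    parent = by_id[current[:amends]]
    return current[:amends] if parent.nil? # original is not in our collection
    current = parent
  end
  current[:id]
end

# Keep one filing per original: the version with the highest id
# (assumption: ids increase over time).
def current_filings(filings)
  by_id = filings.map { |f| [f[:id], f] }.to_h
  filings.group_by { |f| root_id(f, by_id) }
         .map { |_root, versions| versions.max_by { |f| f[:id] } }
end

filings = [
  { id: 100, amends: nil },
  { id: 150, amends: 100 },
  { id: 180, amends: 100 },
  { id: 200, amends: nil }
]
current_ids = current_filings(filings).map { |f| f[:id] } # => [180, 200]
```

Whether an amendment points at the original or at the previous amendment, both cases resolve to the same root here.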
Also, bear in mind that if you really mean a "complete" collection of filings, that includes filing versions that Fech does not support (prior to version 3).
I don't mind non-primetime code; if anything I would rather help you work on it than make my own. ;-) For instance, it'd definitely help to have some reasonable activerecord migrations and validation / collation logic to work from.
Is ftp://ftp.fec.gov/FEC/electronic actually a full listing of all (electronic) FEC filings created? I thought it was only partial.
How much do the master files overlap with the filing files? One concern for instance is how to get data for committees that don't file electronically; it's not even clear to me whether that's available online in any automation-friendly way.
Something I want to do is extend this code to also create and upload filings to the FEC, as well as to handle non-FEC electronic filings (in both directions) like California's, since those have most of the same technical requirements aside from some minor formatting differences. (See http://www.sos.ca.gov/prd/electronic-filing-info/release-letter.htm for tech details.)
I probably wouldn't code a fully general creation/uploading component myself (my needs are for operating a nonconnected mostly-conduit-only PAC), but at least it'd be a start. This is also why I want relatively complete records: so that I can list even relatively small PACs/candidates as potential recipients without a whole bunch of manual work to enter and validate them.
Also, from the FEC's data blog, it looks like they're changing the format of their committee/candidate master files. Which means my scraping script is probably about to break, blah…
And curious: going through all those zip files, there are some which have the same filename in more than one file. I wonder why that is; I thought they were supposed to be totally atomic.
AFAIK, the ftp electronic filing files are a full list of those filings starting in 2001; there were a handful of filings submitted before that as part of a trial run by the FEC, but generally the electronic filing era begins in 2001. It is not a full listing of ALL filings, since Senate candidates and the two senatorial party committees (DSCC & NRSC) file on paper. However, all committees, regardless of how they file, are represented in the master files. Same with candidates. The FEC also keys contributions to Senate committees into the FTP version of its itemized data, so you could grab that as part of your process and integrate the two sources of data.
However, I don't have any plans for Fech to support this, as the FTP files are enormous and for other reasons. The same applies for Fech support for state-based systems; you're welcome to put that into your fork, of course, but at this point I'm not inclined to have that as part of the main library. And the same goes for uploading files to the FEC; I'm not interested in becoming an FEC vendor.
The change in formatting for the cmte/candidate master files goes into effect at the end of July, so you still have a few more weeks of working scrapers :-).
I wasn't aware that some of the zip files have filings with the same name - can you give me an example?
Could you point me specifically to what data files are involved in non-electronic itemized data? There are a lot of files on that ftp server. :-P
Would you be interested in adding master file scraping to this gem? That at least seems substantially related.
Why not add FEC file creation, uploading and PDF conversion? They might not be things you want to maintain, but you wouldn't necessarily have to; just label them "as is, please patch". :-P Even if I'm the one writing it, there are benefits to centralization, so that other people would hopefully start using and improving it also — and code wise it's a very substantial overlap, given that it's really just the reverse of what Fech currently does (i.e. turn a hash into a data file). It'd also make my continuing to contribute easier — and note that I just sent you 3 feature and 2 bug pulls, which is hopefully some evidence of good faith. ;-)
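To make the "turn a hash into a data file" idea concrete, here is a minimal sketch of writing one row. It is not part of Fech. The field names and their order are hypothetical, and the delimiter is an assumption (ASCII 28, the file separator character); the real column order and delimiter would have to come from the FEC's format spec for the target version.

```ruby
# Assumed delimiter: ASCII 28 (file separator). Check the FEC format spec
# for the version you are targeting before relying on this.
FIELD_SEPARATOR = "\x1C"

# Serialize a hash into one delimited line, in the given field order.
# Missing fields become empty strings; stray separators inside values are
# stripped so they cannot shift the columns.
def to_fec_line(row, field_order)
  field_order
    .map { |field| row[field].to_s.delete(FIELD_SEPARATOR) }
    .join(FIELD_SEPARATOR)
end

# Hypothetical field order, for illustration only.
order = [:form_type, :filer_id, :amount]
line = to_fec_line({ form_type: "SA11AI", amount: 250 }, order)
```

Reading the line back is just `line.split(FIELD_SEPARATOR, -1)`, which is the direction a parser already handles.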
Zip collisions (I think all are colliding w/ the preceding zip, and the list is not complete - I switched to default overwrite):
20010720.zip 16852.fec
20020201.zip 26355.fec
20020315.zip 28579.fec
20020404.zip 30957.fec
20020521.zip 37249.fec
20020626.zip 40089.fec
20020703.zip 40479.fec
20020816.zip 46951.fec 46951.fec
20020905.zip 49143.fec
20021011.zip 52870.fec 52871.fec 52872.fec
20021019.zip 56632.fec 56632.fec
20021024.zip 59021.fec 59022.fec 59023.fec 59024.fec 59025.fec
20021028.zip 61086.fec 61086.fec 61086.fec
20021029.zip 61469.fec
20021031.zip 62284.fec
20021101.zip 62653.fec
20021103.zip 63305.fec
20030102.zip 69823.fec
20030107.zip 70190.fec
20030114.zip 68934.fec
20030124.zip 72047.fec
20030129.zip 72966.fec
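A list like this can be produced without extracting anything, which avoids depending on overwrite behaviour. A minimal sketch, assuming you already have each archive's entry names as an array (the archive and entry names below are made up; reading real zips would need a zip library such as rubyzip):

```ruby
# archives: { "zip name" => ["entry name", ...] }
# Returns { "entry name" => ["zip name", ...] } for every entry name that
# appears more than once, whether in different zips or twice in one zip.
def duplicate_entries(archives)
  seen_in = Hash.new { |hash, name| hash[name] = [] }
  archives.each do |zip_name, entries|
    entries.each { |entry| seen_in[entry] << zip_name }
  end
  seen_in.select { |_entry, zips| zips.size > 1 }
end

# Made-up listing, for illustration only.
archives = {
  "a.zip" => ["1.fec", "2.fec"],
  "b.zip" => ["2.fec", "3.fec", "3.fec"]
}
collisions = duplicate_entries(archives)
```

Here `2.fec` is reported as colliding across `a.zip` and `b.zip`, and `3.fec` as appearing twice inside `b.zip`.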
I'll have more for you later, but the FTP files that contain non-electronic itemized data are the individual contribution and committee contribution files here: http://www.fec.gov/finance/disclosure/ftpdet.shtml The non-electronic itemizations are mixed with the electronic ones in those files, which are updated weekly.
And the filing submission/uploading is a non-starter for me. We're committed to maintaining all of the code in Fech, and becoming an approved vendor is not a goal of this project. I can see a limited case for scrapers for some of the master files, though. We should discuss it more.
Hm. How easy is it to at least cross-track the contribution master files with the f3x filings? I haven't tried scraping them yet, just the cand/comm masters (https://github.com/MakeYourLaws/MakeYourLaws/tree/master/app/models/fec - rough but it works; made it before finding out about Fech).
At the least we share a common goal of robust data collection. ;-)
Both sets of data have two common fields, the filing id and the transaction id, so cross-referencing is pretty easy for F3s. You should check out the data dictionary for the individual and committee FTP files. They call the fields FILE_NUM and TRAN_ID, respectively.
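A minimal sketch of that cross-reference, assuming both sources have already been parsed into hashes. `FILE_NUM` and `TRAN_ID` are the FTP column names mentioned above; the `:filing_id` and `:transaction_id` keys on the filing side are hypothetical stand-ins for whatever your parser produces.

```ruby
# Pair each FTP itemization row with the matching row from a parsed
# electronic filing, joining on (filing id, transaction id).
# Unmatched FTP rows (e.g. paper filings) are paired with nil.
def cross_reference(ftp_rows, filing_rows)
  index = filing_rows.each_with_object({}) do |row, memo|
    # Normalize to strings so "12345" and 12345 compare equal.
    memo[[row[:filing_id].to_s, row[:transaction_id].to_s]] = row
  end

  ftp_rows.map do |ftp|
    key = [ftp["FILE_NUM"].to_s, ftp["TRAN_ID"].to_s]
    [ftp, index[key]]
  end
end

# Made-up rows, for illustration only.
filing_rows = [{ filing_id: 12345, transaction_id: "A1", amount: 250 }]
ftp_rows = [
  { "FILE_NUM" => "12345", "TRAN_ID" => "A1" },
  { "FILE_NUM" => "99999", "TRAN_ID" => "B2" }
]
pairs = cross_reference(ftp_rows, filing_rows)
```

Rows that come back paired with `nil` are the ones with no electronic counterpart in the set you loaded.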
But more to the point of your question about having Fech be able to parse the FTP master files, our position has been that Fech should focus on doing one thing well - parsing electronic filings. Because of the data issues you've pointed out, and various editorial judgments involved in cross-referencing or aggregating data, functionality like that belongs in a separate library. We do not want to impose our conventions on others, which is why Fech doesn't "fix" data or make other decisions.
I'd like to replicate the entire FEC filings database.
Currently I'm only scraping the committee & candidate master files (http://www.fec.gov/finance/disclosure/ftpdet.shtml).
It seems that NYT's CampaignCash API does at least some of this processing; it would be nice if that code could be open sourced as well, since essentially I am replicating it. (Why: I just don't want to have a dependency on NYT's database. I have to store a bunch of it locally anyway, so I'd rather just have the whole damn thing locally and be able to run my own queries.)