nytimes / Fech

Deprecated. Please see https://github.com/dwillis/Fech for a maintained fork.
http://nytimes.github.io/Fech/
Other
115 stars 30 forks source link

Dealing with ActBlue (and similar) filings #37

Closed dwillis closed 11 years ago

dwillis commented 12 years ago

Filings by ActBlue, one of the largest conduit committees, are enormous, and can take minutes to parse Schedule A & B (one recent filing has 152k itemized contributions). I've tried grabbing specific fields, but it still takes awhile. Anyone have any ideas?

saizai commented 12 years ago

You only really have three options AFAICT: a) ignore some of the data b) put it in some sort of parallel process so you don't mind that it takes forever c) run it through a profiler, figure out the bottlenecks, and commit an improvement. :-)

Also I'd suggest using activerecord-import if you're using AR. Be sure that the version you get includes the patch I made to it which speeds up imports almost 2x.

dwillis commented 12 years ago

Yeah, we are using ar-import on the backend, but this seems to be more of an issue with Fech iterating over the rows in the filing. So I think I'll try some variations on a) first and go from there.

saizai commented 12 years ago

Don't be afraid of (c).

Here's a preliminary (excerpted the major bits), ruby-prof format:

87.73% 0.50% 386.28 2.22 0.00 384.06 63685 People::NameParser#parse 173.73 53.25 0.00 120.48 63685/63685 People::NameParser#get_name_parts 106.09 0.15 0.00 105.94 63685/63685 People::NameParser#get_title 42.82 0.17 0.00 42.65 63685/63685 People::NameParser#get_suffix

120.24 1.22 0.00 119.01 605330/7280981 People::NameParser#get_name_parts 44.18% 2.40% 194.52 10.56 0.00 183.96 7280981 String#match

87.26 13.18 0.00 74.08 6720823/6882864 Array#each 20.63% 3.04% 90.84 13.39 0.00 77.45 6882864 *Class#new 67.79 67.79 0.00 0.00 6766666/6766666 Regexp#initialize

… so People::NameParser is far overwhelming any other bottlenecks.

saizai commented 12 years ago

https://github.com/mericson/people/blob/master/lib/people.rb seems like it has some things that could be optimized.

For get_name_parts, it seems like a clever reworking of the regex could turn all those multiple regexen (which are each a separate string scan) into a single scan whose ( ) results can be used to determine which pattern got hit.

For get_title & get_suffix, it should not be rebuilding a new Regex each time. That should be at most a fixed class array of initialized regexes; better, a single regex with multiple or'd captures so it does a single scan. That's an easy fix, would make for a 38% speed improvement.

/cc @mericson

saizai commented 12 years ago

… also, its clean function is just plain wrong. Not only does it not support UTF-8, it doesn't even support ASCII like ñ (which is a perfectly legal character in names, like nearly all are).

Its name parsing is also wrong. For instance, my name — "Sai" is my full name; I have no other name. get_name_parts returns my name as all empty strings, even if you pass the no_last_name true parameter.

saizai commented 12 years ago

Turning off the names parser results in a drastic speed improvement, as predicted by the profiling — 34 vs 440 seconds for the same data set on my dev machine.

There's more that could be improved within that, but not a whole lot; I would expect other things like activerecord outweigh it if get_name_parts isn't the bottleneck. (My profiling was done with no activerecord involvement, just reading fec files from disk fully including rows.)

dwillis commented 12 years ago

This is good stuff, thank you! I'll see what our options are for reducing or eliminating the use of the name parser - some filings split the names into fields already.

saizai commented 12 years ago

I would suggest that you also try to just be a good OSS citizen and improve the Person code. :P