openva / crump

A parser for the Virginia State Corporation Commission's business registration records.
https://vabusinesses.org/
MIT License

Optimize the code #104

Open waldoj opened 9 years ago

waldoj commented 9 years ago

This just takes too long to run. It's not a problem (except when debugging), but it just ain't right. Figure out how to speed this up. It shouldn't be hard.

waldoj commented 8 years ago

This has become a more pressing need. For the geocoder, too.

waldoj commented 8 years ago

Maybe we could use the date fields? If the date hasn't changed since date X, ignore the record? That would speed up Crump a lot.
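A minimal sketch of that idea on toy data, assuming the date field compares correctly as an ISO string. The column position and comma-delimited layout are purely illustrative, not crump's actual record format:

```shell
# Toy rows: id,name,last_updated (field position is illustrative only)
printf '1,Acme,2015-01-03\n2,Beta,2014-06-01\n3,Gamma,2015-02-10\n' > records.csv

# Keep only records whose date field is on or after the cutoff;
# ISO-8601 dates compare correctly as plain strings.
awk -F, -v since='2015-01-01' '$3 >= since' records.csv > recent.csv
cat recent.csv
```

Everything before the cutoff is skipped entirely, so downstream parsing only touches records that could have changed.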

waldoj commented 8 years ago

I just realized how to make the geocoder a lot faster: actually use the --since flag. That'll just require a simple modification to the update script, to pass it the current date minus 8 days.
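A sketch of that modification, assuming GNU date (the -d relative-date syntax is GNU-specific); the parser invocation in the comment is hypothetical:

```shell
# GNU date: compute today minus 8 days in ISO format
SINCE=$(date -d "8 days ago" +%Y-%m-%d)
echo "$SINCE"

# The update script would then pass this to the parser, e.g.
# (hypothetical invocation):
#   python crump.py --since "$SINCE"
```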

waldoj commented 8 years ago

OK, the geocoder has been sped up markedly with that change.

waldoj commented 8 years ago

Maybe compare the file to the prior week's file, and diff them? Then only operate on the diff?

waldoj commented 8 years ago

comm might be the tool for that, but it requires that the files be sorted. And sorting a multi-million-line file isn't nothing.
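A sketch of the sort-and-comm approach on toy data standing in for two weekly extracts (the real files are millions of lines); file names are illustrative:

```shell
# Toy data standing in for two weekly extracts
printf 'a,1\nb,2\nc,3\n' > old_week.csv
printf 'a,1\nb,9\nc,3\nd,4\n' > new_week.csv

export LC_ALL=C   # byte-order collation, so sort and comm agree on ordering
sort old_week.csv > old_week.sorted
sort new_week.csv > new_week.sorted

# comm requires sorted input; -3 suppresses lines common to both files,
# leaving only lines unique to one file or the other
comm -3 old_week.sorted new_week.sorted > changed_lines.txt
cat changed_lines.txt
```

Here the changed record appears twice (b,2 from the old file, b,9 from the new) and the new record once (d,4), which matches the accounting in the comment below.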

waldoj commented 8 years ago

I took the raw data from March and the data from this week, sorted each file, and used comm to compare them. (This was quite fast; the whole operation took maybe 3 minutes.) This found 493,790 lines that were unique to one file or the other, out of 1,809,192 lines total.

A record that changed is listed twice: once from the old file and once from the new file. I think it's plausible that 13.5% of all records changed in an 8-month span. That's far more than I would have guessed, but I think it's within the realm of possibility. FWIW, 33,358 of those lines are new records, which appear only once (in the new file), so that's (493,790 − 33,358) / 2 = 230,216 records changed, or 12.7% of them. If only 19% of the businesses in these records are still active, then a 12.7% change rate means that most active records were modified over the 8-month span, which is plausible even if the only change was each record's date stamp.

So if this data is OK (and it may well be), the next challenge is to figure out what to do with it.
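One possibility, sketched on toy comm output: keep only the new-file versions of the changed lines (comm prefixes lines unique to the second file with a tab) and feed just those records back through the parser and geocoder. The file names are illustrative:

```shell
# Toy comm -3 output: untabbed lines came from the old file,
# tab-indented lines from the new file (comm's column convention)
printf 'b,2\n\tb,9\n\td,4\n' > changed_lines.txt

# Keep only the new-file versions and strip the leading tab; these are
# the changed or new records worth re-parsing
grep $'^\t' changed_lines.txt | cut -c2- > records_to_reparse.txt
cat records_to_reparse.txt
```

Old-file-only lines (records that were deleted or superseded) would need separate handling, but the re-parse set shrinks from 1.8 million lines to a few hundred thousand.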