Open · waldoj opened this issue 9 years ago
This has become a more pressing need. For the geocoder, too.
Maybe we could use the date fields? If the date hasn't changed since date X, ignore the record? That would speed up Crump a lot.
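A minimal sketch of that idea, assuming each record carries a last-updated date column (the `updated_date` field name and the date format below are hypothetical):

```python
import csv
from datetime import datetime, timedelta

# Hypothetical cutoff: only process records updated in the last 8 days.
cutoff = datetime.now() - timedelta(days=8)

def changed_records(path, date_field="updated_date", date_format="%Y-%m-%d"):
    """Yield only the rows whose date field is newer than the cutoff."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                updated = datetime.strptime(row[date_field], date_format)
            except (KeyError, ValueError):
                # If the date is missing or malformed, err on the side of processing the row.
                yield row
                continue
            if updated >= cutoff:
                yield row
```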
I just realized how to make the geocoder a lot faster: actually use the `--since` flag. That'll just require a simple modification to the update script, to pass the date (minus 8 days) to it.
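Something like the following sketch; the `--since` flag is the one mentioned above, but the `geocode.py` script name and invocation are assumptions:

```python
from datetime import date, timedelta
import subprocess

# Compute the date 8 days ago and hand it to the geocoder's --since flag.
# Adjust the script name and argument style to match the real update script.
since = (date.today() - timedelta(days=8)).isoformat()
subprocess.run(["python", "geocode.py", "--since", since], check=True)
```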
OK, the geocoder has been sped up markedly with that change.
Maybe compare the file to the prior week's file, and diff them? Then only operate on the diff?
`comm` might be the tool for that, but it requires that the files be sorted. And sorting a multi-million-line file isn't nothing.
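A sketch of that pipeline, shelling out to `sort` and `comm` from Python; the file names are placeholders:

```python
import os
import subprocess

# Force byte-order collation so sort and comm agree on ordering.
env = {**os.environ, "LC_ALL": "C"}

# Sort both exports to temporary files.
for src, dst in [("march.csv", "march.sorted"), ("current.csv", "current.sorted")]:
    with open(dst, "w") as out:
        subprocess.run(["sort", src], stdout=out, env=env, check=True)

# comm -3 suppresses lines common to both files, leaving only the differences;
# -23 would give lines only in the old file, -13 only lines in the new one.
diff = subprocess.run(
    ["comm", "-3", "march.sorted", "current.sorted"],
    env=env, capture_output=True, text=True, check=True,
)
changed_or_new = [line.lstrip("\t") for line in diff.stdout.splitlines()]
print(f"{len(changed_or_new):,} lines unique to one file or the other")
```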
I took the raw data from March and the data from this week, sorted each file, and used `comm` to compare them. (This was quite fast; the whole operation took maybe 3 minutes.) It found 493,790 lines that were unique to one file or the other, out of 1,809,192 lines. A record that changes would be listed twice, once from the old file and once from the new file, so I think it's plausible that 13.5% of all records changed in an 8-month span. That's far more than I would have guessed, but I think it's within the realm of possibility. FWIW, there are 33,358 new records, which appear only once; subtracting those and halving what's left gives (493,790 - 33,358) / 2 = 230,216 records changed, or 12.7%. If only 19% of all businesses within these records are still active, then we'd expect something like 12.7% of records to change over a given 8-month span, even if only in their date stamps.
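As a quick sanity check on those figures:

```python
# Back-of-envelope check of the numbers above.
unique_lines = 493_790   # lines appearing in only one of the two sorted files
total_lines = 1_809_192  # lines in a full export
new_records = 33_358     # records present only in the newer file

# A changed record contributes two unique lines (old version + new version);
# a new record contributes only one.
changed = (unique_lines - new_records) // 2
print(changed, changed / total_lines)  # 230216, ~0.127
```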
So if this data is OK—and it may well be—the next challenge is to figure out what to do with it.
This just takes too long to run. It's not a problem (except when debugging), but it just ain't right. Figure out how to speed this up. It shouldn't be hard.