michalgm / state_dem

Oil Change International State DEM Project

Alaska needs more data cleaning #30

Closed: skyebend closed this issue 11 years ago

skyebend commented 11 years ago

I see nodes for "President", "Self", "Executive", "administrator", "self-employed", "retired", and "N/A". Didn't we get rid of all of these already?

michalgm commented 11 years ago

Sigh. OK, here's the deal: those are all coded as e11% or e1210, so they belong in the database. The issue for most of them is that the wrong field is being used as the company name. In the NIMSP data, the company name could be in any one of four fields. I already go through a complicated process to fill in the company_name field, stepping through each field in order of its usefulness and checking whether it matches any known company:

dbwrite("update contributions_dem a join $companies_table on parent_organization_name = name set a.company_name = name, a.company_id = match_id where company_id is null and ignore_all_contribs = 0 and match_contribs_on_name = 1");

This is repeated for organization_name, contributor_employer, and contributor_occupation. After that, the record is assumed to belong to an unknown company, and I fill in the first field that's not blank. However, some records have all of those fields blank but are still somehow coded by NIMSP, so I have an even more complicated process to try to get some sort of useful value from the contributor_name field.
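As a rough illustration of the cascade described above, here is a minimal sketch in Python rather than the project's own PHP and SQL. The field order comes from this thread; the function name `resolve_company`, the `known_companies` mapping, and the exact-match comparison are assumptions made for the example, not the actual import code.

```python
# Hypothetical sketch of the field cascade; not the project's import code.
FIELDS = (
    "parent_organization_name",
    "organization_name",
    "contributor_employer",
    "contributor_occupation",
)

def resolve_company(record, known_companies):
    """Return (company_name, company_id) for one contribution record.

    known_companies is an assumed mapping of company name -> match_id.
    """
    # Step 1: take the first field, in priority order, that names a known company.
    for field in FIELDS:
        value = (record.get(field) or "").strip()
        if value in known_companies:
            return value, known_companies[value]
    # Step 2: nothing matched, so treat it as an unknown company and
    # use the first non-blank field as its name.
    for field in FIELDS:
        value = (record.get(field) or "").strip()
        if value:
            return value, None
    # Every field is blank; the contributor_name handling is not sketched here.
    return None, None
```

A record whose only populated field is contributor_occupation falls through step 1 and gets that occupation as its company name, which is how values such as "retired" can end up as nodes.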

Anyway, the issue is that things like 'SELF' and 'CEO' are known company names, and thus get filled in before the process reaches something more useful. The correct thing to do would be to add "and match_contribs_on_name=1" to those queries (and then probably do another pass of the same thing without match_contribs_on_name). However, that field is only flagged as true for 963 companies. For the other 10544, it is set to 0, since that is our default for new companies.
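The two-pass idea mentioned above could look something like the following sketch, again in Python rather than the project's PHP and SQL. The `companies` mapping, the function name `match_two_pass`, and the sample names are all hypothetical; only the field order and the match_contribs_on_name flag come from this thread.

```python
# Hypothetical sketch of a two-pass match; names and data are illustrative only.
FIELDS = (
    "parent_organization_name",
    "organization_name",
    "contributor_employer",
    "contributor_occupation",
)

def match_two_pass(record, companies):
    """companies maps name -> (match_id, match_contribs_on_name flag)."""
    # Pass 1 accepts only flagged companies; pass 2 accepts any known company.
    for require_flag in (True, False):
        for field in FIELDS:
            name = (record.get(field) or "").strip()
            if name not in companies:
                continue
            match_id, flagged = companies[name]
            if flagged or not require_flag:
                return name, match_id
    return None, None

# 'SELF' is a known but unflagged name; 'Acme Oil' is a made-up flagged company.
companies = {"SELF": (1, 0), "Acme Oil": (2, 1)}
record = {"organization_name": "SELF", "contributor_employer": "Acme Oil"}
result = match_two_pass(record, companies)  # picks 'Acme Oil' in pass 1
```

This only helps when the more useful name is itself flagged. If 'Acme Oil' were one of the many companies left at the default of 0, pass 1 would find nothing and pass 2 would still pick 'SELF' first, which is the weakness described above.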

I have a few ideas on how to get around this, but none of them are great...

Anyway, when you get a chance, call or text me (Skye)? I'd like to talk this through.

On 07/25/2013 02:13 PM, Skye Bender-deMoll wrote:

I see nodes for "President", "Self", "Executive", "administrator", "self-employed", "retired", and "N/A". Didn't we get rid of all of these already?

michalgm commented 11 years ago

OK, I've resolved this by flagging all these sorts of names as 'non_company_name' and altering the import queries to skip those names.
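For readers curious what such a change might look like, here is a minimal sketch using Python and an in-memory SQLite database instead of the project's PHP and MySQL. The table layout, the placement of non_company_name as a column on the companies table, the helper `fill_from`, and the sample rows are all assumptions; the real import queries are not shown in this thread beyond the one quoted earlier.

```python
import sqlite3

# Hypothetical schema and data; only the non_company_name flag is from the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (
        match_id INTEGER, name TEXT, non_company_name INTEGER DEFAULT 0
    );
    CREATE TABLE contributions_dem (
        id INTEGER PRIMARY KEY,
        organization_name TEXT,
        contributor_employer TEXT,
        company_name TEXT,
        company_id INTEGER
    );
    INSERT INTO companies VALUES (1, 'SELF', 1), (2, 'Acme Oil', 0);
    INSERT INTO contributions_dem (organization_name, contributor_employer)
        VALUES ('SELF', 'Acme Oil');
""")

def fill_from(field):
    # field is taken from the fixed tuple below, never from user input.
    conn.execute(f"""
        UPDATE contributions_dem
        SET company_id = (SELECT match_id FROM companies
                          WHERE name = contributions_dem.{field}
                            AND non_company_name = 0),
            company_name = {field}
        WHERE company_id IS NULL
          AND EXISTS (SELECT 1 FROM companies
                      WHERE name = contributions_dem.{field}
                        AND non_company_name = 0)
    """)

# Step through the candidate fields in priority order, skipping flagged names.
for field in ("organization_name", "contributor_employer"):
    fill_from(field)

row = conn.execute(
    "SELECT company_name, company_id FROM contributions_dem"
).fetchone()
```

In this sketch the 'SELF' value in organization_name is skipped because it is flagged, so the record is matched on contributor_employer instead.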