palewire / django-calaccess-campaign-browser

A Django app to refine, review and republish campaign finance data drawn from the California Secretary of State’s CAL-ACCESS database
http://django-calaccess-campaign-browser.californiacivicdata.org
MIT License

Develop a system for standardizing contributor names #173

Open palewire opened 9 years ago

palewire commented 9 years ago

À la this

miketahani commented 9 years ago

Just saw this on NICAR-L, might be relevant if you're willing to pull in a third-party dependency.

https://github.com/datamade/probablepeople

palewire commented 9 years ago

We're open to tools like that, but in addition to solutions to tricky problems like parsing names, we also need a framework or superstructure on top of our models that makes it happen. That's what we're looking for here.

JoeGermuska commented 9 years ago

I've always figured that a step like this requires human review. Is that part of what you mean by "framework" or "superstructure"? Or do you think something automated could be good enough?

palewire commented 9 years ago

Sure, that and whatever intermediary steps need to go into the process. This area is @anthonyjpesce's interest, so he knows more about it than I do.

Here are the things that jump to mind:

I doubt all that stuff ultimately needs to live in this repository. But one thing I think this repo can work out right away is how to write the proper database query to pull out the raw materials necessary to do the sexy stuff.

rkiddy commented 9 years ago

I have started to do something here and will put together a pull request when it is workable. It is not yet very efficient. Also, should this actually be in the browser project rather than the raw-data project?

My approach so far has been to:

1) Create a names table. Right now this has the columns:

id - int, primary key
ext_table - varchar, the name of the table from which the name comes
ext_pref - varchar, the prefix of the columns, "filer" or "intr" or ...
ext_pk - int, the pk in the table
naml - varchar
namf - varchar
nams - varchar
namt - varchar
name - varchar
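
For reference, a rough Django model along these lines might look like the sketch below. Nothing here exists in the app yet; the model name, field lengths and index are just placeholders.

```python
# Hypothetical model mirroring the columns listed above; names and lengths
# are placeholders, not anything that ships with the app today.
from django.db import models


class ScrapedName(models.Model):
    ext_table = models.CharField(max_length=64)   # source table, e.g. "EXPN_CD"
    ext_pref = models.CharField(max_length=16)    # column prefix, e.g. "FILER" or "INTR"
    ext_pk = models.IntegerField()                # primary key of the source row
    naml = models.CharField(max_length=255, blank=True)  # last name, or full organization name
    namf = models.CharField(max_length=255, blank=True)  # first name
    nams = models.CharField(max_length=64, blank=True)   # suffix
    namt = models.CharField(max_length=64, blank=True)   # title
    name = models.CharField(max_length=512, blank=True)  # normalized full name

    class Meta:
        index_together = [("ext_table", "ext_pref", "ext_pk")]
```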

Example rows are:

*************************** 1. row ***************************
        id: 40010010
    ext_pk: 2605911
 ext_table: EXPN_CD
ext_prefix: PAYEE
      naml: CompleteCampaigns.com a Div. of Aristotle International Inc.
      namf: 
      nams: 
      namt: 
      name: CompleteCampaigns.com a Div. of Aristotle International Inc.
*************************** 2. row ***************************
        id: 1
    ext_pk: 1
 ext_table: CVR2_LOBBY_DISCLOSURE_CD
ext_prefix: ENTY
      naml: Hall
      namf: Joan
      nams: 
      namt: Ms.
      name: Ms. Joan Hall

I am creating the name column by concatenating namt, namf, naml and nams, then trimming the result and collapsing runs of spaces into one. And wow, does this suck up time and CPU cycles.
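
The normalization itself is simple; in Python it amounts to something like this sketch (the real bottleneck is doing it row by row):

```python
import re


def build_name(namt, namf, naml, nams):
    """Join the name parts in order, trim, and collapse repeated whitespace."""
    joined = " ".join(part for part in (namt, namf, naml, nams) if part)
    return re.sub(r"\s+", " ", joined).strip()


# build_name("Ms.", "Joan", "Hall", "") -> "Ms. Joan Hall"
```

Pushing as much of this as possible into a single SQL UPDATE (CONCAT_WS plus TRIM) would likely be cheaper than looping over rows in Python, though collapsing interior whitespace is easier on the Python side.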

I think it may be worth adding an 'is_person' column of some kind and using probablepeople to fill it in. That can then be fed to a de-duplicator. I have looked at https://github.com/datamade/dedupe, but it is a lot to figure out. For one thing, dedupe may already use probablepeople.
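
If we go that route, the probablepeople call is small; something along these lines could populate the flag (a sketch using the library's tag() helper):

```python
import probablepeople


def guess_is_person(raw_name):
    """Return True/False for an is_person flag, or None when the parse is ambiguous."""
    try:
        _, name_type = probablepeople.tag(raw_name)
    except probablepeople.RepeatedLabelError:
        return None  # leave ambiguous rows for human review
    return name_type == "Person"


# guess_is_person("Ms. Joan Hall") should come back True;
# "CompleteCampaigns.com a Div. of Aristotle International Inc." should not.
```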

Next, I think we need an "identity" table. As a first step, completely identical entries in the names table can be coalesced here. For example, there would be one "Ms. Joan Hall" identity, and all of the "Ms. Joan Hall" entries in the names table, and ultimately in the other tables, would point to it. This table would be for what we think are "identities," as distinct from any particular name.
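
As a sketch of that first pass, and assuming the names model above plus a hypothetical Identity model with a name column and a nullable identity foreign key on the names table, the coalescing step could be as blunt as:

```python
# Blunt first pass: one Identity per distinct normalized name string, with
# every matching names-table row pointed at it. Model and field names are
# the hypothetical ones from the sketches above.
from django.db import transaction


@transaction.atomic
def coalesce_exact_matches():
    distinct_names = (
        ScrapedName.objects.exclude(name="")
        .values_list("name", flat=True)
        .distinct()
    )
    for value in distinct_names:
        identity, _ = Identity.objects.get_or_create(name=value)
        ScrapedName.objects.filter(name=value).update(identity=identity)
```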

There are more things to bring up, but this is becoming too much at once.

anthonyjpesce commented 9 years ago

That's a good start. First, I think it's a good idea to keep an analysis like this out of the raw data app -- the browser is probably the best place for now, though we may want to change that in the future.

Second, I think we'll probably have to take more than the name into account. It's a good place to start, but we would see a lot of false positives. Ideally I think we'd look at name, employer, occupation, zip code, and city -- any potentially identifying information.

This wouldn't capture people who file from work and from home, or people who move, or people who have nicknames or misspellings. For anything more sophisticated we would probably have to train a classifier -- but finding instances where all of the fields line up exactly is a good place to begin.

Also, we should be able to tell companies from people pretty easily -- companies typically do not have an employer or occupation listed, while individuals do.
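
A first-cut version of that exact-match pass, written against a hypothetical contributions model (the field names below are placeholders, not the app's actual schema), might look like:

```python
# Group contributions on every identifying field we have and treat each
# multi-record group as one likely donor. Field names are placeholders.
from django.db.models import Count


def exact_match_groups():
    return (
        Contribution.objects
        .values("contributor_name", "employer", "occupation", "city", "zip_code")
        .annotate(record_count=Count("id"))
        .filter(record_count__gt=1)
        .order_by("-record_count")
    )


def looks_like_organization(row):
    # Rough heuristic from above: individuals usually report an employer and
    # occupation, organizations usually leave both blank.
    return not row["employer"] and not row["occupation"]
```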

palewire commented 9 years ago

FWIW, outside of the packaged app, in our developer-only code, I've drafted a rough query to dump all the unique contributor names.
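
The draft isn't in this repo, but one hedged guess at the shape of such a query, using the raw-data app's itemized receipts model, would be a DISTINCT over the contributor name columns (treat the exact fields as an approximation):

```python
# Approximate shape of the query: every distinct combination of contributor
# name parts reported on itemized receipts.
from calaccess_raw.models import RcptCd

unique_contributors = (
    RcptCd.objects
    .values("ctrib_namt", "ctrib_namf", "ctrib_naml", "ctrib_nams")
    .distinct()
    .order_by("ctrib_naml", "ctrib_namf")
)
```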

I agree that this kind of thing needs to be separate from the packaged apps somehow, or be an add-on, or something. I'm not sure I can visualize exactly how our architecture needs to work out, but the more we work on this "campaign browser" repository, the more it feels like it might need to be split in two -- a "clean data" layer that transforms the "raw data" app, and a second app that serves as the user interface.

That's just me spitballing, but the "do one thing and do it well" design philosophy of UNIX tends to be the right approach and I'd like to see how far we can go with it here.

anthonyjpesce commented 9 years ago

+1 to that, @palewire. So it seems to me a good next step is to figure out how to keep our data consistent between the raw app and any analysis/browser layer we build on top of it.

Here's my concern: for each filing cycle, a committee can file an amended filing that replaces one or more contributions it has previously reported. Any part of the record(s) could change -- the name, the amount, the zip code, etc. On the front end of the Cal-Access site, this wipes all of the contributions and replaces them under an updated Transaction ID. @maloym tells me this replaces the whole filing, not just the contributions that changed.

So, as a hypothetical: last week we pulled the raw data and ran an analysis. We used the raw data app to download the contributions, and we wrote a separate app on top of it that looked for common donors. There's another filing cycle next week that will bring in a bunch of new data, so we'll need to update our analysis to reflect the latest information.

If we pull the raw data again and only look at the new records, it's possible we've missed one or hundreds of previous contributions that have been updated with new information.

In the past, we've gotten around this in two ways:

  1. Throw the baby out with the bathwater, wipe the whole thing, and run the analysis again from scratch.
  2. Write a highly convoluted script that attempts to match the old and new records using a series of fields we hope will be unique to each contribution. Use these matches to determine which records were updated, then re-process just the updated records.

Both have obvious drawbacks. So, is there a way for the raw app to alert us when there has been an amended filing, and point out the records that have been changed? Does the data dump store the old and amended records, so we can try to link them and surface this information to an app we're writing on top of it?
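
To make the question concrete: today an analysis app has to do this bookkeeping itself, roughly like the sketch below (field names follow the raw-data app's RcptCd model; whether this fully matches how CAL-ACCESS handles amendments is exactly what's unclear):

```python
# Keep, per filing, only the rows carrying the highest amendment number, and
# flag everything else as superseded. This is the kind of logic it would be
# nice to have the raw app surface for us instead.
from django.db.models import Max
from calaccess_raw.models import RcptCd


def latest_amendments():
    """Map each filing_id to the highest amend_id seen among its receipts."""
    rows = RcptCd.objects.values("filing_id").annotate(latest=Max("amend_id"))
    return {row["filing_id"]: row["latest"] for row in rows}


def is_superseded(receipt, latest_by_filing):
    return receipt.amend_id < latest_by_filing.get(receipt.filing_id, receipt.amend_id)
```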

I'm probably off topic by now. But any solution that relies on option 1 will not work with millions of rows, and any solution that relies on a user of the raw app implementing a custom solution like option 2 is asking for trouble.

If we can fix this, it will open us up to begin work on any number of analysis apps...

rkiddy commented 9 years ago

I think, @anthonyjpesce, that you are combining two issues here. There are two separate problems. First, how do we correct the received data, clean it up and make it usable? Second, how do we preserve whatever modifications are made, since modifications need to be trackable, explainable and automatically re-applicable against new pulls from the SoS? This issue, as I understand it, is about the first problem. The second problem is real, and any solution to it needs to take the first problem into account. It also needs to be solved if we ever want to do incremental updates from the SoS-supplied data, as you mention. But these are separate problems.

FYI, I created an issue (django-calaccess-raw-data#257) about one part of the second question.

I am going to try to put together a pull request that creates the names table and the identities table. I think there also needs to be a table that captures corrections, however they are made, but we need the names and identities tables first.

And if you all want to create a data-corrections repo to do this in a separate code base, that might be a good idea. Or it can just be a separate set of commands in this project, which would be fine also.