add a name resolution endpoint to the PopIt API to help with identifying people from scraped documents

mhl commented 10 years ago

One of the aims of PopIt was that it should be able to help with the very common task of taking a (possibly rather mangled) version of someone's name from a transcript (typically also with the date) and returning the PopIt person it probably represents.

We have code that does this kind of name matching all over the place, with various techniques used, including:

Generating all possible versions of initials of someone's name to try matching any form
Trying to match versions of the name which include prefix honorifics or miss them out
Ordering by Levenshtein distance to any form of the name
Restricting by party membership or constituency representation if mentioned with the name
Restricting the set of matches to people who were members of the house of parliament (or other organisation) that the transcript was from (and on the particular date of the transcript)
Trying different orders of names (since sometimes they get reversed)
Storing variants of the name in Elasticsearch and using its relevance criteria to rank name matches
etc.
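
As a rough illustration of the "ordering by Levenshtein distance to any form of the name" idea, here is a minimal sketch. The `people` dictionary shape and the `rank_candidates` helper are made up for this example and are not taken from any of our existing matching code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def rank_candidates(query, people):
    """Rank people by their closest known name form.

    `people` is a hypothetical mapping of person id -> list of name forms.
    Returns a list of (distance, person_id), best match first.
    """
    folded = query.casefold()
    scored = []
    for person_id, forms in people.items():
        best = min(levenshtein(folded, form.casefold()) for form in forms)
        scored.append((best, person_id))
    return sorted(scored)
```

In practice you would probably restrict the candidate set first (by organisation, date, party and so on) and only then rank, since comparing against every name form of every person gets slow quickly.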

(Please add other suggestions, these are just off the top of my head.)

It would be great if we could take the benefit of our experience with this task and make that available to anyone via PopIt - it would also be a really tangible benefit to storing people data in PopIt.

For example, leaving aside for the moment how you would implement this efficiently, I would hope that the following would work. Take this person in PopIt, whose name I was recently working on resolving: they should be returned by a search for any of:

J Malema (i.e. if you infer just what the first initial would be)
Julius Malema (i.e. taking only the first of given names)
J S Malema (i.e. include all initials, space separated)
JS Malema (i.e. all initials, no spaces separating them)
Sello Malema (some people go by a given name other than the first one)
J Sello Malema (in case multiple surnames have been misentered)
Malema Julius (names reversed, possibly comma-separated)
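
The forms listed above could be generated along these lines. This is only a sketch: the `name_variants` function and its arguments are invented for illustration, it assumes at least one given name, and it ignores honorifics entirely.

```python
def name_variants(given_names, family_name):
    """Return a set of plausible written forms of a person's name.

    `given_names` is a non-empty list, e.g. ["Julius", "Sello"].
    """
    initials = [name[0] for name in given_names]
    given_parts = {
        " ".join(given_names),  # all given names in full
        initials[0],            # first initial only
        " ".join(initials),     # all initials, space separated
        "".join(initials),      # all initials, run together
    }
    given_parts.update(given_names)  # each given name on its own
    if len(given_names) > 1:
        # initials for the leading names, last given name in full
        given_parts.add(" ".join(initials[:-1] + [given_names[-1]]))

    variants = set()
    for given in given_parts:
        variants.add(f"{given} {family_name}")   # normal order
        variants.add(f"{family_name} {given}")   # reversed
        variants.add(f"{family_name}, {given}")  # reversed, comma separated
    return variants
```

Calling `name_variants(["Julius", "Sello"], "Malema")` yields every form in the list above, plus the reversed versions of each. Whether you precompute these and index them, or expand the query at search time, is a separate decision.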

In the Scottish Parliament there are multiple ways of disambiguating people based on party or constituency - for example, names of people in a division might be represented in any of these ways:

https://github.com/mysociety/parlparse/blob/master/pyscraper/sp/parse-official-reports-new.py#L107-L145

It would be very helpful to be able to include constituency and party information as part of the query. So perhaps the name resolution endpoint could take the following input:

date (optional, if not given, match from any time)
organisation (optional; if supplied but date isn't, they must have been a member of it at any time - if supplied and date is as well, they must have been a member on that date)
input_string (e.g. Baillie, Jackie (Dumbarton) (Lab))
templates (optional) - perhaps an array of regular expressions, where the names of matching groups can be things like initials, last_name, membership__organization__name (name of a party, say) or membership__area__name, taking inspiration from the way Django's ORM joins models
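
To make the templates idea a bit more concrete, here is a sketch using Python named groups. Only the first pattern corresponds to the example input string above; the other two patterns, the `first_name` group and the `parse_input_string` helper are assumptions made up for illustration.

```python
import re

# Tried in order, most specific first. Group names follow the
# double-underscore convention suggested above.
TEMPLATES = [
    # e.g. "Baillie, Jackie (Dumbarton) (Lab)"
    re.compile(
        r"^(?P<last_name>[^,]+),\s*(?P<first_name>[^(]+?)\s*"
        r"\((?P<membership__area__name>[^)]+)\)\s*"
        r"\((?P<membership__organization__name>[^)]+)\)$"
    ),
    # hypothetical: "Baillie, Jackie (Lab)"
    re.compile(
        r"^(?P<last_name>[^,]+),\s*(?P<first_name>[^(]+?)\s*"
        r"\((?P<membership__organization__name>[^)]+)\)$"
    ),
    # hypothetical: "Baillie, Jackie"
    re.compile(r"^(?P<last_name>[^,]+),\s*(?P<first_name>.+)$"),
]


def parse_input_string(input_string, templates=TEMPLATES):
    """Return the named groups of the first matching template, or None."""
    text = input_string.strip()
    for template in templates:
        match = template.match(text)
        if match:
            return match.groupdict()
    return None
```

The resulting dictionary keys could then be turned into filters on the person, their memberships and the related organisations or areas, much as Django does with keyword arguments to `filter()`.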

Or instead of the templates parameter, you could make multiple queries, and instead have party and area parameters which are handled specially - I quite like the idea of being able to specify the known variants of how names are presented to the matcher, though, and letting it deal with that.
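
For the date and organisation parameters, the restriction might look something like the sketch below. The record shape (people with a `memberships` list carrying `organization_id`, `start_date` and `end_date` as `date` objects) is an assumption for the example, as is the choice that a date without an organisation means "had some membership on that date".

```python
from datetime import date


def membership_matches(membership, organisation=None, on_date=None):
    """True if a membership satisfies the optional organisation/date limits."""
    if organisation is not None and membership["organization_id"] != organisation:
        return False
    if on_date is not None:
        start = membership.get("start_date")
        end = membership.get("end_date")
        # A missing start or end date is treated as open-ended.
        if start is not None and on_date < start:
            return False
        if end is not None and on_date > end:
            return False
    return True


def filter_people(people, organisation=None, on_date=None):
    """Keep people with at least one membership matching the limits.

    With neither limit given, everyone is kept (match from any time).
    """
    if organisation is None and on_date is None:
        return list(people)
    return [
        person for person in people
        if any(membership_matches(m, organisation, on_date)
               for m in person["memberships"])
    ]
```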

chrismytton commented 10 years ago

Some notes from team meeting session about this:

First version should generate all possible versions of initials, honorific prefixes/suffixes, other_names, titles etc. from someone's name and put them into a separate Elasticsearch index specifically for name resolution.
PMG's committee parsing is the primary use-case for this
Should eventually aim to eliminate popit_resolver
Measure success by matching against a predefined set of tests
TheyWorkForYou could use this if the MP data was in a PopIt (current parlparse script).

dracos commented 10 years ago

More parlparse scripts: sp/resolvemembernames.py ni/resolveninames.py lords/resolvelordsnames.py (probably others I've forgotten).

mlandauer commented 9 years ago

:+1: this would be a wonderful addition and just the kind of thing that would tip me over the edge towards using PopIt - until I did parliamentary scraping I had no idea how painful name matching could be. Making that go away would be incredible. ;-)

martinszy commented 9 years ago

Hi, are you thinking of this as a multi-language solution? In our case, for instance, the same letter could be written with or without an accent: the name María could also appear as Maria, depending on who wrote the document. The most common cases in Spanish are: á = a, é = e, í = i, ó = o and ú = u
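
One common way to handle this, sketched here with Python's standard `unicodedata` module, is to decompose each character and drop the combining marks before comparing. The `fold_accents` and `same_name` helpers are just illustrative names. Note that this also folds ñ to n, which may or may not be what you want for Spanish names.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks, so that 'María' becomes 'Maria'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_name(a, b):
    """Compare two names ignoring accents and letter case."""
    return fold_accents(a).casefold() == fold_accents(b).casefold()
```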

pudo commented 9 years ago

Hey, just came across this topic, which is a really interesting discussion. I've been struggling with this for a while, building two services, nomenklatura and opennames.org. Neither of them is nearly as good as I'd prefer them to be, and I've come to the conclusion that we need to look more at multi-attribute matching the way SILK does it.

In any case, some interesting links:

Notes from a SRCCON session, Hell is data about other people which includes references to many cool libraries used by the NYT.
Refine's Clustering in Depth documentation.
DataMade's dedupe library which is too smart for mankind by a depressing margin.

pudo commented 9 years ago

Oh yeah, and: please consider implementing the OpenRefine reconciliation API, it's not a ton of stuff and really helps clean up source data!
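
For anyone who hasn't seen it, this is roughly the shape of a batch reconciliation exchange as I understand it: the client sends a `queries` object keyed by query id, and the service answers with a `result` list of candidates per key. The sketch below only builds the response dictionary. The `match_name` callable and the `Person` type id are hypothetical, and the exact field details should be checked against the reconciliation API documentation.

```python
def reconcile(queries, match_name):
    """Build a reconciliation-style response for a batch of queries.

    `queries` looks like {"q0": {"query": "J Malema", "limit": 3}}.
    `match_name` is a hypothetical callable returning a list of
    (person_id, name, score) tuples, best first.
    """
    response = {}
    for key, query in queries.items():
        limit = query.get("limit", 3)
        candidates = match_name(query["query"])[:limit]
        response[key] = {
            "result": [
                {
                    "id": person_id,
                    "name": name,
                    "type": [{"id": "Person", "name": "Person"}],
                    "score": score,
                    # Left False here; a real service decides when a
                    # candidate is certain enough to auto-match.
                    "match": False,
                }
                for person_id, name, score in candidates
            ]
        }
    return response
```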

fgregg commented 9 years ago

@pudo, ouch! Sorry to hear that about dedupe, we have been trying to make the API easier. Here is the github repo that @dwillis refers to in his talk: https://github.com/dwillis/other-people

pudo commented 9 years ago

Hey @fgregg ! I meant this as a compliment, now that I read it again it sounds a bit dismissive. So sorry, I have the utmost respect for the work you've been doing on dedupe.

jpmckinney commented 9 years ago

I think @pudo is saying that dedupe is brilliant work.

akuckartz commented 9 years ago

Maybe interesting in the context of this issue:

Apache Stanbol contains several Named Entity Recognition (NER) Engines: http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list.html

mysociety / popit-api

add a name resolution endpoint to the PopIt API to help with identifying people from scraped documents #70