mysociety / popit-api

DEPRECATED - Development on PopIt has stopped and it is no longer being maintained
https://goo.gl/Vvej4Q

add a name resolution endpoint to the PopIt API to help with identifying people from scraped documents #70

Open mhl opened 10 years ago

mhl commented 10 years ago

One of the aims of PopIt was that it should help with the very common task of taking a (possibly rather mangled) version of someone's name from a transcript (typically along with the date) and returning the PopIt person it probably represents.

We have code that does this kind of name matching all over the place, with various techniques used, including:

(Please add other suggestions, these are just off the top of my head.)

It would be great if we could take the benefit of our experience with this task and make that available to anyone via PopIt - it would also be a really tangible benefit to storing people data in PopIt.

For example, leaving aside for the moment how you would implement this efficiently, I would hope that the following would work. Take this person in PopIt, whose name I was recently trying to resolve: they should be returned by a search for:

In the Scottish Parliament there are multiple ways of disambiguating people based on party or constituency - for example, names of people in a division might be represented in any of these ways:

It would be very helpful to be able to include constituency and party information as part of the query. So perhaps the name resolution endpoint could take the following input:

Or, instead of the templates parameter, you could make multiple queries and have party and area parameters which are handled specially - I quite like the idea of being able to specify the known variants of how names are presented to the matcher, though, and letting it deal with that.
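
A rough sketch of how the templates idea might behave, in Python. Everything here is an assumption made for illustration: the `Person` fields, the `{name}` / `{party}` / `{area}` placeholder syntax, the `resolve` function and the sample people are hypothetical and not part of PopIt. It only does exact matching after normalising case and whitespace; real input would also need fuzzy matching.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record shape, not the PopIt schema.
    id: str
    name: str
    party: str
    area: str


def normalise(text: str) -> str:
    """Case-fold and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.casefold().split())


def resolve(query: str, people: list[Person], templates: list[str]) -> list[Person]:
    """Return every person for whom some template renders to the query string."""
    wanted = normalise(query)
    matches = []
    for person in people:
        fields = {"name": person.name, "party": person.party, "area": person.area}
        # Render each known presentation of this person's name.
        variants = {normalise(template.format(**fields)) for template in templates}
        if wanted in variants:
            matches.append(person)
    return matches


# Made-up people who share a name and differ only by party and area.
PEOPLE = [
    Person("p1", "Jane Example", "Example Party", "Northtown"),
    Person("p2", "Jane Example", "Other Party", "Southtown"),
]
TEMPLATES = ["{name}", "{name} ({party})", "{name} ({area})"]

unique = resolve("Jane Example (Southtown)", PEOPLE, TEMPLATES)
ambiguous = resolve("jane  example", PEOPLE, TEMPLATES)
```

With these inputs `unique` holds only the second person, while `ambiguous` holds both, which is the case where the caller would need the party or area to disambiguate.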

chrismytton commented 10 years ago

Some notes from team meeting session about this:

dracos commented 10 years ago

More parlparse scripts: sp/resolvemembernames.py ni/resolveninames.py lords/resolvelordsnames.py (probably others I've forgotten).

mlandauer commented 9 years ago

:+1: this would be a wonderful addition and just the kind of thing that would tip me over the edge towards using PopIt - until I did parliamentary scraping I had no idea how painful name matching could be. Making that go away would be incredible. ;-)

martinszy commented 9 years ago

Hi, are you thinking of this as a multi-language solution? In our case, for instance, the same letter may be written with or without an accent: the name María could also appear as Maria, depending on who wrote the document. The most common characters in Spanish for this case are: á = a, é = e, í = i, ó = o and ú = u
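
One possible way to treat those pairs as equal, sketched in Python with the standard `unicodedata` module. The function names `fold_accents` and `same_name` are made up for this example, and this is only one approach, not something PopIt does. Note that it also turns ñ into n, which may be too aggressive for Spanish, where ñ is a separate letter.

```python
import unicodedata


def fold_accents(name: str) -> str:
    """Return the name with combining accent marks removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def same_name(a: str, b: str) -> bool:
    """Compare two names while ignoring accents and letter case."""
    return fold_accents(a).casefold() == fold_accents(b).casefold()
```

So `same_name("María", "Maria")` is true, while two genuinely different names still compare as different.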

pudo commented 9 years ago

Hey, just came across this topic, which is a really interesting discussion. I've been struggling with this for a while, building two services, nomenklatura and opennames.org. Neither of them is nearly as good as I'd like it to be, and I've come to the conclusion that we need to look more at multi-attribute matching the way SILK does it.

In any case, some interesting links:

pudo commented 9 years ago

Oh yeah, and: please consider implementing the OpenRefine reconciliation API, it's not a ton of stuff and really helps clean up source data!
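
To give a feel for how little is involved, here is a minimal Python sketch of the core of a reconciliation service. The request and response shapes (a JSON object of keyed queries in, a `result` list of candidates with `id`, `name`, `type`, `score` and `match` out) are written from my understanding of the reconciliation API and should be checked against its specification. The `PEOPLE` data, the `reconcile` function and the `difflib` scoring are assumptions for illustration only, and the HTTP layer is left out.

```python
import json
from difflib import SequenceMatcher

# Hypothetical stand-in for people stored in PopIt.
PEOPLE = {"p1": "Jane Example", "p2": "John Sample"}


def reconcile(queries_json: str) -> str:
    """Answer a batch of reconciliation queries with scored candidates."""
    queries = json.loads(queries_json)
    response = {}
    for key, query in queries.items():
        wanted = query["query"].casefold()
        limit = query.get("limit", 3)
        results = []
        for person_id, name in PEOPLE.items():
            # Similarity between 0 and 100; a stand-in for a real matcher.
            score = SequenceMatcher(None, wanted, name.casefold()).ratio() * 100
            results.append({
                "id": person_id,
                "name": name,
                "type": [{"id": "person", "name": "Person"}],
                "score": score,
                # Only claim a definite match when the names are identical.
                "match": wanted == name.casefold(),
            })
        # Best candidates first, truncated to the requested limit.
        results.sort(key=lambda r: r["score"], reverse=True)
        response[key] = {"result": results[:limit]}
    return json.dumps(response)


reply = json.loads(reconcile(json.dumps({"q0": {"query": "jane example"}})))
```

Here `reply["q0"]["result"]` lists the candidates for the query keyed `q0`, best score first.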

fgregg commented 9 years ago

@pudo, ouch! Sorry to hear that about dedupe, we have been trying to make the API easier. Here is the github repo that @dwillis refers to in his talk: https://github.com/dwillis/other-people

pudo commented 9 years ago

Hey @fgregg ! I meant this as a compliment, now that I read it again it sounds a bit dismissive. So sorry, I have the utmost respect for the work you've been doing on dedupe.

jpmckinney commented 9 years ago

I think @pudo is saying that dedupe is brilliant work.

akuckartz commented 9 years ago

Maybe interesting in the context of this issue:

Apache Stanbol contains several Named Entity Recognition (NER) Engines: http://stanbol.apache.org/docs/trunk/components/enhancer/engines/list.html