mysociety / pombola

GNU Affero General Public License v3.0

develop a new name resolution component #1748

Open mhl opened 9 years ago

mhl commented 9 years ago

Our plan for improving name resolution in Pombola (which has been a persistent source of hard-to-fix bugs) was to add name resolution to PopIt. @struan worked on this (I think on this branch), but now that we've decided to stop development of PopIt, this isn't going to be a solution for Pombola any more.

I think the idea of a hosted service to provide a name resolution API based on Popolo data is a good one. (The popit-resolver package used in Pombola has something of the same philosophy.)

Both popit-resolver and @struan's work in popit-api use a similar approach - generate versions of a person's names based on the Popolo data for them (e.g. the initials field, the other names, their party membership, etc.) and store them in Elasticsearch. Retrieval of possible matching people is then a matter of an Elasticsearch query.
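To make the "generate versions of a person's names" step concrete, here is a minimal sketch in plain Python. It is not the popit-resolver or popit-api code: the function name `name_variants` is hypothetical, the input is assumed to be a dict loosely shaped like a Popolo person (`name`, `other_names`, `honorific_prefix`), and treating the last word of a name as the family name is a simplifying assumption that will be wrong for many real names. Party-based variants and the Elasticsearch indexing step are left out.

```python
def name_variants(person):
    """Return a set of lower-cased strings that might refer to `person`.

    `person` is assumed to be a dict loosely following the Popolo person
    shape; this is an illustrative sketch, not the popit-resolver logic.
    """
    variants = set()
    names = [person.get("name", "")]
    names += [other.get("name", "") for other in person.get("other_names", [])]

    for full_name in names:
        parts = full_name.split()
        if not parts:
            continue
        variants.add(" ".join(parts))
        if len(parts) > 1:
            # Assumption: the last word is the family name.
            given, family = parts[:-1], parts[-1]
            initials = " ".join(word[0] for word in given)
            variants.add(f"{initials} {family}")
            variants.add(family)

    prefix = person.get("honorific_prefix")
    if prefix:
        # Add a prefixed copy of every variant collected so far.
        variants |= {f"{prefix} {variant}" for variant in set(variants)}

    return {variant.lower() for variant in variants}
```

Each string in the returned set would then be stored as a searchable document pointing back at the person, so that a lookup becomes a query over the variants rather than over the canonical name alone.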

A hosted version of this service would potentially be helpful to other groups as well - a new instance could be created based on a URL with Popolo data in some serialization (and regularly sync from that URL) - after that, they'd have a simple API to help do fuzzy matching of names from parliamentary transcripts.

This service could use django-popolo to store the Popolo data; if so, in a post-#1594 world, this service could have two modes of use like SayIt's - either as a Django application used directly, or the hosted service used over an HTTP-based API.

As a developer trying to build a parliamentary monitoring site
I want to be able to easily find the person referred to by a name (+ optional party) in a parliamentary transcript on a particular date
So that I can find all the speeches by particular politicians
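One way to read that user story as an interface is sketched below. Everything here is an assumption for illustration: the `resolve` function, the `PEOPLE` records and their field names are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever fuzzy matching the real component would use. The point is only to show the three inputs (name, optional party, date) narrowing the candidates.

```python
from datetime import date
from difflib import SequenceMatcher

# Hypothetical sample data: two people sharing a name, in office at different times.
PEOPLE = [
    {"id": "p1", "names": ["John Smith", "J Smith"],
     "memberships": [{"party": "ABC", "start": date(2008, 1, 1), "end": date(2012, 12, 31)}]},
    {"id": "p2", "names": ["John Smith"],
     "memberships": [{"party": "XYZ", "start": date(2013, 1, 1), "end": None}]},
]


def resolve(name, on_date, people, party=None, threshold=0.8):
    """Return (score, person_id) pairs for plausible matches, best first."""
    target = name.lower()
    candidates = []
    for person in people:
        # Keep only memberships that were current on the transcript date.
        current = [
            m for m in person["memberships"]
            if m["start"] <= on_date and (m["end"] is None or on_date <= m["end"])
        ]
        if not current:
            continue
        # If a party was given, require a current membership of that party.
        if party is not None and all(m["party"] != party for m in current):
            continue
        # Score against every known name and keep the best similarity ratio.
        score = max(
            (SequenceMatcher(None, target, known.lower()).ratio()
             for known in person["names"]),
            default=0.0,
        )
        if score >= threshold:
            candidates.append((score, person["id"]))
    return sorted(candidates, reverse=True)
```

With this data, `resolve("john smith", date(2010, 5, 1), PEOPLE)` only considers `p1`, because `p2` has no membership on that date. The date filter does the disambiguation that the name alone cannot.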

Related: https://github.com/mysociety/pombola/issues/1535

tmtmtmtm commented 9 years ago

Is Elasticsearch actually the right tool for this, or was it just the most convenient originally, because all the PopIt data was already there? This feels like quite a heavyweight approach, and I'm curious as to whether that's because a simpler version would just end up reimplementing lots of Elasticsearch anyway, or whether there might be a better approach now that we can rethink it from a clean slate.

struan commented 9 years ago

I'm not sure if it's the right tool, but it was certainly picked for PopIt because we were already using it. It does have a chunk of match-scoring functionality built in, although it's not clear to me that that's a blessing: it was fine until the magic didn't work as you expected, and then working out why was a pain.