Store Twitter user IDs as well as screen names

mhl commented 8 years ago

This pull request adds:

A management command for looking for (and fixing) any mismatches between Twitter screen names (in a ContactDetail with contact_type 'twitter') and user IDs (in an Identifier with scheme 'twitter'). The user ID, if present, is treated as the more reliable source, since people can change screen names while their user ID stays the same.
Validation of Twitter screen names to check that the named Twitter account for someone exists before letting you save that entry.
Code that looks up and saves the Twitter user ID on adding and editing a candidate.
A twitter_user_id column to CSV output.
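
As a rough illustration of the "user ID wins" rule in the first item, here is a minimal sketch rather than the code in this PR. The function name `reconcile` and the two lookup callables (`name_for_id`, `id_for_name`) are hypothetical stand-ins for Twitter API lookups, each returning None when the account doesn't exist:

```python
def reconcile(screen_name, user_id, name_for_id, id_for_name):
    """Return (screen_name, user_id, message) after checking both values.

    name_for_id and id_for_name are hypothetical callables standing in for
    Twitter API lookups; each returns None if the account does not exist.
    """
    if user_id is not None:
        # The user ID is the more reliable value, so start from it.
        current_name = name_for_id(user_id)
        if current_name is None:
            return screen_name, user_id, "user ID {} not found".format(user_id)
        if current_name != screen_name:
            message = "screen name updated to {}".format(current_name)
            return current_name, user_id, message
        return screen_name, user_id, None
    if screen_name is not None:
        # No user ID stored yet: try to find one from the screen name.
        found_id = id_for_name(screen_name)
        if found_id is None:
            message = "screen name {} not found".format(screen_name)
            return screen_name, None, message
        return screen_name, found_id, "user ID added"
    return None, None, None
```

Any non-None message could be printed to standard output so that a cron run reports it.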

The management command (candidates_update_twitter_usernames) could reasonably be run once a day. It currently finds many screen names associated with people where the Twitter account doesn't exist, and prints them to standard output, so we should get an email if any such cases appear when this command is run from cron. (We should probably go through these by hand.)

The above changes depend on adding a new configuration option to conf/general.yml, which is TWITTER_APP_ONLY_BEARER_TOKEN.

Fixes #271

struan commented 8 years ago

I'm afraid the CSV tests are being broken by the addition of the twitter ID.

struan commented 8 years ago

Modulo the failing tests and the error message issue this looks good.

chrismytton commented 8 years ago

We have a similar problem in EveryPolitician where we need to map Twitter handles to ids. @mhl Are there any obvious steps that we could ticket to inch this towards being a more general solution that we could take advantage of in other projects such as EP?

mhl commented 8 years ago

@chrismytton Perhaps a more generically useful version of this would be an API for a service which:

1. You could tell to start tracking a particular Twitter user ID from a screen name or user ID
2. You could ask for the history of screen names associated with a user ID that's being tracked
3. You could ask for the user ID associated with a screen name at a particular point in time (typically "now", but you might be trying to find Twitter user IDs from an old dataset of screen names)

(2 might be quite fun / interesting for some people.) Is that the kind of thing you meant?

On the other hand, using the Twitter API for this is really quite easy anyway, so I'm not sure how worthwhile it would be.
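
To make the three operations of that hypothetical service a bit more concrete, here is an in-memory sketch. The class and method names are made up for illustration, and it assumes observations are recorded in chronological order:

```python
from bisect import bisect_right
from datetime import datetime


class ScreenNameTracker:
    """Hypothetical in-memory store of screen name history per user ID."""

    def __init__(self):
        # user_id -> list of (observed_at, screen_name), oldest first
        self._history = {}

    def record(self, user_id, screen_name, observed_at):
        """Start tracking a user ID, or note its screen name at a given time."""
        entries = self._history.setdefault(user_id, [])
        # Only store an entry when the screen name actually changes.
        if not entries or entries[-1][1] != screen_name:
            entries.append((observed_at, screen_name))

    def history(self, user_id):
        """Return the recorded (observed_at, screen_name) pairs for a user ID."""
        return list(self._history.get(user_id, []))

    def user_id_at(self, screen_name, when):
        """Return the user ID known to hold screen_name at time `when`, if any."""
        for user_id, entries in self._history.items():
            times = [observed_at for observed_at, _ in entries]
            # Index of the last observation made at or before `when`.
            i = bisect_right(times, when) - 1
            if i >= 0 and entries[i][1] == screen_name:
                return user_id
        return None
```

For example, after `record("123", "old", datetime(2015, 1, 1))` and `record("123", "new", datetime(2016, 1, 1))`, asking `user_id_at("old", datetime(2015, 6, 1))` finds the user ID from the older screen name.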

mhl commented 8 years ago

One other point about this PR is that it just replaces the old Twitter screen name contact detail, but since ContactDetail objects can have a start and end date, we could keep old ContactDetails but set their start_date / end_date appropriately.

(This would mean some more widespread code changes, though, so I'm inclined to think that should be a new ticket, to see if people like the idea.)
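
A sketch of what keeping the old contact details might look like. The `ContactDetail` dataclass here is a simplified stand-in rather than the real model, and `replace_screen_name` is a hypothetical helper:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ContactDetail:
    """Simplified stand-in for the real ContactDetail model."""
    contact_type: str
    value: str
    start_date: Optional[date] = None
    end_date: Optional[date] = None


def replace_screen_name(details: List[ContactDetail], new_name: str, today: date):
    """Close any open Twitter contact detail and add one for new_name."""
    open_details = [
        d for d in details
        if d.contact_type == "twitter" and d.end_date is None
    ]
    # Nothing to do if the new screen name is already the current one.
    if any(d.value == new_name for d in open_details):
        return details
    # Keep the old details, but mark them as ended today.
    for d in open_details:
        d.end_date = today
    details.append(ContactDetail("twitter", new_name, start_date=today))
    return details
```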

struan commented 8 years ago

It strikes me that a more EP way of dealing with this is just to hand something a CSV of IDs and Twitter names, have it update that CSV with Twitter IDs, and then call a webhook so you can fetch it back and ingest the list however you want.

struan commented 8 years ago

This all looks good with the fixups.

mhl commented 8 years ago

OK, clearly I'm not used to the EveryPolitician mindset. Here's my attempt at working out how this would work in practice:

You have a twitter-id-data repository, with a structure like:

index.csv
data/

If you want to add a set of screen names / user IDs to be tracked, you'd make a pull request to twitter-id-data which:

Adds a row to index.csv which includes a string identifying your project (uk-candidates, say)
Adds a CSV file at data/uk-candidates.csv which contains at minimum two columns: twitter_screen_name and twitter_user_id. You can put all the known screen names and user IDs you have already in either of those columns.

We'd periodically run a script which collects all current values from the user_id and screen_name columns in any CSV file under data and finds their corresponding screen name / user ID from the Twitter API. For each of those, it updates (adding if necessary) first_found_valid and last_found_valid columns with timestamps, filling in any missing values, or creates a new row if one of the values in the mapping has changed. (A user ID or screen name that was never found would be left with that single value in its row.) This will build up a history of the mappings between screen name <-> user ID, with first_found_valid and last_found_valid. It could also create a file which just has the current mappings called data/current/uk-candidates.csv. If any of the screen name <-> user ID mappings have changed on this run of the script, it would fire any webhooks registered for the project name.
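
The per-file update step might look roughly like this. It's only a sketch: `update_rows` and the two lookup callables are hypothetical, rows are plain dicts as you'd get from csv.DictReader, and it assumes the last row for a given user ID is the most recent one:

```python
def update_rows(rows, name_for_id, id_for_name, now):
    """Update rows in place; return True if any existing mapping changed.

    name_for_id and id_for_name are hypothetical stand-ins for Twitter API
    lookups, returning None when nothing is found. `now` is a timestamp string.
    """
    changed = False
    latest = {}  # user ID -> most recent row for that ID (later rows win)
    for row in rows:
        if not row.get("twitter_user_id") and row.get("twitter_screen_name"):
            found_id = id_for_name(row["twitter_screen_name"])
            if found_id is None:
                continue  # never found: leave the row with its single value
            row["twitter_user_id"] = found_id
        if row.get("twitter_user_id"):
            latest[row["twitter_user_id"]] = row
    for user_id, row in latest.items():
        current_name = name_for_id(user_id)
        if current_name is None:
            continue  # user ID not found: leave the row untouched
        if not row.get("twitter_screen_name"):
            row["twitter_screen_name"] = current_name
        if row["twitter_screen_name"] == current_name:
            # Mapping still valid: fill in or refresh the timestamps.
            if not row.get("first_found_valid"):
                row["first_found_valid"] = now
            row["last_found_valid"] = now
        else:
            # Screen name changed: keep the old row and add a new one.
            rows.append({
                "twitter_screen_name": current_name,
                "twitter_user_id": user_id,
                "first_found_valid": now,
                "last_found_valid": now,
            })
            changed = True
    return changed
```

A True return value would be the signal to fire the webhooks registered for that project.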

tmtmtmtm commented 8 years ago

@struan is it possible to run this management command against the results of the 2015 General Election? We still use the data from that within EP and a lot of the twitter handles are now stale, which is causing us some problems in merging with data from other sources…

struan commented 8 years ago

@tmtmtmtm You'd need to ask Sym as I don't have access to the DC server.

mysociety / yournextrepresentative

Store Twitter user IDs as well as screen names #864