ushahidi / platform

Ushahidi Platform API version 3+
http://ushahidi.com

Comply with Twitter ToS for storing Twitter Data #3329

Closed Shadrock closed 5 months ago

Shadrock commented 6 years ago

Overview

This issue is twofold. First, I need to re-open #3073 and have those datasets exported again, only this time there must be an associated column for all Twitter data, and that column must include the Tweet ID (possibly also known as the Snowflake?). Second, we need to change the way Tweets are stored in Ushahidi because we are potentially in violation of Twitter's Developer Policy and, more specifically, our static archiving of Tweets can potentially put users at risk.

How Ushahidi Stores Twitter Data

When I set up a deployment to consume Tweets, they are displayed in the platform as static text. Moreover, the (bizarrely named) "conversation with author" pane in the data view mode displays a static list of every Tweet written by that author, including Tweets they have deleted from their account. The screenshot below can be viewed in the COMRADES deployment with this post (you must be logged in).

[Screenshot: twitter blues]

This is problematic because it puts us in possible violation of Twitter's Developer Policy, specifically section F.2: Be a Good Partner to Twitter, which sets limits on how Tweets can be shared. It specifically states that:

If you provide Twitter Content to third parties, including downloadable datasets of Twitter Content or an API that returns Twitter Content, you will only distribute or allow download of Tweet IDs, Direct Message IDs, and/or User IDs.

The, um... "bright side" is that:

You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweet Objects and/or User Objects per user of your Service, per day.

So there is apparently some room to wiggle here: if we create a non-API enabled deployment with low traffic we would likely be fine but if we have a high-traffic, API enabled deployment, we would likely violate these terms.
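To make the 50,000-per-user-per-day figure concrete, here is a minimal sketch of how a manual export could be checked against it. The class and method names are hypothetical (nothing like this exists in the Platform codebase as far as this thread shows), and the counter lives in memory only.

```python
from collections import defaultdict
from datetime import date

DAILY_TWEET_EXPORT_LIMIT = 50_000  # figure quoted from the policy text above


class ExportQuota:
    """Hypothetical in-memory counter of Tweet objects exported per user per day."""

    def __init__(self, limit=DAILY_TWEET_EXPORT_LIMIT):
        self.limit = limit
        self._used = defaultdict(int)  # (user_id, day) -> Tweets exported so far

    def try_export(self, user_id, tweet_count, day=None):
        """Record the export and return True if it fits in that day's allowance."""
        key = (user_id, day or date.today())
        if self._used[key] + tweet_count > self.limit:
            return False  # refuse the whole export rather than truncate it
        self._used[key] += tweet_count
        return True
```

A real deployment would need to persist the counts, and whether an export counts as "non-automated" is a policy question this sketch does not answer.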

What privacy impacts does it have?

The privacy impacts here are twofold. In the most disastrous scenario, an activist (or anyone really) might report to a deployment via Twitter (or simply post a Tweet that the deployment collects unbeknownst to them), then realize that the Tweet puts them at some sort of risk and delete the Tweet from their account. The Tweet would still be archived in the deployment, thus putting the activist at risk!!

The knock-on effect this is having for COMRADES is that we are also sharing deployment data among consortium partners for research purposes. We are, of course, adhering to the strict security rules of the EU, and COMRADES specifically: we secure all data with PII and won't re-publish it or share it outside the consortium until it's been redacted. But it still raises some difficult questions if we want to do things like create a publicly available training dataset for future machine learning algorithms (something we do want to do). The normal process for this would be to eliminate the actual Tweet content from a data set and provide the Tweet ID instead, leaving the task of re-constituting the Tweet content to the next person to process the data set. There isn't a lot of information out there about this practice (called "re-hydrating") but this blog post does a pretty good job and is the best I've found. What this means is that any Tweets or accounts that have been deleted in the interim can't be "re-hydrated", thus preserving the Twitter user's right to privacy - in this case to be forgotten.
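As a rough illustration of the dehydrate/re-hydrate round trip described above, here is a minimal sketch. The dict keys (`source`, `tweet_id`, `content`) are assumptions for the example, not the real Platform schema, and `lookup` stands in for whatever the next researcher uses to fetch a Tweet by ID.

```python
def dehydrate(posts):
    """Return only the Tweet IDs of Twitter-sourced posts, dropping all content.

    `posts` is assumed to be a list of dicts with the hypothetical keys
    'source', 'tweet_id' and 'content'.
    """
    return [
        {"tweet_id": p["tweet_id"]}
        for p in posts
        if p.get("source") == "twitter" and p.get("tweet_id")
    ]


def rehydrate(dehydrated, lookup):
    """Rebuild content using `lookup`, a caller-supplied function that returns
    the Tweet text for an ID, or None if the Tweet or account was deleted."""
    result = []
    for row in dehydrated:
        text = lookup(row["tweet_id"])
        if text is not None:  # deleted Tweets simply drop out of the dataset
            result.append({"tweet_id": row["tweet_id"], "content": text})
    return result
```

The point of the second function is the `None` branch: anything deleted between collection and re-hydration never reappears in the shared dataset.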

Again, there is probably some room for interpretation here since I'm currently having this conversation in England (which adheres to the GDPR) at a university that has taken a very, very conservative interpretation of GDPR to heart. Rules for a deployer in Kenya or the U.S. might be very different. It's worth pointing out that Ushahidi also adheres to the GDPR.

I've spoken with @rowasc about this in the Platform Channel on Slack and she's confirmed that we do capture the Tweet IDs in the database (why they didn't make it into the downloads requested for #3073 is a mystery... looking at you @willdoran), so it's my assumption that fixing this issue would basically involve changing the display of Tweets from static text to Twitter cards (looks like @dalezak might have a related issue with #3012?). Tweet IDs also need to figure prominently in all deployment exports. My reading of all this is that a deployment administrator could download Tweet content as well as IDs for an internal-use dataset (since they aren't distributing it), but we would probably want to configure a permission-based setting that would remove this for any "public" download/access of the data and only provide the Tweet IDs for posts that were culled from Twitter.
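The permission-based export could look something like the sketch below. The `is_admin` flag and the dict keys are hypothetical placeholders for whatever roles and post schema the Platform actually uses. Twitter-sourced posts are reduced to a whitelist of fields for non-admins rather than having fields stripped one by one, so a newly added field can't leak by accident.

```python
def export_rows(posts, is_admin):
    """Build export rows; Tweet IDs are always present for Twitter posts.

    Admins (internal use) get full rows. Everyone else gets Twitter-sourced
    posts reduced to source + tweet_id. Non-Twitter posts pass through unchanged.
    """
    rows = []
    for post in posts:
        if post.get("source") == "twitter" and not is_admin:
            rows.append({"source": "twitter", "tweet_id": post.get("tweet_id")})
        else:
            rows.append(dict(post))  # copy so callers can't mutate stored posts
    return rows
```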

Ok. That's all I have on this for now. Happy to add more as needed.

Aha! Link: https://ushahiditeam.aha.io/features/PROD-278

willdoran commented 6 years ago

@Shadrock To answer in reverse order: we used to have Tweet IDs in the CSV export, but they were apparently confusing or unwanted so they were dropped. They can be added back.

Changing from displaying/storing the message content to simply using the Tweet ID is possible; however, we'll have to change the way in which the data is used. At the moment, the message becomes an entry in a text field which becomes the description portion of a post. We could change this to instead have a "Tweet field" which uses the Tweet ID and displays the Tweet as a Tweet.
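A "Tweet field" along these lines might be sketched as follows. The class is hypothetical, and both URL forms are assumptions to verify against Twitter's current documentation: the ID-only permalink, and the oEmbed endpoint on publish.twitter.com that returns embed markup for a live Tweet. The sketch only builds the request URL; it does not call Twitter.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class TweetField:
    """Hypothetical post field that stores only the Tweet ID, never the text."""

    tweet_id: str

    def tweet_url(self):
        # ID-only permalink form; assumed to resolve to the canonical Tweet URL.
        return f"https://twitter.com/i/web/status/{self.tweet_id}"

    def oembed_url(self):
        # Request URL for the oEmbed endpoint (assumed). The client would call
        # this at display time, so a deleted Tweet simply fails to render.
        return "https://publish.twitter.com/oembed?" + urlencode({"url": self.tweet_url()})
```

For example, `TweetField("123").oembed_url()` yields a URL whose `url` parameter is the percent-encoded permalink for Tweet 123.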

There are three other pieces that I'm really concerned by

  1. We should drop storage of geographic information from tweets and expunge it retrospectively along with the other data
  2. How do we allow users to search the data? That could in and of itself be problematic.

rowasc commented 6 years ago

@willdoran

  1. Agree. This shouldn't be a huge problem at the technical level, but I can see how, for users, deleting content would be annoying and potentially problematic if they are used to getting the data (including all the content) easily in exports etc. and now they can't anymore.
  2. For search, I think it would have to become a feature that searches against the Twitter API directly, since we no longer have (in this scenario) any real content? But at the same time it's a bit weird: do we search only against Tweet IDs that we already have? Do we search by the hashtags the user inputs, which may include Tweets that have not yet been added to our DB?

Changing from displaying/storing the message content to simply using the Tweet ID is possible; however, we'll have to change the way in which the data is used. At the moment, the message becomes an entry in a text field which becomes the description portion of a post. We could change this to instead have a "Tweet field" which uses the Tweet ID and displays the Tweet as a Tweet.

This means we would look up the tweet in real time when they select a post with a tweet in it in the platform (or embed the tweet with their widgets), right?
What happens if a user copies the tweet's content into survey fields (manually), by the way? Are we responsible? Is that now something that we need to add to our TOS and actively monitor?

willdoran commented 6 years ago

@rowasc Those are good points. We'll need to arrange a discussion with @justinscherer and @jrtricafort.

rjmackay commented 6 years ago

I'm not really clear if we can use Tweet locations or not.

Geographic Data. Your license to use Twitter Content in this Agreement does not allow you to (and you will not allow others to) aggregate, cache, or store location data and other geographic information contained in the Twitter Content, except in conjunction with the Twitter Content to which it is attached. Your license only allows you to use such location data and geographic information to identify the location tagged by the Twitter Content. Any use of location data or geographic information on a standalone basis or beyond the license granted herein is a breach of this Agreement.

Twitter Content ‒ Tweets, Tweet IDs, Twitter end user profile information, Periscope Broadcasts, Broadcast IDs and any other data and information made available to you through the Twitter API or by any other means authorized by Twitter, and any copies and derivative works thereof.

It seems like if we store the location with the Tweet ID, maybe we can use it? But we probably shouldn't allow it to be exported.

dalezak commented 6 years ago

Sorry @Shadrock @willdoran @rowasc @rjmackay, I'm just catching up on this issue, since it looks like it might require some changes to the mobile app for https://github.com/ushahidi/platform/issues/3342.

Are these changes we're proposing here explicitly required by Twitter? Or are they changes we're imposing on ourselves to be a good partner?

I'm just trying to better understand the problem we are trying to solve here.

Is the issue with deleted tweets still being displayed on our platform? If so, then this can probably be handled server side with a background job that purges deleted posts? Or possibly by changing the post text to something like "Content has been removed"?
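Roughly, such a background job could be sketched like this. Everything here is hypothetical: the dict keys are placeholders for the real post schema, and `is_still_available` stands in for whatever check against Twitter the job would actually perform.

```python
REMOVED_PLACEHOLDER = "Content has been removed"


def purge_deleted_tweets(posts, is_still_available):
    """Blank out stored content for Tweets that no longer exist upstream.

    `is_still_available` is a caller-supplied function that takes a Tweet ID
    and returns False once the Tweet or its account is gone.
    Modifies `posts` in place and returns the number of posts changed.
    """
    changed = 0
    for post in posts:
        if post.get("source") != "twitter":
            continue  # only Twitter-sourced posts are affected
        if post.get("content") == REMOVED_PLACEHOLDER:
            continue  # already purged on an earlier run
        if not is_still_available(post["tweet_id"]):
            post["content"] = REMOVED_PLACEHOLDER
            post.pop("location", None)  # drop geodata along with the text
            changed += 1
    return changed
```

Note that this only narrows the window in which a deleted Tweet stays visible; it does not by itself address the question of storing Tweet content in the first place.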

Or is the issue with displaying tweet content in a custom format? If this is the problem, then likely any Twitter client like Tweetbot, Twitterrific, etc would have issues as well?

Note, the issue https://github.com/ushahidi/platform/issues/3012 is unrelated to this, that's more for improving how any content from Platform is shared on Twitter.

willdoran commented 6 years ago

@dalezak They are explicitly required by the Twitter terms of service. Under no circumstances can we copy or store any Tweet data other than Tweet ID and User ID.

dalezak commented 6 years ago

Hmm, @willdoran the paragraph that @Shadrock shared above sounds more related to downloading of content like through CSV export:

If you provide Twitter Content to third parties, including downloadable datasets of Twitter Content or an API that returns Twitter Content, you will only distribute or allow download of Tweet IDs, Direct Message IDs, and/or User IDs.

The part about "an API that returns Twitter Content" may be an issue; however, any Twitter client would also violate the ToS in this case, since that's how apps get their data.

Has anyone contacted Twitter to get clarification on what needs to be done to comply with their ToS?

I'm worried we're heading down a technical rabbit hole when it might not be necessary to comply.

willdoran commented 6 years ago

@dalezak We've reviewed the Twitter Terms of Service in detail. These changes are necessary for us to comply with them and, more importantly, with GDPR. This is a grant requirement for two of our largest ongoing projects.