gwu-libraries / social-feed-manager

"Old SFM" -- manage rules and streams from social data sources, starting with twitter.
MIT License

For retweets, consider extracting text of original tweet (when present) to provide fuller context for truncated retweets #294

Open kerchner opened 9 years ago

kerchner commented 9 years ago

Retweets commonly have the form: RT @original_tweeter Original tweet text.

Twitter appears to truncate the retweet, including the "RT @original_tweeter" prefix, to 140 characters.

In tweets which Twitter "recognizes" as retweets, i.e. where mytweet["retweeted_status"] is not None, the original, non-truncated tweet text is available as mytweet["retweeted_status"]["text"].

This could in theory be used to replace the original (and truncated) tweet text portion of the retweet in item_text.
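As a rough illustration of the idea above, here is a minimal sketch that rebuilds an un-truncated retweet string from `retweeted_status`. The function name `complete_retweet_text` is hypothetical, and the `RT @name: ` prefix format (with a colon) is an assumption about how the retweet text is laid out, not something SFM does today.

```python
def complete_retweet_text(tweet):
    """Return an un-truncated 'RT @user: text' string for a recognized retweet.

    Falls back to the tweet's own text when there is no retweeted_status.
    """
    original = tweet.get("retweeted_status")
    if not original:
        return tweet["text"]
    # Assumed prefix format; adjust if the stored retweets use another form
    screen_name = original["user"]["screen_name"]
    return "RT @{}: {}".format(screen_name, original["text"])


# Made-up example tweet, shaped like the relevant part of the Twitter JSON
retweet = {
    "text": "RT @original_tweeter: A long original tweet that got cut o\u2026",
    "retweeted_status": {
        "text": "A long original tweet that got cut off in the retweet.",
        "user": {"screen_name": "original_tweeter"},
    },
}
print(complete_retweet_text(retweet))
```

The input dict is not modified, so the raw JSON stays exactly as received.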

We should verify that these always match; are there cases where the retweet text might diverge from the original tweet? If so, then replacing it might create an accuracy/integrity issue, and we might not want to overwrite it (although we would never change the raw JSON as stored - this discussion is only regarding item_text).
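One way to check this against already-collected data might be the sketch below. It strips the retweet prefix and any trailing ellipsis, then tests whether what remains is a prefix of the original text. The function name and the prefix pattern are assumptions for illustration, and the comparison ignores possible differences in whitespace or HTML entities.

```python
import re

# Assumed retweet prefix: "RT @name" with an optional colon
RT_PREFIX = re.compile(r"^RT @\w+:?\s*")
# Trailing truncation marker: a single ellipsis character or three dots
TRAILING_ELLIPSIS = re.compile(r"\s*(?:\u2026|\.\.\.)$")


def retweet_matches_original(tweet):
    """True if the retweet body equals, or is a truncated prefix of, the original.

    Returns None when the tweet is not a recognized retweet.
    """
    original = tweet.get("retweeted_status")
    if not original:
        return None
    body = RT_PREFIX.sub("", tweet["text"])
    body = TRAILING_ELLIPSIS.sub("", body)
    return original["text"].startswith(body)


truncated = {"text": "RT @a: Hello wor\u2026",
             "retweeted_status": {"text": "Hello world"}}
diverged = {"text": "RT @a: Goodbye",
            "retweeted_status": {"text": "Hello world"}}
print(retweet_matches_original(truncated), retweet_matches_original(diverged))
```

Running something like this over stored items and counting the `False` results would show how often the texts actually diverge.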

More conservative options might include:

Note also that the ["truncated"] node seems to be unreliable. As an example, this retweet truncated the original tweet, but ["truncated"] is false: http://sfm.library.gwu.edu/twitter-item/7695264/
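Since the flag cannot be trusted, truncation could instead be inferred by comparing the two texts. The following is only a heuristic sketch: the function name `looks_truncated` and the 10-character tail window are arbitrary choices for illustration.

```python
ELLIPSES = ("\u2026", "...")


def looks_truncated(tweet):
    """Guess whether a recognized retweet lost text, ignoring tweet['truncated']."""
    original = tweet.get("retweeted_status")
    if not original:
        return False
    text = tweet["text"]
    original_text = original["text"]
    # Signal 1: the retweet ends with an ellipsis that the original lacks
    added_ellipsis = text.endswith(ELLIPSES) and not original_text.endswith(ELLIPSES)
    # Signal 2: the end of the original never appears in the retweet
    tail_missing = original_text[-10:] not in text
    return added_ellipsis or tail_missing


cut = {"text": "RT @a: Hello wor\u2026", "truncated": False,
       "retweeted_status": {"text": "Hello world, this is a long message"}}
whole = {"text": "RT @a: Hello world", "truncated": False,
         "retweeted_status": {"text": "Hello world"}}
print(looks_truncated(cut), looks_truncated(whole))
```

Both example tweets carry `"truncated": False`, yet only the first is reported as truncated.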

dchud commented 9 years ago

We do something like this for is_retweet, adding a column to the csv output using our own logic to catch retweets that didn't use twitter's retweet function. Researchers asked for this.

Has someone asked us to do something like this?

At most we should add a value rather than changing anything received directly from twitter.

kerchner commented 9 years ago

@dchud yes, this was requested by the student project team from the Elliott School when they noticed that the text of some retweets is truncated (relative to the original tweet).

It sounds like you concur with the first bullet in my first comment above: at most we should add a new value to surface ["retweeted_status"]["text"] when present, and/or a new value which computes a "complete" (i.e. un-truncated) retweet using ["retweeted_status"]["text"] when present.