For retweets, consider extracting text of original tweet (when present) to provide fuller context for truncated retweets

kerchner commented 9 years ago

Retweets commonly have the form: RT @original_tweeter Original tweet text.

Twitter appears to truncate the retweet, including the "RT @original_tweeter" prefix, to 140 characters.

In tweets which Twitter "recognizes" as retweets, i.e. where mytweet["retweeted_status"] is not None, the original, non-truncated tweet text is available as mytweet["retweeted_status"]["text"].

This could in theory be used to replace the original (and truncated) tweet text portion of the retweet in item_text.
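A minimal sketch of that idea, assuming the tweet is already parsed into a Python dict and that the original author's handle is available at ["retweeted_status"]["user"]["screen_name"]. The function name full_retweet_text and the "RT @name: text" layout are illustrative choices, not existing SFM code:

```python
def full_retweet_text(tweet):
    """Return an un-truncated version of a retweet's text when possible.

    Falls back to the tweet's own text when Twitter did not
    mark it as a retweet (no "retweeted_status").
    """
    original = tweet.get("retweeted_status")
    if not original:
        return tweet["text"]
    # Assumption: rebuild the conventional "RT @name: text" form.
    screen_name = original["user"]["screen_name"]
    return "RT @{}: {}".format(screen_name, original["text"])


retweet = {
    "text": "RT @alice: Hello wor\u2026",
    "retweeted_status": {
        "text": "Hello world",
        "user": {"screen_name": "alice"},
    },
}
print(full_retweet_text(retweet))
```

The raw JSON is left untouched; only the derived value changes.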

We should verify that these always match; are there cases where the retweet text might diverge from the original tweet? If so, then replacing it might create an accuracy/integrity issue, and we might not want to overwrite it (although we would never change the raw JSON as stored - this discussion is only regarding item_text).
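One way to check this over stored tweets might look like the sketch below. It strips the "RT @name" prefix and any trailing ellipsis from the retweet, then tests whether what remains is a prefix of the original text. The function name and the prefix pattern are assumptions for illustration, and differences such as HTML entity encoding or rewritten links could still cause false mismatches:

```python
import re

# Assumed shape of the retweet prefix: "RT @name" with an optional colon.
RT_PREFIX = re.compile(r"^RT @\w+:?\s+")


def retweet_matches_original(tweet):
    """Return True/False for recognized retweets, None for other tweets."""
    original = tweet.get("retweeted_status")
    if not original:
        return None
    body = RT_PREFIX.sub("", tweet["text"], count=1).rstrip()
    # Drop a trailing ellipsis left behind by truncation, if any.
    for marker in ("\u2026", "..."):
        if body.endswith(marker):
            body = body[: -len(marker)].rstrip()
            break
    return original["text"].startswith(body)
```

Running this across the existing items and counting the False results would show how often the two texts actually diverge.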

More conservative options might include:

- Adding a new field (e.g. original_text_of_retweet, or something to that effect) to which we extract ["retweeted_status"]["text"], and making it available (optionally?) in extracts. We could include one column with the original tweet text and another column with our best guess at "fixing" the retweet.
- Adding a flag to the extract commands to indicate whether or not to "fix" item_text. This still entails the risk that extracts then include item_text values that don't match item_text in our database.
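The first option could be sketched roughly as follows. This is not the existing extract command; write_extract, the include_original flag, and the column layout are hypothetical, and item_text is always written unchanged:

```python
import csv
import io


def write_extract(tweets, out, include_original=False):
    """Write a CSV extract, optionally adding the original tweet text."""
    writer = csv.writer(out)
    header = ["id", "item_text"]
    if include_original:
        header.append("original_text_of_retweet")
    writer.writerow(header)
    for tweet in tweets:
        row = [tweet["id"], tweet["text"]]
        if include_original:
            # Empty string when the tweet is not a recognized retweet.
            original = tweet.get("retweeted_status") or {}
            row.append(original.get("text", ""))
        writer.writerow(row)


sample = [
    {"id": 1, "text": "RT @alice: Hello wor\u2026",
     "retweeted_status": {"text": "Hello world"}},
    {"id": 2, "text": "plain"},
]
buf = io.StringIO()
write_extract(sample, buf, include_original=True)
print(buf.getvalue())
```

Keeping the new value in its own column avoids the mismatch risk described in the second option.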

Note also that the ["truncated"] node seems to be unreliable. As an example, this retweet truncated the original tweet, but ["truncated"] is false: http://sfm.library.gwu.edu/twitter-item/7695264/
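Since the flag can't be trusted, a check that ignores ["truncated"] and compares the two texts directly may be more dependable. A rough sketch, with a hypothetical function name and the assumption that an intact retweet contains the original text verbatim:

```python
def retweet_was_shortened(tweet):
    """Guess whether a recognized retweet lost part of the original text.

    Ignores the "truncated" key entirely and compares texts instead.
    """
    original = tweet.get("retweeted_status")
    if not original:
        return False
    # An intact retweet should still contain the whole original text.
    return original["text"] not in tweet["text"]
```

Any difference in how the two texts encode entities or links would also show up as "shortened", so the result is a hint rather than proof.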

dchud commented 9 years ago

We do something like this for is_retweet, adding a column to the CSV output that uses our own logic to catch retweets that didn't use Twitter's retweet function. Researchers asked for this.
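For readers unfamiliar with that column, the general shape of such a check is sketched below. This is an illustration only and not the logic SFM actually uses; the pattern for manual retweets is an assumption:

```python
import re

# Assumed marker of a manual retweet: "RT @name" at the start of the
# text or after whitespace (e.g. "So true! RT @bob: ...").
MANUAL_RT = re.compile(r"(?:^|\s)RT\s+@\w+")


def is_retweet(tweet):
    """True for Twitter-recognized retweets and for manual "RT @name" ones."""
    if tweet.get("retweeted_status"):
        return True
    return bool(MANUAL_RT.search(tweet.get("text", "")))
```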

Has someone asked us to do something like this?

At most we should add a value rather than changing anything received directly from Twitter.

kerchner commented 9 years ago

@dchud yes, this was requested by the student project team from the Elliott School when they noticed that the text of some retweets is truncated (relative to the original tweet).

It sounds like you concur with the first bullet in the first comment above: at most we should add a new value that surfaces ["retweeted_status"]["text"] when present, and/or a new value that computes a "complete" (i.e. un-truncated) retweet from ["retweeted_status"]["text"] when present.

gwu-libraries / social-feed-manager, issue #294