pnnl / socialsim_package


Expanding shortened URLs #21

Closed sameeravithana closed 5 years ago

sameeravithana commented 5 years ago

We found that many “twitter.com” domains appear in the processed events. We believe this is due to your mapping of shortened URLs starting with “t.co/” to “twitter.com/”. For example, the shortened “t.co” form of a URL appears in many retweets. Do we have enough information to decode it as “twitter.com”? What does this URL refer to?

Example Tweet: “I'll never forget this attack, it surely was one of the darkest nights of war for me.”

RT of the above Tweet: RT @un: AjlimofxAFVR36siO2G2-g : I'll never forget this attack, it surely was one of the darkest nights of war for me. url: https://t.co/v94nb8UOfWQuPP09WpnUfQ

The Twitter API documentation describes the idea behind t.co links [1], but I didn’t find any information on how to decode them back.

[1] t.co links: https://developer.twitter.com/en/docs/basics/tco

Maria-G commented 5 years ago

Hi @SamTube405,

Thanks for raising this question. When extracting URLs linked in Twitter data, the data extraction script does not rely on the URL as it appears in the tweet text. Instead, extract_ground_truth.py pulls the expanded form of the URL from the 'expanded_url' field, which contains the actual domain rather than twitter.com, unless the link points to a twitter.com page (e.g. a status):

urls_in_text = tweets['entities'].apply(lambda x: [y['expanded_url'+name_suffix] for y in x['urls']])

For retweets that include a t.co/ link back to the Twitter status being retweeted, the expanded URL points to that status, so it is represented as a link within twitter.com.
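
To illustrate the behaviour described above, here is a minimal sketch using only the standard library. The tweet payloads, the t.co slugs, and the helper name `expanded_domains` are made up for illustration and are not taken from extract_ground_truth.py; `name_suffix` is assumed to default to an empty string here.

```python
from urllib.parse import urlparse

# Hypothetical tweet payloads, trimmed to the fields used here.
tweets = [
    # A t.co link that expands to an external site.
    {"entities": {"urls": [
        {"url": "https://t.co/abc123",
         "expanded_url": "https://example.org/story"},
    ]}},
    # A t.co link that expands to a Twitter status (e.g. in a retweet).
    {"entities": {"urls": [
        {"url": "https://t.co/def456",
         "expanded_url": "https://twitter.com/i/web/status/1"},
    ]}},
    # A tweet with no links.
    {"entities": {"urls": []}},
]


def expanded_domains(tweet, name_suffix=""):
    """Return the domain of each expanded URL in a tweet's entities."""
    urls = tweet.get("entities", {}).get("urls", [])
    # Read 'expanded_url' rather than the shortened 'url' field.
    return [urlparse(u["expanded_url" + name_suffix]).netloc for u in urls]


domains = [expanded_domains(t) for t in tweets]
print(domains)  # [['example.org'], ['twitter.com'], []]
```

Only the second tweet yields twitter.com, because its expanded URL really does point to a Twitter status; the shortened t.co form is never used to derive the domain.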