medianeuroscience / icore

This project introduces the interface for Communication Research (iCoRe) to access, explore, and analyze the Global Database of Events, Language and Tone (GDELT; Leetaru & Schrodt, 2013). GDELT provides a vast, open-source, and constantly updated repository of online news and event metadata collected from tens of thousands of news outlets around the world. Despite GDELT’s promise for advancing communication science, its massive scale and complex data structures have hindered efforts of communication scholars aiming to access and analyze it. We thus developed iCoRe, an easy-to-use web interface that (a) provides fast access to the data available in GDELT, (b) shapes and processes GDELT data for theory-driven applications within communication research, and (c) enables replicability through transparent query and analysis protocols.
https://icore.medianeuroscience.org

API for sharing data #2

Open fhopp opened 4 years ago

fhopp commented 4 years ago

We need to find a good API that lets us obtain sharing data of newspaper articles.

fhopp commented 4 years ago

Check out Facebook's Graph API: https://developers.facebook.com/docs/graph-api
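For reference, a minimal sketch of what a share-count lookup for one URL could look like through the Graph API, assuming the URL node's `engagement` field is available to our app (the field name, response shape, and token are assumptions to verify against the current Graph API docs):

```python
import requests

# Hypothetical sketch: query the Graph API "URL node" for engagement counts.
# ACCESS_TOKEN and the `engagement` field are assumptions to check against
# Facebook's current documentation.
ACCESS_TOKEN = "YOUR_APP_TOKEN"
article_url = "https://www.example.com/some-news-article"

resp = requests.get(
    "https://graph.facebook.com/",
    params={
        "id": article_url,
        "fields": "engagement",
        "access_token": ACCESS_TOKEN,
    },
)
resp.raise_for_status()
print(resp.json())  # e.g. {"engagement": {"share_count": ...}, "id": ...}
```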

fhopp commented 4 years ago

@Yibeichan and @fhopp discussed today that we can get very informative data via the Twitter API. For now, we are going to focus on "shares" on Twitter and later return to the idea of shares on Facebook. The single unit for scraping Twitter data will still be a single URL. However, we will create two extra tables in Cassandra: 1) twitter_shares and 2) twitter_tweets.

For (1), each row will be a unique URL, and its columns will hold the number of unique users that mentioned this URL along with the total retweet, like, and comment counts for this URL.

For (2), each row will be a unique tweet that mentioned this URL, along with metadata for that tweet such as its text and how many likes, replies, retweets, etc. it has received.
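As a concrete starting point, the two tables could be defined roughly as follows (a sketch via the Python `cassandra-driver`; the keyspace name and column names are placeholders, not a final schema):

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra node; the "icore" keyspace is a placeholder.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS icore
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("icore")

# (1) One row per URL with aggregate share statistics.
session.execute("""
    CREATE TABLE IF NOT EXISTS twitter_shares (
        url text PRIMARY KEY,
        unique_users int,    -- unique users that mentioned this URL
        total_retweets int,
        total_likes int,
        total_comments int
    )
""")

# (2) One row per tweet that mentioned a URL, plus tweet metadata.
session.execute("""
    CREATE TABLE IF NOT EXISTS twitter_tweets (
        url text,
        tweet_id bigint,
        text text,
        likes int,
        replies int,
        retweets int,
        PRIMARY KEY (url, tweet_id)
    )
""")
```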

Next step for @Yibeichan is to think about how we can retrieve data for this many URLs. @musainayatmalik will help with implementing the "Twitter scraping" pipeline in PySpark.
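A rough shape for that pipeline might look like the following (a sketch only; `fetch_share_counts` is a hypothetical helper standing in for whichever sharing API we end up using):

```python
from pyspark.sql import SparkSession

def fetch_share_counts(url):
    """Hypothetical helper: call the sharing API for one URL and
    return a row matching the twitter_shares table."""
    # ... API call goes here ...
    return (url, 0, 0, 0, 0)

spark = SparkSession.builder.appName("twitter-scraping").getOrCreate()

# Distribute the URLs across the cluster and fetch counts in parallel.
urls = ["https://www.example.com/article-1", "https://www.example.com/article-2"]
rows = spark.sparkContext.parallelize(urls).map(fetch_share_counts)

df = rows.toDF(["url", "unique_users", "total_retweets", "total_likes", "total_comments"])
df.show()
```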

yibeichan commented 4 years ago

Several ways to get historical Twitter data (sorted):

1. http://www.orgneat.com/ (free). It doesn’t allow downloading tweets, but that would be OK; we just need the retweet number. And if we get tweet IDs, I can try other ways to get the tweets.
2. Use multiple public Twitter databases, choose certain topics or combine them, search for news links among the tweets, and get share counts. Database: https://www.docnow.io/catalog/ (free). Most of these databases are topic/event/keyword-specific.
3. https://github.com/Jefferson-Henrique/GetOldTweets-python (free). I used it before, but it doesn’t contain deleted data. We can give it a try.
4. https://codecanyon.net/item/historical-tweets/22120633 ($14 purchase). It seems good; it’s an app.
5. https://sifter.texifter.com/ — this website has the complete, undeleted historical Twitter data between 01/14/2014 and 09/29/2018, and it can be cleaned with https://discovertext.com/ ($24/month). However, we need to contact Twitter to get approval to use the data.
6. https://www.trackmyhashtag.com/historical-twitter-data (paid). This one retrieves historical data based on hashtags.
7. https://www.tweetbinder.com/payments/#/process-payment/historical (one-time purchase?). Historical data, limited to 140,000 tweets.

fhopp commented 4 years ago

@Yibeichan, can we close this now? We are using sharedcount.com to get Facebook data, and we will pay to get the Twitter data, correct? Can you open an issue for the Twitter data and comment with the link to the company so I can get started on the application? Thanks!
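For the record, fetching Facebook counts through SharedCount is a one-call lookup per URL, roughly like this (a sketch; the endpoint path and response fields are assumptions to check against SharedCount's docs, and the API key is a placeholder):

```python
import requests

# Hypothetical sketch of a SharedCount lookup; verify the endpoint and
# response fields against the current SharedCount documentation.
API_KEY = "YOUR_SHAREDCOUNT_KEY"
article_url = "https://www.example.com/some-news-article"

resp = requests.get(
    "https://api.sharedcount.com/v1.0/",
    params={"url": article_url, "apikey": API_KEY},
)
resp.raise_for_status()
print(resp.json())  # Facebook engagement counts for the URL
```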