lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Example counting prevalence of tweeted images #214

Closed: lintool closed this issue 8 years ago

lintool commented 8 years ago

This example works with Warcbase (on rho):

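import org.warcbase.spark.matchbox._
import org.warcbase.spark.matchbox.TweetUtils._
import org.warcbase.spark.rdd.RecordRDD._
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Load the tweets as parsed JSON values (this path is specific to rho).
val tweets = RecordLoader.loadTweets("/mnt/vol1/data_sets/elxn42/ruest-white/elxn42-tweets-combined-deduplicated.json", sc)

// `\\` finds every "media_url_https" field at any depth in a tweet;
// `\ classOf[JString]` keeps only the string values of those fields.
val counts = tweets.flatMap(tweet => tweet \\ "media_url_https" \ classOf[JString] )
    .countItems()  // tally how often each image URL occurs
    .collect()     // bring the (url, count) pairs back to the driver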

Results:

counts: Array[(org.json4s.JString#Values, Int)] = Array(
  (https://pbs.twimg.com/media/CRvL6hnVEAE_mvv.jpg,11558),
  (https://pbs.twimg.com/ext_tw_video_thumb/635933769208193025/pu/img/ZrrpFszwfGfdUZuR.jpg,8876),
  (https://pbs.twimg.com/media/CRj91ZqUcAAr4KS.jpg,7896),
  (https://pbs.twimg.com/media/CRqFEyCWEAAj9VK.jpg,6258),
  (https://pbs.twimg.com/media/CRDXt1CU8AAoiWA.jpg,6122),
  (https://pbs.twimg.com/media/CRn4WnhWEAAmaSB.jpg,5776),
  (https://pbs.twimg.com/media/CRpE6D6UEAA_8zB.png,5430),
  (https://pbs.tw...
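
For anyone who wants to see the counting step without a Spark cluster, here is a rough sketch on a local Scala collection. It assumes that `countItems()` tallies each distinct item and orders the tallies from most to least frequent, which is what the results above suggest. The object name `CountItemsSketch`, its `countItems` helper, and the sample file names are hypothetical and are not part of Warcbase.

```scala
// Hypothetical local stand-in for the counting step; not Warcbase code.
object CountItemsSketch {
  // Tally each distinct item, then order the tallies from most to least frequent.
  def countItems[T](items: Seq[T]): Seq[(T, Int)] =
    items.groupBy(identity)
      .map { case (item, occurrences) => (item, occurrences.size) }
      .toSeq
      .sortBy { case (_, n) => -n }

  def main(args: Array[String]): Unit = {
    // Made-up sample values standing in for media_url_https strings.
    val urls = Seq("a.jpg", "b.jpg", "a.jpg", "c.png", "a.jpg", "b.jpg")
    countItems(urls).foreach { case (url, n) => println(s"$url\t$n") }
  }
}
```

On a cluster the same idea is typically expressed as a map to `(item, 1)` pairs followed by a reduce by key, so the tallying is distributed rather than done in one in-memory `groupBy`.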

jrwiebe commented 8 years ago

Added to docs. Closing.