Closed: kerchner closed this issue 3 years ago
Note to self (@kerchner) to notify David Broniatowski when this feature is deployed.
It was fairly straightforward to implement this in Spark running in local mode; unfortunately, the TweetSets Spark loader runs in cluster mode, which is designed for reading and writing files to HDFS. (The loader relies on Elasticsearch's Hadoop API, which, as I understand it, emulates HDFS for data transfer.) I'm still looking for an expedient way to transfer the extracts to TweetSets' local storage.
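The local-vs-cluster distinction above roughly comes down to the `--master` setting passed to `spark-submit`. This is illustrative only; the jar and script names below are placeholders, not the actual TweetSets invocation:

```shell
# Local mode: driver and executors share one machine, so the job can
# write extracts straight to the local filesystem.
spark-submit \
  --master "local[*]" \
  --jars elasticsearch-hadoop.jar \
  tweetset_loader.py

# Cluster mode: executors run on other nodes, so a file:// path on the
# driver is not a shared destination -- output is expected on HDFS.
spark-submit \
  --master spark://spark-master:7077 \
  --jars elasticsearch-hadoop.jar \
  tweetset_loader.py
```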
An alternative would be to use the existing code in tasks.py to create the extracts. However, that code runs as a Celery task called from tweetset_server.py, so it would probably be necessary to write an additional script to trigger the task in the context of the loader container.
Add GA event tracking for downloads of the full dataset files.
Closed with #112.
Also, we'll want to compute top mentions, top hashtags, and all of the other extracts upon ingest.
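Wherever that computation ends up running, the aggregation itself is straightforward. Ignoring Spark, the top-mentions/top-hashtags extract amounts to something like this sketch over v1.1-style tweet JSON (the function name and tweet samples are illustrative):

```python
from collections import Counter

def top_entities(tweets, n=10):
    """Count mentions and hashtags across tweets (v1.1-style JSON)
    and return the n most common of each."""
    mentions, hashtags = Counter(), Counter()
    for tweet in tweets:
        entities = tweet.get("entities", {})
        mentions.update(m["screen_name"].lower()
                        for m in entities.get("user_mentions", []))
        hashtags.update(h["text"].lower()
                        for h in entities.get("hashtags", []))
    return mentions.most_common(n), hashtags.most_common(n)

# Illustrative sample tweets:
tweets = [
    {"entities": {"user_mentions": [{"screen_name": "kerchner"}],
                  "hashtags": [{"text": "TweetSets"}]}},
    {"entities": {"user_mentions": [{"screen_name": "kerchner"}],
                  "hashtags": [{"text": "Spark"}]}},
]
top_mentions, top_hashtags = top_entities(tweets)
print(top_mentions)   # [('kerchner', 2)]
print(top_hashtags)   # [('tweetsets', 1), ('spark', 1)]
```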