gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

Consider feasibility of shortcut to download an entire dataset #83

Closed: kerchner closed this issue 3 years ago

kerchner commented 3 years ago

Also, we'll want to compute top mentions, top hashtags, and all of the other extracts upon ingest.
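As a rough illustration of what computing these extracts at ingest could look like, here is a minimal sketch in plain Python. It is not the TweetSets implementation: the function name `top_entities` is hypothetical, and it assumes tweets are available as dicts in the Twitter API v1.1 JSON shape, with `entities.hashtags[].text` and `entities.user_mentions[].screen_name`.

```python
from collections import Counter


def top_entities(tweets, n=10):
    """Return (top_hashtags, top_mentions) as lists of (value, count) pairs.

    Hypothetical helper: assumes each tweet is a dict shaped like
    Twitter API v1.1 JSON. Tweets without an "entities" key are skipped.
    """
    hashtags = Counter()
    mentions = Counter()
    for tweet in tweets:
        entities = tweet.get("entities") or {}
        # Lowercase so that "#Vote" and "#vote" are counted together.
        for tag in entities.get("hashtags", []):
            hashtags[tag["text"].lower()] += 1
        for mention in entities.get("user_mentions", []):
            mentions[mention["screen_name"].lower()] += 1
    return hashtags.most_common(n), mentions.most_common(n)
```

The same counting could be done in one pass over the dataset during loading, so the results are ready before anyone requests a full download.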

kerchner commented 3 years ago

Note to self (@kerchner) to notify David Broniatowski when this feature is deployed.

dolsysmith commented 3 years ago

It was fairly straightforward to implement this in Spark running in local mode. Unfortunately, the TweetSets Spark loader runs in cluster mode, which is designed for reading and writing files on HDFS. (The Spark loader leverages the fact that Elasticsearch has a Hadoop API, which I assume emulates HDFS for data transfer.) I'm still looking for an expedient way to transfer the extracts to TweetSets local storage.

An alternative would be to use the existing code in tasks.py to create the extracts. However, that code runs as a Celery task called from tweetset_server.py, so we would probably need to write an additional script to trigger the task from within the loader container.

lwrubel commented 3 years ago

Add GA event tracking for downloads of the full dataset files.

lwrubel commented 3 years ago

Closed with #112.