gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

Use Spark to create dataset extracts (including mentions) #117

Closed: dolsysmith closed this issue 3 years ago

dolsysmith commented 3 years ago

The Dockerfile-spark container includes both Spark and Hadoop, but (as I understand it) the latter is used only to connect Spark with Elasticsearch via the loader container. As a result, we have no way to persist Spark output to disk (since in cluster mode, Spark requires HDFS).

By adding steps to Dockerfile-spark to configure and launch Hadoop with an external volume, I think we could do the following:

Correction: we are using Spark in standalone mode, which does not require HDFS. It does, however, require that all nodes in the cluster have access to the same storage, which can be NFS, HDFS, or S3. Since we already meet this condition with the /dataset_loading volume, I think we would need to configure the /storage volume in the same fashion. Doing so would support the following use cases:

  1. Use the loader to create and store the full extracts at ingest time.
  2. Use Spark jobs to create user-defined extracts more efficiently (see the sketch below).
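
For illustration, here is a minimal PySpark sketch of use case 2, assuming tweet JSON lines are available under a /storage path that is mounted identically on every Spark node. The dataset path, output location, and the use of the standard Twitter entities.user_mentions structure are assumptions for the example, not the project's actual layout.

```python
# Minimal sketch: build a mentions extract with Spark, reading from and
# writing to a volume (/storage) shared by all nodes in standalone mode.
# Paths and output format are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("mention-extract").getOrCreate()

# Read the full dataset from the shared volume.
tweets = spark.read.json("/storage/datasets/example-dataset/*.json")

# One row per mention: the tweet id plus the mentioned user's screen name.
mentions = (
    tweets
    .select("id_str", explode(col("entities.user_mentions")).alias("mention"))
    .select(
        col("id_str").alias("tweet_id"),
        col("mention.screen_name").alias("mentioned_screen_name"),
    )
)

# Each worker writes its own partitions; because /storage is shared,
# the combined extract is visible from any node afterward.
mentions.write.mode("overwrite").csv(
    "/storage/extracts/example-mentions", header=True
)
```

Without the shared mount, each worker would write its partitions to its own local disk and no node would see a complete extract, which is the condition the standalone-mode note above is getting at.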
dolsysmith commented 3 years ago

Closed for overlap with #128.