gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.

T140 path to extracts: #140 #153

Closed dolsysmith closed 2 years ago

dolsysmith commented 3 years ago

Features

Setup

  1. The full_datasets folder must be a shared NFS mount available to all nodes in the Spark cluster; a sketch of one way to set this up follows this list. (On my VM, I moved the tweetsets_data folder to /storage on both the primary and secondary nodes, then mapped the full_datasets folder on the primary to the same location on the secondary VM.)
    • Note: you don't want to share the entire tweetsets_data folder, as that will likely cause problems for Elasticsearch.
    • I initially tried mapping /storage/tweetsets_data/full_datasets (VM 1) to a folder on VM 2 in /home/dsmith, but that did not seem to work.
  2. Update your .env files accordingly with the new paths, if necessary.
  3. On your non-primary nodes, update docker-compose.yml as follows (a compose sketch covering this step and the next also follows this list):
    • Add the following line to the spark-worker section, under volumes: ${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets
  4. On the primary node, update loader.docker-compose.yml as follows:
    • Under volumes, add ${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets
    • Under environment, add
      • SPARK_MAX_FILE_SIZE
      • SPARK_PARTITION_SIZE
    • Optional: add the following to expose the Spark jobs UI:
      ports:
        - 4040:4040
  5. To your primary node's .env, add the following:
    • SPARK_MAX_FILE_SIZE=2g
    • SPARK_PARTITION_SIZE=128m
  6. For testing, the server-flaskrun and loader containers should be built locally. Make sure you rebuild the images before restarting the containers.
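For step 1, here is a minimal sketch of one way to create the shared mount, assuming Debian/Ubuntu hosts, a primary node reachable as `primary`, a secondary node reachable as `secondary`, and the /storage/tweetsets_data/full_datasets path used above. The hostnames and export options are illustrative, not a confirmed TweetSets recipe:

```bash
# On the primary node (NFS server)
sudo apt-get install nfs-kernel-server

# Export only the full_datasets folder, not all of tweetsets_data;
# "secondary" is a placeholder for your worker's hostname or IP
echo '/storage/tweetsets_data/full_datasets secondary(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# On each secondary node (NFS client), mount it at the same path
sudo apt-get install nfs-common
sudo mkdir -p /storage/tweetsets_data/full_datasets
sudo mount -t nfs primary:/storage/tweetsets_data/full_datasets /storage/tweetsets_data/full_datasets
```

Add a matching entry to /etc/fstab on each secondary node if you want the mount to survive reboots.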
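For steps 3 and 4, this is roughly where the new lines land; the service names and surrounding keys are placeholders, so match them to the actual structure of your compose files. In docker-compose.yml on the non-primary nodes:

```yaml
spark-worker:
  # ...existing image, network, and environment settings unchanged...
  volumes:
    - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"
```

And in loader.docker-compose.yml on the primary node (bare variable names under environment pass the host's values through to the container):

```yaml
loader:
  # ...existing settings unchanged...
  volumes:
    - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"
  environment:
    - SPARK_MAX_FILE_SIZE
    - SPARK_PARTITION_SIZE
  ports:
    - "4040:4040"  # optional: exposes the Spark jobs UI
```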

Testing

  1. Load a dataset.
  2. Verify that full extracts are created and available in the UI.
  3. Verify that extracts are downloadable and that, for all extracts except the full-tweet JSON, a small number of files are created. (For smaller datasets, each extract should have one file.)
  4. Verify that the number of (non-header) rows in the tweet-ids extract matches the number of tweets shown in the UI; one way to count these rows is sketched after this list.
  5. Create custom extracts and verify that these are downloadable and created correctly.
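For step 4, a quick way to get the row count, assuming each tweet-ids extract file is gzipped with a single header row; the dataset path is a placeholder:

```bash
# DATASET_DIR is a placeholder; point it at your dataset's extract folder
DATASET_DIR="/tweetsets_data/full_datasets/your-dataset-id"

# Sum the non-header rows across all gzipped tweet-id files,
# dropping the first (header) line of each file
for f in "$DATASET_DIR"/tweet-ids/*.gz; do
  zcat "$f" | tail -n +2
done | wc -l
```

Compare the result against the tweet count shown in the UI.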

Benchmarks

The following metrics were obtained using a subset of the Summer Olympics collection.

| Metric | Value |
| --- | --- |
| Number of workers | 1 |
| Number of cores | 2 |
| Number of tweets | 1,048,637 |
| Size on disk | 980 MB |
| Number of gzipped files | 32 |

| Operation | Time |
| --- | --- |
| RDD -> Elasticsearch | 12 min |
| tweet-ids | 50 sec |
| tweet-csv | 3.8 min |
| tweet-mentions/nodes | 1.1 min |
| tweet-mentions/edges | 55 sec |
| tweet-mentions/agg | 1 min |
| tweet-users | 1 min |
lwrubel commented 2 years ago

Would you add the new variables and suggested values to example.env?

SPARK_MAX_FILE_SIZE=2g
SPARK_PARTITION_SIZE=128m

And then could you also add something to the README about the full_datasets directory needing to be set up as a shared NFS mount, available to all nodes in the Spark cluster? That will be necessary for setting up future dev environments using the Spark loader (and for reconfiguring any current dev instances). It's not something that is described well in our current README.

lwrubel commented 2 years ago

Reviewed the updated documentation and it looks good!