gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.

T140 path to extracts: #140 #153

Closed dolsysmith closed 2 years ago

dolsysmith commented 3 years ago

Features

Setup

  1. The full_datasets folder must be a shared NFS mount available to all nodes in the Spark cluster; a sketch of one way to set this up follows this list. (On my VM, I moved the tweetsets_data folder to /storage on both the primary and secondary nodes, then mapped the full_datasets folder on the primary to the same location on the secondary VM.)
    • Note: you don't want to share the entire tweetsets_data folder, as that will likely cause problems for Elasticsearch.
    • I initially tried mapping /storage/tweetsets_data/full_datasets (VM 1) to a folder on VM 2 in /home/dsmith, but that did not seem to work.
  2. Update your .env files accordingly with the new paths, if necessary.
  3. On your non-primary nodes, update docker-compose.yml as follows (a compose sketch covering this step and the next also follows this list):
    • Add the following line to the spark-worker section, under volumes: ${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets
  4. On the primary node, update loader.docker-compose.yml as follows:
    • Under volumes, add ${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets
    • Under environment, add
      • SPARK_MAX_FILE_SIZE
      • SPARK_PARTITION_SIZE
    • Optional: add the following to expose the Spark jobs UI:
      ports:
        - 4040:4040
  5. To your primary node's .env, add the following:
    • SPARK_MAX_FILE_SIZE=2g
    • SPARK_PARTITION_SIZE=128m
  6. For testing, the server-flaskrun and loader containers should be built locally. Make sure you rebuild the images before restarting the containers.
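For step 1, here is a minimal sketch of one way to create the shared mount, assuming Debian/Ubuntu hosts, a primary node reachable as `primary`, a secondary node reachable as `secondary`, and the /storage/tweetsets_data/full_datasets path used above. The hostnames and export options are illustrative, not a confirmed TweetSets recipe:

```bash
# On the primary node (NFS server)
sudo apt-get install nfs-kernel-server

# Export only the full_datasets folder, not all of tweetsets_data;
# "secondary" is a placeholder for your worker's hostname or IP
echo '/storage/tweetsets_data/full_datasets secondary(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# On each secondary node (NFS client), mount it at the same path
sudo apt-get install nfs-common
sudo mkdir -p /storage/tweetsets_data/full_datasets
sudo mount -t nfs primary:/storage/tweetsets_data/full_datasets /storage/tweetsets_data/full_datasets
```

Add a matching entry to /etc/fstab on each secondary node if you want the mount to survive reboots.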
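For steps 3 and 4, this is roughly where the new lines land; the service names and surrounding keys are placeholders, so match them to the actual structure of your compose files. In docker-compose.yml on the non-primary nodes:

```yaml
spark-worker:
  # ...existing image, network, and environment settings unchanged...
  volumes:
    - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"
```

And in loader.docker-compose.yml on the primary node (bare variable names under environment pass the host's values through to the container):

```yaml
loader:
  # ...existing settings unchanged...
  volumes:
    - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"
  environment:
    - SPARK_MAX_FILE_SIZE
    - SPARK_PARTITION_SIZE
  ports:
    - "4040:4040"  # optional: exposes the Spark jobs UI
```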

Testing

  1. Load a dataset.
  2. Verify that full extracts are created and available in the UI.
  3. Verify that extracts are downloadable and that, for all extracts except the full-tweet JSON, a small number of files are created. (For smaller datasets, each extract should have one file.)
  4. Verify that the number of (non-header) rows in the tweet-ids extract matches the number of tweets shown in the UI; one way to count these rows is sketched after this list.
  5. Create custom extracts and verify that these are downloadable and created correctly.
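For step 4, a quick way to get the row count, assuming each tweet-ids extract file is gzipped with a single header row; the dataset path is a placeholder:

```bash
# DATASET_DIR is a placeholder; point it at your dataset's extract folder
DATASET_DIR="/tweetsets_data/full_datasets/your-dataset-id"

# Sum the non-header rows across all gzipped tweet-id files,
# dropping the first (header) line of each file
for f in "$DATASET_DIR"/tweet-ids/*.gz; do
  zcat "$f" | tail -n +2
done | wc -l
```

Compare the result against the tweet count shown in the UI.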

Benchmarks

The following metrics were obtained using a subset of the Summer Olympics collection.

| Metric | Value |
| --- | --- |
| Number of workers | 1 |
| Number of cores | 2 |
| Number of tweets | 1,048,637 |
| Size on disk | 980 MB |
| Number of gzipped files | 32 |

| Operation | Time |
| --- | --- |
| RDD -> Elasticsearch | 12 min |
| tweet-ids | 50 sec |
| tweet-csv | 3.8 min |
| tweet-mentions/nodes | 1.1 min |
| tweet-mentions/edges | 55 sec |
| tweet-mentions/agg | 1 min |
| tweet-users | 1 min |
lwrubel commented 2 years ago

Would you add the new variables and suggested values to example.env?

SPARK_MAX_FILE_SIZE=2g
SPARK_PARTITION_SIZE=128m

And then could you also add something to the README about the full_datasets directory needing to be set up as a shared NFS mount, available to all nodes in the Spark cluster? That will be necessary for setting up future dev environments using the Spark loader (and for reconfiguring any current dev instances). It's not something that is described well in our current README.

lwrubel commented 2 years ago

Reviewed the updated documentation and it looks good!