Would you add the new variables and their suggested values to `example.env`?

```
SPARK_MAX_FILE_SIZE=2g
SPARK_PARTITION_SIZE=128m
```
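For instance, the entries might be documented along these lines (the explanatory comments are my guess at what the variables control, based on their names):

```
# Maximum size of each output file written by the Spark loader
SPARK_MAX_FILE_SIZE=2g
# Target partition size for the Spark job
SPARK_PARTITION_SIZE=128m
```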
And then could you also add something to the README about the `full_datasets` directory needing to be set up as a shared NFS mount, available to all nodes in the Spark cluster? That will be necessary for setting up future dev environments using the Spark loader (and for reconfiguring any current dev instances). It's not something that was described well in our current README to begin with.
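A minimal sketch of what that README section might cover, assuming an Ubuntu-style environment; the hostnames `primary-node` and `secondary-node` and the `/storage` paths are illustrative, not the project's actual values:

```sh
# On the primary node (NFS server): export the full_datasets folder
sudo apt-get install -y nfs-kernel-server
echo '/storage/tweetsets_data/full_datasets secondary-node(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# On each worker node (NFS client): mount it at the same path
sudo apt-get install -y nfs-common
sudo mkdir -p /storage/tweetsets_data/full_datasets
sudo mount -t nfs primary-node:/storage/tweetsets_data/full_datasets /storage/tweetsets_data/full_datasets
```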
Reviewed the updated documentation and it looks good!
### Features

- Loads datasets from the `full_datasets` path (assuming this is or can be configured as an NFS mount).
- Moves files from `dataset_loading` to `tweetsets-data/full_datasets` (or equivalent paths as defined in `.env`).
- Maximum file size is set via an `.env` variable.

### Setup
1. The `full_datasets` folder must be a shared NFS mount available to all nodes in the Spark cluster. (On my VM, I moved the `tweetsets_data` folder to `/storage` on both the primary and secondary nodes, then mapped the `full_datasets` folder on the primary to the same location on the secondary VM.)
   - Don't share the whole `tweetsets_data` folder, as that will likely cause problems for Elasticsearch.
   - I initially tried mapping `/storage/tweetsets_data/full_datasets` (VM 1) to a folder on VM 2 under `/home/dsmith`, but that did not seem to work.
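   As a minimal sketch (assuming the `/storage` layout above and that the primary node is reachable by the hostname `primary-node`), the mount can be made persistent on the secondary node via `/etc/fstab`:

   ```
   # /etc/fstab on the secondary node: remount the shared folder at boot
   primary-node:/storage/tweetsets_data/full_datasets  /storage/tweetsets_data/full_datasets  nfs  defaults  0  0
   ```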
2. Update the `.env` files accordingly with the new paths, if necessary.
3. Update `docker-compose.yml` as follows: in the `spark-worker` section, under `volumes`, add:
   ```
   ${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets
   ```
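   For context, a sketch of the resulting service definition; the surrounding keys are elided, and any existing mappings are assumptions:

   ```yaml
   spark-worker:
     # ...existing configuration...
     volumes:
       # existing volume mappings, followed by the new shared mount:
       - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"
   ```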
4. Update `loader.docker-compose.yml` as follows:
   - Under `volumes`, add `${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets`.
   - Under `environment`, add `SPARK_MAX_FILE_SIZE` and `SPARK_PARTITION_SIZE`.
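   A sketch of the loader service with both additions; the `${VAR}` substitution form is one way to pull the values from `.env`, and the surrounding keys are elided:

   ```yaml
   loader:
     # ...existing configuration...
     volumes:
       - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"
     environment:
       # pass the new .env settings through to the container
       - SPARK_MAX_FILE_SIZE=${SPARK_MAX_FILE_SIZE}
       - SPARK_PARTITION_SIZE=${SPARK_PARTITION_SIZE}
   ```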
5. In `.env`, add the following:
   ```
   SPARK_MAX_FILE_SIZE=2g
   SPARK_PARTITION_SIZE=128m
   ```
6. The `server-flaskrun` and `loader` containers should be built locally. Make sure you rebuild the images before restarting the containers.
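   A sketch of the rebuild step, using the compose files and service names referenced above:

   ```sh
   # Rebuild the locally built images, then restart
   docker-compose build server-flaskrun
   docker-compose -f loader.docker-compose.yml build loader
   docker-compose up -d
   ```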
### Testing

Confirm that the number of IDs in the `tweet-ids` extract matches the number of tweets shown in the UI.
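One way to check, as a sketch (the extract filename here is hypothetical):

```sh
# Count the IDs in the downloaded tweet-ids extract and compare
# against the tweet count shown in the TweetSets UI
zcat tweet-ids-001.txt.gz | wc -l
```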
### Benchmarks

The following metrics were obtained using a subset of the Summer Olympics collection.