gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

T31 upgrade pyspark (Fixes #31) #68

Closed: dolsysmith closed 3 years ago

dolsysmith commented 3 years ago

Instructions:

On the primary node's VM, do the following:

docker rmi ts_worker ts_spark-master
docker-compose build --no-cache

On the secondary VM, do the following:

docker rmi ts_spark-worker
docker-compose build --no-cache

Then run docker-compose up -d on both VMs.

Please test loading with the spark-loader:

docker-compose -f loader.docker-compose.yml run --rm loader /bin/bash

spark-submit \
 --jars elasticsearch-hadoop.jar \
 --master spark://$SPARK_MASTER_HOST:7101 \
 --py-files dist/TweetSets-2.0-py3.6.egg,dependencies.zip \
 --conf spark.driver.bindAddress=0.0.0.0 \
 --conf spark.driver.host=$SPARK_DRIVER_HOST \
 tweetset_loader.py spark-create /dataset/path/to/files
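The spark-submit command above only resolves if both SPARK_MASTER_HOST and SPARK_DRIVER_HOST are set inside the loader container. As a minimal sketch of a pre-flight check (the require_env helper is hypothetical and not part of TweetSets), something like the following could be run first:

```shell
# Hypothetical helper, not part of TweetSets: returns non-zero if any
# of the named environment variables is unset or empty.
require_env() {
  _missing=0
  for _name in "$@"; do
    # Look up the value of the variable whose name is stored in $_name.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "missing environment variable: $_name" >&2
      _missing=1
    fi
  done
  return "$_missing"
}

# Example use before submitting the job:
#   require_env SPARK_MASTER_HOST SPARK_DRIVER_HOST && spark-submit ...
```

Chaining the check with && means spark-submit is skipped if either variable is missing, instead of failing later with a malformed spark:// URL.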

Note: I had to comment out the image line and uncomment the build instructions in loader.docker-compose.yml. Those changes are part of this commit, but we'll want to revert them once the image is updated in Docker.

lwrubel commented 3 years ago

Tested spark-create and spark-reload commands and reviewed export from resulting tweetsets. Works as expected.

To keep the code clean, I'd suggest changing loader.docker-compose.yml back to use the image line. We can then update our local instances to use the build option. Once that's done, go ahead and merge.