TweetSets

Twitter datasets for research and archiving.

TweetSets allows users to (1) select from existing datasets; (2) limit a dataset by querying on keywords, hashtags, and other parameters; and (3) generate and download dataset derivatives such as tweet ID lists and mention nodes/edges.

Modes

TweetSets can be run in different modes. The modes determine which datasets are available and what types of dataset derivatives can be generated.

These modes make it possible to comply with the Twitter policy that prohibits sharing complete tweets with third parties.

Modes are configured in the .env file as described below.
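
For example, the mode might be set with a single variable (the variable name and values below are hypothetical; example.env documents the actual setting and its accepted values):

    # Hypothetical .env entry -- see example.env for the real variable name.
    # A local-style mode can expose complete tweets to local users, while a
    # public-style mode limits derivatives to shareable forms such as tweet IDs.
    MODE=local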

Installing

Prerequisites

Docker and Docker Compose must be installed; all of the installation steps below use them.

Installation for non-cluster Elasticsearch

  1. Create data directories on a volume with adequate storage:

    mkdir -p /tweetset_data/redis
    mkdir -p /tweetset_data/datasets
    mkdir -p /tweetset_data/full_datasets
    mkdir -p /tweetset_data/elasticsearch/esdata1
    mkdir -p /tweetset_data/elasticsearch/esdata2
    chown -R 1000:1000 /tweetset_data/elasticsearch

  2. Create a directory, to be named as you choose, where tweet data files will be stored for loading.

    mkdir /datasets_loading
  3. Clone or download this repository:

    git clone https://github.com/gwu-libraries/TweetSets.git
  4. Change to the docker directory:

    cd docker
  5. Copy the example docker files:

    cp example.docker-compose.yml docker-compose.yml
    cp example.env .env
  6. Edit .env. This file is annotated to help you select appropriate values.

  7. Create dataset_list_msg.txt in the docker directory. The contents of this file will be displayed on the dataset list page. It can be used to list other datasets that are available but not yet loaded. To leave the file empty:

    touch dataset_list_msg.txt
  8. Bring up the containers:

    docker-compose up -d

For HTTPS support, uncomment and configure the nginx-proxy container in docker-compose.yml.
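
As a rough sketch of the shape of such a service (the commented-out definition in example.docker-compose.yml is authoritative; the image and volume paths below are placeholders):

    nginx-proxy:
      image: jwilder/nginx-proxy    # placeholder; use the image named in the example file
      ports:
        - "80:80"
        - "443:443"
      volumes:
        - /path/to/certs:/etc/nginx/certs:ro           # TLS certificate and key
        - /var/run/docker.sock:/tmp/docker.sock:ro     # lets the proxy discover containers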

Cluster installation

Clusters must have at least a primary node and two additional nodes.

Primary node

  1. Create data directories on a volume with adequate storage. Note that in order to use the Spark loader, the full_datasets and datasets_loading directories (see below) will need to be shared between the primary and cluster nodes as NFS mounts. (The other directories do not need to be shared.)

    mkdir -p /tweetset_data/redis
    mkdir -p /tweetset_data/datasets
    mkdir -p /tweetset_data/full_datasets
    mkdir -p /tweetset_data/elasticsearch
    chown -R 1000:1000 /tweetset_data/elasticsearch
  2. Create a directory, to be named as you choose, where tweet data files will be stored for loading.

    mkdir /datasets_loading
  3. Set up the tweetset_data/full_datasets and datasets_loading NFS mounts so that both directories are shared between the primary and cluster nodes; a minimal sketch appears after this list.

  4. Clone or download this repository:

    git clone https://github.com/gwu-libraries/TweetSets.git
  5. Change to the docker directory:

    cd docker
  6. Copy the example docker files:

    cp example.cluster-primary.docker-compose.yml docker-compose.yml
    cp example.env .env
  7. Update .env. This file is annotated to help you select appropriate values.

  8. Create dataset_list_msg.txt in the docker directory. The contents of this file will be displayed on the dataset list page. It can be used to list other datasets that are available but not yet loaded. To leave the file empty:

    touch dataset_list_msg.txt
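
A minimal sketch of the share, assuming a plain NFS setup (the hostnames, network ranges, and mount options here are illustrative):

    # /etc/exports on the primary node
    /tweetset_data/full_datasets  10.0.0.0/24(rw,sync,no_subtree_check)
    /datasets_loading             10.0.0.0/24(rw,sync,no_subtree_check)

    # /etc/fstab on each cluster node
    primary-host:/tweetset_data/full_datasets  /tweetset_data/full_datasets  nfs  defaults  0  0
    primary-host:/datasets_loading             /datasets_loading             nfs  defaults  0  0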

For HTTPS support, uncomment and configure the nginx-proxy container in docker-compose.yml.

Cluster node(s)

  1. Create data directories on a volume with adequate storage:

    mkdir -p /tweetset_data/elasticsearch
    mkdir -p /tweetset_data/full_datasets
    chown -R 1000:1000 /tweetset_data/elasticsearch
    mkdir /datasets_loading
  2. Clone or download this repository:

    git clone https://github.com/gwu-libraries/TweetSets.git
  3. Set up the tweetset_data/full_datasets and datasets_loading NFS mounts, as described above for the primary node.

  4. Change to the docker directory:

    cd docker
  5. Copy the example docker files:

    cp example.cluster-node.docker-compose.yml docker-compose.yml
    cp example.cluster-node.env .env
  6. Edit .env. This file is annotated to help you select appropriate values. Note that two cluster nodes must have MASTER set to true.

  7. Bring up the containers, starting with the cluster nodes and then moving to the primary node:

    docker-compose up -d

Loading a source dataset

Prepping the source dataset

  1. Create a dataset directory within the dataset filepath configured in your .env.
  2. Place tweet files in the directory. The tweet files can be line-oriented JSON (.json) or gzip compressed line-oriented JSON (.json.gz).
  3. Create a dataset description file in the directory named dataset.json. See example.dataset.json for the format of the file.
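
As a rough illustration of the shape of dataset.json (the field names here are hypothetical; example.dataset.json in the repository is the authoritative reference):

    {
        "name": "Example dataset",
        "description": "Tweets collected on an example topic, 2020.",
        "creators": ["A. Researcher"]
    }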

Loading

Use this method when Elasticsearch is on the same machine as TweetSets (the non-cluster option) or when otherwise loading without Spark.

  1. Start and connect to a loader container:

    docker-compose run --rm loader /bin/bash
  2. Invoke the loader:

    python tweetset_loader.py create /dataset/path/to

To see other loader commands:

    python tweetset_loader.py

Note that tweets are never added to an existing index. The reload command creates a new index for the dataset and swaps it in to replace the existing index only once it is complete, so users are not affected by reloading.
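
For example, a reload presumably takes the same arguments as the Spark variant shown later (run python tweetset_loader.py with no arguments to confirm the exact commands and arguments):

    python tweetset_loader.py reload dataset-id /dataset/path/to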

Loading with Apache Spark

When using the Spark loader, the dataset files must be located at the dataset filepath on all nodes. All nodes must also have access to a shared directory (tweetset_data/full_datasets) for creating the full extracts. For full extracts, this process is more efficient than the manual method described below ("Creating a manual extract (dataset)").

In general, using Spark within Docker is tricky because the Spark driver, Spark master, and Spark nodes all need to be able to communicate, and the ports are dynamically selected. (Some of the ports can be fixed, but supporting multiple simultaneous loaders requires leaving some dynamic.) This doesn't play well with Docker's port mapping, since the hostnames and ports that Spark advertises internally must match what is available through Docker. Further complicating this is that host networking (which is used to support the dynamic ports) does not work correctly on Mac. Use the regular loader rather than the Spark loader when Elasticsearch is on the same machine as TweetSets (e.g., in a small development environment, not a cluster).

Cluster mode

  1. Start and connect to a loader container:

    docker-compose -f loader.docker-compose.yml run --rm loader /bin/bash
  2. Invoke the loader:

    spark-submit \
    --jars elasticsearch-hadoop.jar \
    --master spark://$SPARK_MASTER_HOST:7101 \
    --py-files dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.host=$SPARK_DRIVER_HOST \
    --conf spark.driver.port=7003 \
    --conf spark.blockManager.port=7020 \
    tweetset_loader.py spark-create /dataset/path/to
  3. Extracts will be stored in /tweetset_data/full_datasets and will be visible in the UI.

Reloading an existing set with Apache Spark

  1. Start and connect to a loader container:

    docker-compose -f loader.docker-compose.yml run --rm loader /bin/bash
  2. Invoke the loader:

    spark-submit \
    --jars elasticsearch-hadoop.jar \
    --master spark://$SPARK_MASTER_HOST:7101 \
    --py-files dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.host=$SPARK_DRIVER_HOST \
    --conf spark.driver.port=7003 \
    --conf spark.blockManager.port=7020 \
    tweetset_loader.py spark-reload dataset-id /dataset/path/to

where dataset-id is the ID of the dataset, which can be found in the collection's ID metadata field in the TweetSets UI.

Note that running spark-reload does not re-read dataset.json and update the dataset's descriptive metadata. If dataset.json has changed, update the descriptive metadata to match it by invoking the loader with the update command:

    spark-submit \
    --jars elasticsearch-hadoop.jar \
    --master spark://$SPARK_MASTER_HOST:7101 \
    --py-files dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.host=$SPARK_DRIVER_HOST \
    --conf spark.driver.port=7003 \
    --conf spark.blockManager.port=7020 \
    tweetset_loader.py update dataset-id /dataset/path/to

Creating a manual extract (dataset)

Full extracts of existing datasets can be created from the command line. Note that this command does not use the Spark loader, so it will not generate mention or user tweet count files. It generates zipped (.zip) JSON, CSV, and tweet ID files.

  1. Launch a shell session in the server container:

    docker exec -it ts_server_1 /bin/bash

    or

    docker exec -it ts_server-flaskrun_1 /bin/bash
  2. Issue the command to create the extract, where dataset-id is the ID of the dataset, which can be found in the collection's ID metadata field in the TweetSets UI:

    flask create-extract dataset-id
  3. Upon completion, an email will be sent to the address in the ADMIN_EMAIL field of the .env file.

Kibana

Elastic's Kibana is a general-purpose framework for exploring, analyzing, and visualizing data. Since the tweets are already indexed in Elasticsearch, they are ready to be used from Kibana.

To enable Kibana, uncomment the Kibana service in your docker-compose.yml. By default, Kibana will run on port 5601.
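
Roughly, the service has this shape (the commented-out definition in your docker-compose.yml is authoritative; the image tag below is illustrative and should match your Elasticsearch version):

    kibana:
      image: docker.elastic.co/kibana/kibana:7.9.3    # illustrative tag; match your Elasticsearch version
      ports:
        - "5601:5601"
      environment:
        - ELASTICSEARCH_HOSTS=http://elasticsearch:9200    # assumes the Elasticsearch service is named "elasticsearch"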

Citing

Please cite TweetSets as:

    Justin Littman, Laura Wrubel, Dan Kerchner, Dolsy Smith, Will Bonnett. (2020). TweetSets. Zenodo. https://doi.org/10.5281/zenodo.1289426

Development

Unit tests

Run the unit tests outside the container:

    python -m unittest

The Spark loader has its own set of unit tests. These will be copied to the TweetSets/tests directory when creating the loader container. Run them within the loader container with python -m unittest.
