TweetSets

Twitter datasets for research and archiving.

TweetSets allows users to (1) select from existing datasets; (2) limit a dataset by querying on keywords, hashtags, and other parameters; and (3) generate and download dataset derivatives such as tweet ID lists and mention nodes/edges.

Modes

TweetSets can be run in different modes. The modes determine which datasets are available and what types of dataset derivatives can be generated.

These modes make it possible to comply with the Twitter policy that prohibits sharing complete tweets with third parties.

Modes are configured in the .env file as described below.
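
For example, the mode might be set with a single variable (the variable name and values below are hypothetical; example.env documents the actual setting and its accepted values):

    # Hypothetical .env entry -- see example.env for the real variable name.
    # A local-style mode can expose complete tweets to local users, while a
    # public-style mode limits derivatives to shareable forms such as tweet IDs.
    MODE=local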

Installing

Prerequisites

Docker and Docker Compose must be installed; all of the installation steps below use them.

Installation for non-cluster Elasticsearch

  1. Create data directories on a volume with adequate storage:

    mkdir -p /tweetset_data/redis
    mkdir -p /tweetset_data/datasets
    mkdir -p /tweetset_data/full_datasets
    mkdir -p /tweetset_data/elasticsearch/esdata1
    mkdir -p /tweetset_data/elasticsearch/esdata2
    chown -R 1000:1000 /tweetset_data/elasticsearch

  2. Create a directory, to be named as you choose, where tweet data files will be stored for loading.

    mkdir /datasets_loading
  3. Clone or download this repository:

    git clone https://github.com/gwu-libraries/TweetSets.git
  4. Change to the docker directory:

    cd docker
  5. Copy the example docker files:

    cp example.docker-compose.yml docker-compose.yml
    cp example.env .env
  6. Edit .env. This file is annotated to help you select appropriate values.

  7. Create dataset_list_msg.txt in the docker directory. The contents of this file will be displayed on the dataset list page. It can be used to list other datasets that are available but not yet loaded. To leave the file empty:

    touch dataset_list_msg.txt
  8. Bring up the containers:

    docker-compose up -d

For HTTPS support, uncomment and configure the nginx-proxy container in docker-compose.yml.
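
As a rough sketch of the shape of such a service (the commented-out definition in example.docker-compose.yml is authoritative; the image and volume paths below are placeholders):

    nginx-proxy:
      image: jwilder/nginx-proxy    # placeholder; use the image named in the example file
      ports:
        - "80:80"
        - "443:443"
      volumes:
        - /path/to/certs:/etc/nginx/certs:ro           # TLS certificate and key
        - /var/run/docker.sock:/tmp/docker.sock:ro     # lets the proxy discover containers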

Cluster installation

Clusters must have at least a primary node and two additional nodes.

Primary node

  1. Create data directories on a volume with adequate storage. Note that in order to use the Spark loader, the full_datasets and datasets_loading directories (see below) will need to be shared between the primary and cluster nodes as NFS mounts. (The other directories do not need to be shared.)

    mkdir -p /tweetset_data/redis
    mkdir -p /tweetset_data/datasets
    mkdir -p /tweetset_data/full_datasets
    mkdir -p /tweetset_data/elasticsearch
    chown -R 1000:1000 /tweetset_data/elasticsearch
  2. Create a directory, to be named as you choose, where tweet data files will be stored for loading.

    mkdir /datasets_loading
  3. Set up the tweetset_data/full_datasets and datasets_loading NFS mounts so that both directories are shared between the primary and cluster nodes; a minimal sketch appears after this list.

  4. Clone or download this repository:

    git clone https://github.com/gwu-libraries/TweetSets.git
  5. Change to the docker directory:

    cd docker
  6. Copy the example docker files:

    cp example.cluster-primary.docker-compose.yml docker-compose.yml
    cp example.env .env
  7. Update .env. This file is annotated to help you select appropriate values.

  8. Create dataset_list_msg.txt in the docker directory. The contents of this file will be displayed on the dataset list page. It can be used to list other datasets that are available but not yet loaded. To leave the file empty:

    touch dataset_list_msg.txt
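
A minimal sketch of the share, assuming a plain NFS setup (the hostnames, network ranges, and mount options here are illustrative):

    # /etc/exports on the primary node
    /tweetset_data/full_datasets  10.0.0.0/24(rw,sync,no_subtree_check)
    /datasets_loading             10.0.0.0/24(rw,sync,no_subtree_check)

    # /etc/fstab on each cluster node
    primary-host:/tweetset_data/full_datasets  /tweetset_data/full_datasets  nfs  defaults  0  0
    primary-host:/datasets_loading             /datasets_loading             nfs  defaults  0  0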

For HTTPS support, uncomment and configure the nginx-proxy container in docker-compose.yml.

Cluster node(s)

  1. Create data directories on a volume with adequate storage:

    mkdir -p /tweetset_data/elasticsearch
    mkdir -p /tweetset_data/full_datasets
    chown -R 1000:1000 /tweetset_data/elasticsearch
    mkdir /datasets_loading
  2. Clone or download this repository:

    git clone https://github.com/gwu-libraries/TweetSets.git
  3. Set up the tweetset_data/full_datasets and datasets_loading NFS mounts, as described above for the primary node.

  4. Change to the docker directory:

    cd docker
  5. Copy the example docker files:

    cp example.cluster-node.docker-compose.yml docker-compose.yml
    cp example.cluster-node.env .env
  6. Edit .env. This file is annotated to help you select appropriate values. Note that two cluster nodes must have MASTER set to true.

  7. Bring up the containers, starting with the cluster nodes and then moving to the primary node:

    docker-compose up -d

Loading a source dataset

Prepping the source dataset

  1. Create a dataset directory within the dataset filepath configured in your .env.
  2. Place tweet files in the directory. The tweet files can be line-oriented JSON (.json) or gzip compressed line-oriented JSON (.json.gz).
  3. Create a dataset description file in the directory named dataset.json. See example.dataset.json for the format of the file.
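
As a rough illustration of the shape of dataset.json (the field names here are hypothetical; example.dataset.json in the repository is the authoritative reference):

    {
        "name": "Example dataset",
        "description": "Tweets collected on an example topic, 2020.",
        "creators": ["A. Researcher"]
    }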

Loading

Use this method when Elasticsearch is on the same machine as TweetSets (the non-cluster option) or when otherwise loading without Spark.

  1. Start and connect to a loader container:

    docker-compose run --rm loader /bin/bash
  2. Invoke the loader:

    python tweetset_loader.py create /dataset/path/to

To see other loader commands:

    python tweetset_loader.py

Note that tweets are never added to an existing index. The reload command creates a new index for the dataset and swaps it in to replace the existing index only once it is complete, so users are not affected by reloading.
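
For example, a reload presumably takes the same arguments as the Spark variant shown later (run python tweetset_loader.py with no arguments to confirm the exact commands and arguments):

    python tweetset_loader.py reload dataset-id /dataset/path/to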

Loading with Apache Spark

When using the Spark loader, the dataset files must be located at the dataset filepath on all nodes. All nodes must also have access to a shared directory (tweetset_data/full_datasets) for creating the full extracts. For full extracts, this process is more efficient than the manual method described below ("Creating a manual extract (dataset)").

In general, using Spark within Docker is tricky because the Spark driver, Spark master, and Spark nodes all need to be able to communicate, and the ports are dynamically selected. (Some of the ports can be fixed, but supporting multiple simultaneous loaders requires leaving some dynamic.) This doesn't play well with Docker's port mapping, since the hostnames and ports that Spark advertises internally must match what is available through Docker. Further complicating this is that host networking (which is used to support the dynamic ports) does not work correctly on Mac. Use the regular loader rather than the Spark loader when Elasticsearch is on the same machine as TweetSets (e.g., in a small development environment, not a cluster).

Cluster mode

  1. Start and connect to a loader container:

    docker-compose -f loader.docker-compose.yml run --rm loader /bin/bash
  2. Invoke the loader:

    spark-submit \
    --jars elasticsearch-hadoop.jar \
    --master spark://$SPARK_MASTER_HOST:7101 \
    --py-files dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.host=$SPARK_DRIVER_HOST \
    --conf spark.driver.port=7003 \
    --conf spark.blockManager.port=7020 \
    tweetset_loader.py spark-create /dataset/path/to
  3. Extracts will be stored in /tweetset_data/full_datasets and will be visible in the UI.

Reloading an existing set with Apache Spark

  1. Start and connect to a loader container:

    docker-compose -f loader.docker-compose.yml run --rm loader /bin/bash
  2. Invoke the loader:

    spark-submit \
    --jars elasticsearch-hadoop.jar \
    --master spark://$SPARK_MASTER_HOST:7101 \
    --py-files dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.host=$SPARK_DRIVER_HOST \
    --conf spark.driver.port=7003 \
    --conf spark.blockManager.port=7020 \
    tweetset_loader.py spark-reload dataset-id /dataset/path/to

where dataset-id is the ID of the dataset, which can be found in the collection's ID metadata field in the TweetSets UI.

Note that running spark-reload does not re-read dataset.json and update the dataset's descriptive metadata. If dataset.json has changed, update the descriptive metadata to match it by invoking the loader with the update command:

    spark-submit \
    --jars elasticsearch-hadoop.jar \
    --master spark://$SPARK_MASTER_HOST:7101 \
    --py-files dist/TweetSets-2.2.0-py3.8.egg,dependencies.zip \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.host=$SPARK_DRIVER_HOST \
    --conf spark.driver.port=7003 \
    --conf spark.blockManager.port=7020 \
    tweetset_loader.py update dataset-id /dataset/path/to

Creating a manual extract (dataset)

Full extracts of existing datasets can be created from the command line. Note that this command does not use the Spark loader, so it will not generate mention or user tweet count files. It generates zipped (.zip) JSON, CSV, and tweet ID files.

  1. Launch a shell session in the server container:

    docker exec -it ts_server_1 /bin/bash

    or

    docker exec -it ts_server-flaskrun_1 /bin/bash
  2. Issue the command to create the extract, where dataset-id is the ID of the dataset, which can be found in the collection's ID metadata field in the TweetSets UI:

    flask create-extract dataset-id
  3. Upon completion, an email will be sent to the address in the ADMIN_EMAIL field of the .env file.

Kibana

Elastic's Kibana is a general-purpose framework for exploring, analyzing, and visualizing data. Since the tweets are already indexed in Elasticsearch, they are ready to be used from Kibana.

To enable Kibana, uncomment the Kibana service in your docker-compose.yml. By default, Kibana will run on port 5601.
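
Roughly, the service has this shape (the commented-out definition in your docker-compose.yml is authoritative; the image tag below is illustrative and should match your Elasticsearch version):

    kibana:
      image: docker.elastic.co/kibana/kibana:7.9.3    # illustrative tag; match your Elasticsearch version
      ports:
        - "5601:5601"
      environment:
        - ELASTICSEARCH_HOSTS=http://elasticsearch:9200    # assumes the Elasticsearch service is named "elasticsearch"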

Citing

Please cite TweetSets as:

    Justin Littman, Laura Wrubel, Dan Kerchner, Dolsy Smith, Will Bonnett. (2020). TweetSets. Zenodo. https://doi.org/10.5281/zenodo.1289426

Development

Unit tests

Run the unit tests outside the container:

    python -m unittest

The Spark loader has its own set of unit tests. These will be copied to the TweetSets/tests directory when creating the loader container. Run them within the loader container with python -m unittest.
