medialab / gazouilloire

Twitter stream + search API grabber
GNU General Public License v3.0

A command line tool for long-term tweet collection. Gazouilloire combines two methods to collect tweets from the Twitter API ("search" and "filter") in order to maximize the number of collected tweets, and automatically fills the gaps in the collection in case of connection errors or reboots. It handles various config options, described in the "Advanced parameters" section.

Python >= 3.7 compatible.

Your Twitter API keys must have been created before April 29, 2022, in order to fully use the tool. If your keys were created after that date, Gazouilloire will work with the "search" endpoint only, and not the "filter". See Twitter's documentation about this issue.

Summary

Installation

Quick start

Note that ElasticSearch database names must be lowercase and must not contain spaces or accented characters.
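The naming constraint above can be checked before starting a collection. The following is a minimal sketch in Python; the function name is_valid_db_name is hypothetical and not part of Gazouilloire, and it only tests the three rules stated here (ElasticSearch itself may enforce additional restrictions).

```python
import unicodedata


def is_valid_db_name(name):
    """Check the three constraints stated above (hypothetical helper)."""
    # Must be non-empty and entirely lowercase.
    if not name or name != name.lower():
        return False
    # Must not contain any whitespace.
    if any(ch.isspace() for ch in name):
        return False
    # Decompose accented letters into base letter + combining mark,
    # then reject the name if any combining mark is present.
    decomposed = unicodedata.normalize("NFD", name)
    return not any(unicodedata.combining(ch) for ch in decomposed)


for candidate in ["my_collection", "My_Collection", "my collection", "élections"]:
    print(candidate, is_valid_db_name(candidate))
```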

Disk space

Before starting the collection, make sure that you have enough disk space: it takes about 1GB per million tweets collected (excluding images and other media content).

You should also consider starting gazouilloire in multi-index mode if the collection is planned to exceed 100 million tweets, or simply restarting your collection in a new folder with a new db_name (i.e. opening another ElasticSearch index) if the current collection exceeds 150 million tweets.

As a point of comparison, here is the number of tweets sent during the whole year 2021 containing certain keywords (the values were obtained with the API V2 tweets count endpoint):

Query               Number of tweets in 2021
lemondefr lang:fr   3 million
macron lang:fr      21 million
vaccine             176 million
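To turn these figures into a rough storage budget, the rule of thumb above (about 1GB per million tweets, media excluded) can be applied directly. This is only a back-of-the-envelope sketch in Python; the function names are hypothetical and the real footprint will depend on your data.

```python
GB_PER_MILLION_TWEETS = 1.0          # rule of thumb from the text, media excluded
MULTI_INDEX_THRESHOLD = 100_000_000  # above this, multi-index mode is suggested


def estimate_disk_gb(expected_tweets):
    """Approximate disk space in GB for a given number of tweets."""
    return expected_tweets / 1_000_000 * GB_PER_MILLION_TWEETS


def should_use_multi_index(expected_tweets):
    """True when the expected volume exceeds the 100 million tweet mark."""
    return expected_tweets > MULTI_INDEX_THRESHOLD


# Yearly volumes taken from the comparison table above.
yearly_volumes = [
    ("lemondefr lang:fr", 3_000_000),
    ("macron lang:fr", 21_000_000),
    ("vaccine", 176_000_000),
]
for query, tweets in yearly_volumes:
    print(query, estimate_disk_gb(tweets), should_use_multi_index(tweets))
```

With these numbers, only the "vaccine" query would call for multi-index mode over a full year.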

Export the tweets in CSV format

Data is stored in your ElasticSearch, which you can query directly. You can also export it easily in CSV format:

# Export all fields from all tweets, sorted in chronological order:
gazou export

Sort tweets

By default, tweets are sorted in chronological order, using the "timestamp_utc" field. However, you can speed up the export by specifying that you do not need any sort order:

gazou export --sort no

You can also sort tweets using one or several other sorting keys:

gazou export --sort collection_time

gazou export --sort user_id,user_screen_name

Please note that:

Write into a file

By default, the export command writes to stdout. You can also use the -o option to write into a file:

gazou export > my_tweets_file.csv
# is equivalent to
gazou export -o my_tweets_file.csv

However, if you interrupt the export and need to resume it later to complete it in several runs, note that the --resume option only works together with the -o option.
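Once written, the CSV file can be processed with any CSV-aware tool. As an illustration, here is a minimal Python sketch that counts tweets per user. It assumes the export contains a user_screen_name column (one of the fields used in the examples of this document); the sample rows are made up and the helper name is hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for the content of an exported file.
sample_export = io.StringIO(
    "id,user_screen_name,text\n"
    "1,medialab_ScPo,Hello world\n"
    "2,medialab_ScPo,Another tweet\n"
    "3,someone_else,Yet another one\n"
)


def count_tweets_per_user(csv_file):
    """Count rows per user_screen_name in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["user_screen_name"] for row in reader)


counts = count_tweets_per_user(sample_export)
print(counts.most_common())
```

For a real export, replace the sample with `open("my_tweets_file.csv", newline="", encoding="utf-8")`.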

Query specific keywords

Export all tweets containing "medialab" in the text field:

gazou export medialab

The search engine is not case sensitive and it escapes # or @: gazou export sciencespo will export tweets containing "@sciencespo" or "#SciencesPo". However, it is sensitive to accents: gazou export medialab will not return tweets containing "médialab".
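The matching behaviour described above can be pictured with a small Python model. This is not how ElasticSearch analyses text internally; it is only a sketch, with a hypothetical function name, that reproduces the three stated properties: case is ignored, leading # and @ are ignored, accents are significant.

```python
import re

# A token is a run of word characters, optionally preceded by # or @.
TOKEN_PATTERN = re.compile(r"[#@]?\w+")


def matches_keyword(keyword, text):
    """Simplified model of the keyword matching described above."""
    wanted = keyword.lstrip("#@").lower()
    tokens = (tok.lstrip("#@").lower() for tok in TOKEN_PATTERN.findall(text))
    # Accents are left untouched, so "médialab" and "medialab" differ.
    return wanted in tokens


print(matches_keyword("sciencespo", "Welcome to #SciencesPo"))
print(matches_keyword("sciencespo", "Thanks @sciencespo"))
print(matches_keyword("medialab", "Le médialab recrute"))
```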

Use lucene query syntax with the --lucene option in order to write more complex queries:

Other available options:


# Get documentation for all options of gazou export (-h or --help)
gazou export -h

# By default, the export will show a progressbar, which you can disable like this:
gazou export --quiet

# Export a csv of all tweets between 2 dates or datetimes (--since is inclusive and --until exclusive):
gazou export --since 2021-03-24 --until 2021-03-25
# or
gazou export --since 2021-03-24T12:00:00 --until 2021-03-24T13:00:00

# List all available fields for each tweet:
gazou export --list-fields

# Export only a selection of fields (-c / --columns or -s / --select the xsv way):
gazou export -c id,user_screen_name,local_time,links
# or for example to export only the text of the tweets:
gazou export --select text

# Exclude tweets collected via conversations or quotes (i.e. which do not match the keywords defined in config.json)
gazou export --exclude-threads

# Exclude retweets from the export
gazou export --exclude-retweets

# Export all tweets matching a specific ElasticSearch term query, for instance by user name:
gazou export '{"user_screen_name": "medialab_ScPo"}'

# Take a csv file with an "id" column and export only the tweets whose ids are included in this file:
gazou export --export-tweets-from-file list_of_ids.csv

# You can of course combine all of these options, for instance:
gazou export medialab --since 2021-03-24 --until 2021-03-25 -c text --exclude-threads --exclude-retweets -o medialab_tweets_210324_no_threads_no_rts.csv
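The --since / --until semantics listed above (inclusive start, exclusive end) correspond to a half-open time interval. A minimal Python sketch of that rule, using a hypothetical helper and naive datetimes (time zones are deliberately ignored here):

```python
from datetime import datetime


def in_export_window(tweet_time, since, until):
    """True if since <= tweet_time < until (half-open interval)."""
    start = datetime.fromisoformat(since)  # accepts dates and datetimes
    end = datetime.fromisoformat(until)
    return start <= tweet_time < end


# A tweet posted exactly at the --since bound is included...
print(in_export_window(datetime(2021, 3, 24), "2021-03-24", "2021-03-25"))
# ...while a tweet posted exactly at the --until bound is not.
print(in_export_window(datetime(2021, 3, 25), "2021-03-24", "2021-03-25"))
```

Because the end bound is exclusive, consecutive windows such as 2021-03-24 to 2021-03-25 and 2021-03-25 to 2021-03-26 never export the same tweet twice.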

Count collected tweets

The Gazouilloire query system is also available for the count command. For example, you can count the number of tweets that are retweets:

gazou count --lucene retweeted_id:*

You can also use the --step parameter to count the number of tweets per seconds/minutes/hours/days/months/years:

gazou count medialab --step months --since 2018-01-01 --until 2022-01-01

The result is written in CSV format.

Export/Import data dumps directly with ElasticSearch

In order to run and reimport backups, you can also export or import data by talking directly to ElasticSearch, using one of the many tools of its ecosystem built for this purpose.

We recommend using elasticdump, which requires Node.js to be installed:

# Install the package
npm install -g elasticdump

Then you can use it directly, or via our bundled script elasticdump.sh, to run simple exports/imports of your gazouilloire collection indices:

gazou scripts elasticdump.sh
# and to read its documentation:
gazou scripts --info elasticdump.sh

Advanced parameters

Many advanced settings can be used to better filter the tweets collected and complete the corpus. They can all be modified within the config.json file.

- keywords

Keyword syntax follows Twitter's search engine rules. You can build your queries by typing them in the website's search bar. You can input a single word, or several words separated by spaces (which will query for tweets matching all of those words). You can also write complex boolean queries such as (medialab OR (media lab)) (Sciences Po OR SciencesPo), but note that only the Search API will be used for these, not the Streaming API, resulting in less exhaustive results.

Some advanced filters can be used in combination with the keywords, such as -undesiredkeyword, filter:links, -filter:media, -filter:retweets, etc. See Twitter API's documentation for more details. Queries including these will also only run on the Search API and not the Streaming API.

When adding a Twitter user as a keyword, such as "@medialab_ScPo", Gazouilloire will query specifically "from:medialab_ScPo OR to:medialab_ScPo OR @medialab_ScPo" so that all tweets mentioning the user will also be collected.
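The expansion of a user keyword can be sketched as follows in Python. This only illustrates the rule stated above with a hypothetical helper; it is not Gazouilloire's actual implementation.

```python
def expand_user_keyword(keyword):
    """Expand "@user" into the from/to/mention query described above."""
    if not keyword.startswith("@"):
        # Regular keywords are left untouched in this sketch.
        return keyword
    user = keyword[1:]
    return "from:{0} OR to:{0} OR @{0}".format(user)


print(expand_user_keyword("@medialab_ScPo"))
```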

Using upper or lower case characters in keywords won't change anything.

You can leave accents in queries, as Twitter will automatically return both tweets with and without accents through the search API: for instance, searching "héros" will find both tweets with "heros" and "héros". The streaming API will only return exact results, but it mostly complements the search results.

Regarding hashtags, note that querying a word without the # character will return both tweets with the regular word and tweets with the hashtag. Adding a keyword with the # character will only collect tweets with the hashtag.

Note that there are three possibilities to filter further:

- language

To collect only tweets written in a specific language, add "language": "fr" to the config (the language must be given as an ISO 639-1 code).

- geolocation

Add a "geolocation": "Paris, France" field to the config with the desired geographical boundaries, or give the coordinates of the desired box (for instance [48.70908786918211, 2.1533203125, 49.00274483644453, 2.610626220703125]).
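To sanity-check a coordinate box before using it, one can test whether a known point falls inside it. The Python sketch below assumes the four values are ordered [south latitude, west longitude, north latitude, east longitude]; this ordering is an assumption inferred from the example values, not something stated by the documentation, and the helper name is hypothetical.

```python
# Example box from the text; ordering assumed to be
# [south_lat, west_lon, north_lat, east_lon] (an assumption).
PARIS_BOX = [48.70908786918211, 2.1533203125, 49.00274483644453, 2.610626220703125]


def point_in_box(lat, lon, box):
    """True if the (lat, lon) point lies inside the bounding box."""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


# Central Paris should fall inside the box, Lyon should not.
print(point_in_box(48.8566, 2.3522, PARIS_BOX))
print(point_in_box(45.7640, 4.8357, PARIS_BOX))
```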

- time_limited_keywords

Use this option to collect specific keywords only during planned time periods, for instance:

  "time_limited_keywords": {
      "#fosdem": [
          ["2021-01-27 04:30", "2021-01-28 23:30"]
      ]
  }
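The structure above maps each keyword to a list of [start, end] periods. As a sketch of how such a configuration can be interpreted, the following Python snippet lists the keywords active at a given moment. The helper name is hypothetical, both bounds are treated as inclusive (an assumption), and time zones are ignored.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # format used in the example above

time_limited_keywords = {
    "#fosdem": [
        ["2021-01-27 04:30", "2021-01-28 23:30"]
    ]
}


def active_keywords(config, now):
    """Return the keywords having at least one period containing `now`."""
    active = []
    for keyword, periods in config.items():
        for start, end in periods:
            begins = datetime.strptime(start, TIME_FORMAT)
            finishes = datetime.strptime(end, TIME_FORMAT)
            if begins <= now <= finishes:
                active.append(keyword)
                break  # one matching period is enough
    return active


print(active_keywords(time_limited_keywords, datetime(2021, 1, 27, 12, 0)))
print(active_keywords(time_limited_keywords, datetime(2021, 1, 29, 0, 0)))
```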

- url_pieces

To search for specific parts of websites, one can input pieces of urls as keywords in this field. For instance:

  "url_pieces": [
      "medialab.sciencespo.fr",
      "github.com/medialab"
  ]

- resolve_redirected_links

Set to true or false to enable or disable automatic resolution of all links found in tweets (t.co links are always handled, but this option also resolves links from all other shorteners such as bit.ly).

The resolving_delay option (set to 30 by default) defines for how many days urls returning errors will be retried before they are left unresolved.

- grab_conversations

Set to true to activate automatic recursive retrieval, within the corpus, of all tweets that collected tweets are replying to. Warning: account for the presence of these tweets when processing the data, as this often results in collecting tweets which do not contain the queried keywords and/or which were posted well outside the collection time period.

- catchup_past_week

Twitter's free API allows collecting tweets up to 7 days in the past, which gazouilloire does by default when starting a new corpus. Set this option to false to disable this and only collect tweets posted after the collection was started.

- download_media

Configure this option to activate automatic downloading, within media_directory, of photos and/or videos posted by users in the collected tweets (this does not include images from social cards). For instance, the following configuration will only download pictures, not videos or gifs:

  "download_media": {
      "photo": true,
      "video": false,
      "animated_gif": false,
      "media_directory": "path/to/media/directory"
  }

All fields can also be set to true to download everything. media_directory is the folder where Gazouilloire stores the images & videos. It should either be an absolute path ("/home/user/gazouilloire/my_collection/my_images"), or a path relative to the directory where config.json is located ("my_images").
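The path rule for media_directory (absolute paths kept as-is, relative paths interpreted from the directory containing config.json) can be illustrated with a short Python sketch. The helper is hypothetical and uses POSIX-style paths so that the example behaves the same on every platform; it is not Gazouilloire's own code.

```python
from pathlib import PurePosixPath


def resolve_media_directory(config_path, media_directory):
    """Resolve media_directory relative to the folder holding config.json."""
    media_dir = PurePosixPath(media_directory)
    if media_dir.is_absolute():
        # Absolute paths are used unchanged.
        return media_dir
    # Relative paths start from the directory containing config.json.
    return PurePosixPath(config_path).parent / media_dir


config = "/home/user/gazouilloire/my_collection/config.json"
print(resolve_media_directory(config, "my_images"))
print(resolve_media_directory(config, "/data/media"))
```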

- timezone

Adjust the timezone in which tweet timestamps should be computed. If an invalid value is set, Gazouilloire lists the allowed values at startup.

- verbose

When set to true, logs will be much more detailed regarding Gazouilloire's interactions with Twitter's API.

Daemon mode

For production use and long term data collection, Gazouilloire can run as a daemon (which means that it executes in the background, and you can safely close the window within which you started it).

Reset

Development

To install Gazouilloire's latest development version or to help develop it, clone the repository and install your local version using the setup.py file:

git clone https://github.com/medialab/gazouilloire
cd gazouilloire
python setup.py install

Gazouilloire's main code resides in gazouilloire/run.py, in which the whole multiprocess architecture is orchestrated. Below is a diagram of all processes and queues.

All three queues are backed up on filesystem in pile_***.json files to be reloaded at next restart whenever Gazouilloire is shut down.

[Diagram: Gazouilloire's processes and queues]

Troubleshooting

ElasticSearch

Publications

Gazouilloire presentations

Publications using Gazouilloire

Publications talking about Gazouilloire

Credits & License

Benjamin Ooghe-Tabanou, Béatrice Mazoyer, Jules Farjas & al @ Sciences Po médialab

Read more about Gazouilloire's migration from Python2 & Mongo to Python3 & ElasticSearch in Jules' report.

Discover more of our projects at médialab tools.

This work has been supported by DIME-Web, part of DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).

Gazouilloire is free, open source software released under the GPL 3.0 license.