mozilla / Bugzilla-ETL

ETL for feeding bug snapshots to an ElasticSearch cluster
Mozilla Public License 2.0


Extract Bugzilla change history; Transform into bug snapshots; and Load into Elasticsearch


If you are here because Mozilla's instance is down, please read the Operation Support Document.

Motivation and Details



Python and SetupTools are required. It is best to install on Linux, but if you do install on Windows, please follow the instructions to get these installed.
When done, installation is easy:

git clone

then install requirements:

cd Bugzilla-ETL
pip install -r requirements.txt

WARNING: pip install Bugzilla-ETL does not work. I have been unable to get pip to install resource files consistently across platforms and Python versions.

Installation with PyPy

PyPy will execute 4 to 5 times faster than CPython. PyPy maintains its own environment, and its own version of the module binaries. This means running SetupTools is just a little different. First clone the repository:

git clone

then install requirements with PyPy's version of pip:

cd Bugzilla-ETL
c:\PyPy27\bin\pip.exe install -r requirements.txt

The example above is for Windows; on Linux, use the equivalent path to PyPy's own pip.


You must prepare a settings.json file to reference the resources, and pass its filename as an argument on the command line. Examples of settings files can be found in resources/settings.

Inter-Run State

Bugzilla-ETL keeps local run state in the form of two files: first_run_time and last_run_time. These are both parameters in the settings.json file.

Alias Analysis

You will require an alias file that matches the various email addresses that users have over time. This analysis is necessary for proper CC list history and patch review history. More on alias analysis.


Assuming your settings.json file is in ~/Bugzilla_ETL:

cd ~/Bugzilla_ETL

pypy bugzilla_etl\ --settings=settings.json

Use --help for more options, and see the example command line script.

Got it working?

The initial ETL will take over two hours. If you want something quicker to confirm your configuration is correct, use the --reset --quick arguments on the command line. This will limit ETL to the first 1000 and last 1000 bugs.

cd ~/Bugzilla_ETL
pypy bugzilla_etl\  --settings=settings.json --reset --quick

Using Cron

Bugzilla-ETL is meant to be triggered by cron, usually every 10 minutes. Bugzilla-ETL limits itself to only one instance per settings.json file. That way, if more than one instance is accidentally run, the subsequent instances will do no work and shut down cleanly.
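
As a rough sketch of the schedule described above, the snippet below prints a crontab line that fires every 10 minutes. The wrapper script run_etl.sh, the cron.log file, and the ETL_HOME variable are assumptions made for illustration and are not part of Bugzilla-ETL; the wrapper is expected to contain the cd and pypy commands shown earlier.

```shell
#!/bin/sh
# Sketch only: run_etl.sh, cron.log and ETL_HOME are hypothetical names.
# ETL_HOME defaults to the directory used in the examples above.
ETL_HOME="${ETL_HOME:-$HOME/Bugzilla_ETL}"

# Print a crontab line that runs the wrapper every 10 minutes and
# appends stdout and stderr to a log file.
cron_line() {
    printf '*/10 * * * * /bin/sh %s/run_etl.sh >> %s/cron.log 2>&1\n' "$1" "$1"
}

cron_line "$ETL_HOME"
```

Paste the printed line into the editor opened by crontab -e. Because Bugzilla-ETL allows only one instance per settings.json file, a run that overlaps a slow previous run should exit without doing any work.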

Running Tests

The Git clone will include test code. You can run those tests, but you must...

python -m pip install virtualenv
cd ~/Bugzilla-ETL

python -m virtualenv .env
pip install -r requirements.txt
set PYTHONPATH=.;vendor

python -m unittest discover -v -s tests

Fixing tests

Test runs are compared to documents found in the reference files at tests/resources/reference. They may need updating after changing the code.

python -m unittest test_examples 

The output file is found in tests/results, and can replace the reference file. Be sure to review the git diff: it shows the change in the reference file, so you can confirm nothing went wrong.
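
The replacement step could be scripted roughly as follows. This is a minimal sketch, assuming the result file and the reference file share the same name and that the current directory is the repository root; promote_result is a hypothetical helper, not something shipped with the project.

```shell
#!/bin/sh
# Sketch only: assumes the result file and the reference file share the
# same name, and that the current directory is the repository root.
promote_result() {
    name="$1"
    src="tests/results/$name"
    dst="tests/resources/reference/$name"
    # Refuse to continue if the test run did not produce this file.
    if [ ! -f "$src" ]; then
        echo "missing result file: $src" >&2
        return 1
    fi
    # Overwrite the reference document with the new output.
    cp -- "$src" "$dst"
}
```

Call it with the name of the output file, then run git diff -- tests/resources/reference to review what changed.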


There may be enhancements from time to time. To get them:

cd ~/Bugzilla-ETL
git pull origin master
pip install -r requirements.txt

After upgrading the code, you may want to trigger a full ETL. To do this, you may either:

  1. run with the --reset flag directly, or
  2. remove the first_run_time file (and the next cron event will trigger a full ETL)
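
The second option can be sketched as a small shell helper. The function name force_full_etl is hypothetical, and the path you pass in is an assumption: use whatever location your settings.json gives for first_run_time.

```shell
#!/bin/sh
# Sketch only: force_full_etl is a hypothetical helper. Pass it the
# first_run_time path configured in your settings.json.
force_full_etl() {
    first_run_file="$1"
    if [ -e "$first_run_file" ]; then
        # With the file gone, the next cron event triggers a full ETL.
        rm -- "$first_run_file" && echo "removed $first_run_file"
    else
        echo "$first_run_file not found; nothing to remove"
    fi
}
```

For example, force_full_etl ~/Bugzilla_ETL/first_run_time would apply if the state file lives next to your settings file.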

Submitting Bugs

We use Bugzilla for tracking bugs. If you want to submit a bug or feature request, please add a dependency to BZ ETL Metabug.

More on ElasticSearch

If you are new to ElasticSearch, I recommend using ElasticSearch Head for getting cluster status, current schema definitions, viewing individual records, and more. Clone it from GitHub, and open the index.html file in your browser. Here are some alternate instructions.