nextstrain / forecasts-ncov

SARS-CoV-2 variant growth rates and frequency forecasts
https://nextstrain.org/sars-cov-2/forecasts/

Forecasts SARS-CoV-2

WARNING: This is an alpha release. The output file format and address may change at any time.

This repo forms the basis of our continually-updated modelling of SARS-CoV-2 variant frequencies. Broadly speaking, the moving pieces in this repo are:

Automated pipeline

The automated pipeline runs daily, driven by scheduled jobs and by triggers from upstream data ingests. We use GitHub Actions to schedule these jobs, often with one job triggering another upon completion.

Inputs

See available counts files for the input case counts and clade counts files.

Outputs

The model results for GISAID data are stored at s3://nextstrain-data/files/workflows/forecasts-ncov/gisaid. The model results for open (GenBank) data are stored at s3://nextstrain-data/files/workflows/forecasts-ncov/open.

The latest results are stored as latest_results.json and previously uploaded results can be found as <YYYY-MM-DD>_results.json.
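The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the mapping from the `nextstrain-data` bucket to `https://data.nextstrain.org/` is inferred from the addresses listed in this README rather than from any documented guarantee.

```python
from datetime import date
from typing import Optional

# Bucket-to-host mapping inferred from the addresses in this README (assumption).
S3_PREFIX = "s3://nextstrain-data/"
HTTPS_PREFIX = "https://data.nextstrain.org/"


def results_filename(run_date: Optional[date] = None) -> str:
    """Return the file name of the latest results or of a dated upload."""
    if run_date is None:
        return "latest_results.json"
    # date.isoformat() yields YYYY-MM-DD, matching <YYYY-MM-DD>_results.json.
    return f"{run_date.isoformat()}_results.json"


def s3_to_https(s3_uri: str) -> str:
    """Translate an s3://nextstrain-data/... URI into an HTTPS address."""
    if not s3_uri.startswith(S3_PREFIX):
        raise ValueError(f"not a nextstrain-data URI: {s3_uri}")
    return HTTPS_PREFIX + s3_uri[len(S3_PREFIX):]
```

For example, `results_filename(date(2023, 4, 4))` gives `2023-04-04_results.json`.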

Summary of Available files

| Data Provenance | Variant Classification | Geographic Resolution | Model | Address |
| --- | --- | --- | --- | --- |
| GISAID | Nextstrain clades | Global | MLR | https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/nextstrain_clades/global/mlr/latest_results.json |
| GISAID | Pango lineages | Global | MLR | https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global/mlr/latest_results.json |
| open (GenBank) | Nextstrain clades | Global | MLR | https://data.nextstrain.org/files/workflows/forecasts-ncov/open/nextstrain_clades/global/mlr/latest_results.json |
| open (GenBank) | Pango lineages | Global | MLR | https://data.nextstrain.org/files/workflows/forecasts-ncov/open/pango_lineages/global/mlr/latest_results.json |
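The addresses above all follow one pattern, so a results file can be located and downloaded with a short helper. The sketch below is hedged: `results_url` and `fetch_results` are hypothetical names, and nothing is assumed about the JSON schema because the output format may change. The gzip branch is a defensive assumption in case the server sends compressed content.

```python
import gzip
import json
from urllib.request import urlopen

BASE_URL = "https://data.nextstrain.org/files/workflows/forecasts-ncov"


def results_url(data_provenance, variant_classification,
                geo_resolution="global", model="mlr",
                filename="latest_results.json"):
    """Build an address following the pattern of the listed files."""
    return "/".join([BASE_URL, data_provenance, variant_classification,
                     geo_resolution, model, filename])


def fetch_results(url, timeout=30):
    """Download and parse one results file (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        payload = response.read()
        # urllib does not transparently decompress gzip-encoded responses.
        if response.headers.get("Content-Encoding") == "gzip":
            payload = gzip.decompress(payload)
    return json.loads(payload)
```

Calling `fetch_results(results_url("open", "pango_lineages"))` and inspecting the returned object is a reasonable first step before relying on any particular field.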

Installation

Please follow installation instructions for Nextstrain's software tools.

Usage

To run the pipeline for all available data generated by ingest:

nextstrain build .

To run the pipeline for a specific data provenance, variant classification, and geographic resolution (e.g. gisaid, nextstrain_clades, and global only):

nextstrain build . --configfile config/config.yaml --config data_provenances=gisaid variant_classifications=nextstrain_clades geo_resolutions=global

Optional uploads

To run the pipeline that uploads the model results to S3 and sends Slack notifications:

nextstrain build . --configfile config/config.yaml config/optional.yaml

OR

Run the GitHub Action workflow named "Run models" to run the pipeline on AWS Batch.

Configuration

The data_provenances, variant_classifications and geo_resolutions are required configs for the pipeline.

The currently available options for data_provenances are gisaid and open.

The currently available options for variant_classifications are nextstrain_clades and pango_lineages.

The only currently available option for geo_resolutions is global.
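As an illustration of how the three required configs fit together, the sketch below validates a selection and assembles the matching `nextstrain build` command line. The `ALLOWED` values are an assumption taken from the path components of the published result addresses, and `build_command` is a hypothetical helper, not part of the pipeline.

```python
import shlex

# Assumed option values, read off the published result addresses.
ALLOWED = {
    "data_provenances": {"gisaid", "open"},
    "variant_classifications": {"nextstrain_clades", "pango_lineages"},
    "geo_resolutions": {"global"},
}


def build_command(**selection):
    """Return a `nextstrain build` command restricted to one selection."""
    overrides = []
    for key in ALLOWED:  # dict order keeps the key order stable
        if key not in selection:
            raise ValueError(f"missing required config: {key}")
        value = selection[key]
        if value not in ALLOWED[key]:
            raise ValueError(f"unsupported {key}: {value}")
        overrides.append(f"{key}={value}")
    argv = ["nextstrain", "build", ".",
            "--configfile", "config/config.yaml",
            "--config", *overrides]
    return shlex.join(argv)


command = build_command(
    data_provenances="open",
    variant_classifications="pango_lineages",
    geo_resolutions="global",
)
```

Failing early on an unknown value is a design choice of this sketch; it avoids starting a long build with a mistyped option.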

Data Prep Configurations

The prepare_data params in config/config.yaml are used to subset the full case counts and clade counts data to a specific date range and to specific locations and clades.
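The kind of subsetting described here can be pictured with a small, self-contained filter. This is a sketch only: the column names (`date`, `location`, `clade`, `sequences`), the parameter names, and the example rows are all made up for illustration and are not taken from the pipeline's actual files or from the prepare_data params.

```python
from datetime import date


def subset_counts(rows, min_date=None, max_date=None,
                  locations=None, clades=None):
    """Keep rows within the date range and in the given locations/clades.

    A filter left as None is not applied.
    """
    kept = []
    for row in rows:
        day = date.fromisoformat(row["date"])
        if min_date is not None and day < min_date:
            continue
        if max_date is not None and day > max_date:
            continue
        if locations is not None and row["location"] not in locations:
            continue
        if clades is not None and row["clade"] not in clades:
            continue
        kept.append(row)
    return kept


# Made-up rows, for illustration only.
EXAMPLE_ROWS = [
    {"date": "2023-01-15", "location": "USA", "clade": "22B", "sequences": 40},
    {"date": "2023-02-10", "location": "USA", "clade": "23A", "sequences": 25},
    {"date": "2023-02-12", "location": "Japan", "clade": "23A", "sequences": 10},
]

recent_usa = subset_counts(EXAMPLE_ROWS, min_date=date(2023, 2, 1),
                           locations={"USA"})
```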

As of 2023-04-04, the config for the automated pipeline is set to only include data from:

Model configurations

The specific model configurations are housed in separate config YAML files, one for each model. These separate config files must be provided in the main config as mlr_config and renewal_config in order to run the models. By default, the model config files used are config/mlr-config.yaml and config/renewal-config.yaml. Note that the inputs and outputs for the models are overridden in the Snakemake pipeline to conform to the Snakemake input/output framework.
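A quick pre-flight check for these two keys might look like the sketch below. It assumes the main config has already been loaded into a plain dictionary; `find_model_config_problems` is a hypothetical helper and not something the pipeline provides.

```python
from pathlib import Path

MODEL_CONFIG_KEYS = ("mlr_config", "renewal_config")


def find_model_config_problems(main_config, repo_root="."):
    """Return human-readable problems with the model config entries."""
    problems = []
    for key in MODEL_CONFIG_KEYS:
        path = main_config.get(key)
        if not path:
            problems.append(f"{key} is not set in the main config")
        elif not (Path(repo_root) / path).is_file():
            problems.append(f"{key} points to a missing file: {path}")
    return problems


# Entries mirroring the default file locations named above.
default_entries = {
    "mlr_config": "config/mlr-config.yaml",
    "renewal_config": "config/renewal-config.yaml",
}
```

Run from the repository root, `find_model_config_problems(default_entries)` returns an empty list when both files exist.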

Clade and Lineage colours

Model JSONs are post-processed by ./scripts/modify-lineage-colours-and-order.py. For nextstrain_clades, this sets the colours and display names. For pango_lineages, this orders lineages based on their full (unaliased) Pango designation and sets colours based on the associated Nextstrain clade.

When new clades are added please modify the CLADES definitions in the script accordingly.
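To make "ordering by full (unaliased) Pango designation" concrete, here is a minimal sketch. It is not the code from the script itself: the function names are hypothetical, and `ALIASES` holds only two entries of the much larger Pango alias table. Each lineage is expanded to its full name and then compared numerically, component by component.

```python
# Tiny illustrative subset of the Pango alias table (assumption: the real
# script uses a complete, up-to-date table).
ALIASES = {
    "BA": "B.1.1.529",
    "BQ": "B.1.1.529.5.3.1.1.1.1",
}


def unalias(lineage):
    """Expand the leading alias of a lineage name, if it has one."""
    prefix, _, rest = lineage.partition(".")
    full_prefix = ALIASES.get(prefix, prefix)
    return f"{full_prefix}.{rest}" if rest else full_prefix


def sort_key(lineage):
    """Compare numeric components as integers, letters as strings."""
    key = []
    for part in unalias(lineage).split("."):
        if part.isdigit():
            key.append((0, int(part), ""))
        else:
            key.append((1, 0, part))
    return key


def order_lineages(lineages):
    return sorted(lineages, key=sort_key)


ordered = order_lineages(["BQ.1", "BA.2.12", "BA.5", "BA.2.3", "B.1.1.529"])
# ordered == ["B.1.1.529", "BA.2.3", "BA.2.12", "BA.5", "BQ.1"]
```

Sorting on the expanded name places BQ.1 after BA.5, since BQ.1 descends from BA.5, and comparing components as integers places BA.2.3 before BA.2.12, which a plain string sort would reverse.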

Environment variables

No environment variables are required for open data. However, the following environment variables are required for the gisaid data:

Uploads

If running the pipeline with uploads to S3, the following environment variables are required (regardless of data provenance):

Slack notifications

If running the pipeline with Slack notifications, the following environment variables are required:
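Before starting a run, it can help to confirm that the required environment variables are set. The sketch below is generic: `missing_env_vars` is a hypothetical helper, and the names in the example are placeholders rather than the pipeline's actual variable names.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Placeholder names and a fake environment, for illustration only.
example_missing = missing_env_vars(
    ["EXAMPLE_TOKEN", "EXAMPLE_CHANNEL"],
    environ={"EXAMPLE_TOKEN": "abc123"},
)
```

Passing the environment as an argument keeps the check easy to test; in normal use it falls back to `os.environ`.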