nextstrain / sacra

Cleaning scripts for real-time pathogen analysis

Sacra

Sacra: a data cleaning tool designed for genomic epidemiology datasets.

Sacra is used primarily within Nextstrain and replaces functionality previously found in nextstrain/fauna. It is under development and not yet production ready.

The general idea is to take possibly messy data of varying input types (FASTA, CSV, JSON, accession numbers, titer tables), then collect, clean, and merge it into a JSON output. Sacra is idempotent, i.e. sacra(sacra(file)) == sacra(file): running Sacra on its own output changes nothing. Uploading to a database is not part of Sacra (see nextstrain/flora).
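To illustrate what idempotence means for a cleaning step, here is a minimal sketch. It is not Sacra's code: the functions `clean_strain_name` and `clean_record`, the field names, and the sample record are all hypothetical, chosen only to show that cleaning already-clean data returns it unchanged.

```python
import re


def clean_strain_name(name: str) -> str:
    """Hypothetical cleaning rule: trim the ends and turn whitespace runs into underscores."""
    name = name.strip()
    return re.sub(r"\s+", "_", name)


def clean_record(record: dict) -> dict:
    """Hypothetical record cleaner; returns a new dict and leaves the input untouched."""
    cleaned = dict(record)
    cleaned["strain"] = clean_strain_name(record["strain"])
    cleaned["country"] = record["country"].strip().lower()
    return cleaned


# Made-up messy input, not taken from any real dataset
raw = {"strain": "  A/New York/1/2016 ", "country": " USA"}

once = clean_record(raw)
twice = clean_record(once)

# Idempotence: a second pass over cleaned data is a no-op
assert once == twice
print(once)
```

Each rule here maps a clean value to itself, which is what makes the composed function idempotent. A rule such as "append a suffix" would break that property, because every pass would change the value again.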

Requirements

Input file types

How To Run

Command line syntax

Prior to running:

Running on a FASTA or JSON:

optional arguments:
  -h, --help            show this help message and exit
  --debug               Enable debugging logging
  --files [FILES [FILES ...]]
                        file types: text (list of accessions), FASTA, (to do)
                        FASTA + CSV, (to do) JSON
  --pathogen PATHOGEN   This sets the config file
  --accession_list [ACCESSION_LIST [ACCESSION_LIST ...]]
                        list of strings to query genbank with
  --outfile OUTFILE
  --visualize_call_graph
                        draw a graph of calls being made
  --call_graph_fname CALL_GRAPH_FNAME
                        filename for call graph

entrez:
  --skip_entrez         Query genbank for all accessions to help clean /
                        correct metadata data

overwrites:
  --overwrite_fasta_header OVERWRITE_FASTA_HEADER
                        Overwrite the config-defined FASTA header
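The option listing above is consistent with an argparse parser along the following lines. This is a reconstruction from the help text only, not Sacra's source: the choice of `store_true` for the flags without a metavar, the empty-list defaults, and the example file name `example.fasta` and pathogen name `zika` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser matching the help listing; help strings are copied from it."""
    parser = argparse.ArgumentParser()
    # Flags shown without a metavar are assumed to be simple on/off switches
    parser.add_argument("--debug", action="store_true",
                        help="Enable debugging logging")
    parser.add_argument("--files", nargs="*", default=[],
                        help="file types: text (list of accessions), FASTA, "
                             "(to do) FASTA + CSV, (to do) JSON")
    parser.add_argument("--pathogen", help="This sets the config file")
    parser.add_argument("--accession_list", nargs="*", default=[],
                        help="list of strings to query genbank with")
    parser.add_argument("--outfile")
    parser.add_argument("--visualize_call_graph", action="store_true",
                        help="draw a graph of calls being made")
    parser.add_argument("--call_graph_fname", help="filename for call graph")

    # The listing shows two named groups after the general options
    entrez = parser.add_argument_group("entrez")
    entrez.add_argument("--skip_entrez", action="store_true",
                        help="Query genbank for all accessions to help clean / "
                             "correct metadata data")

    overwrites = parser.add_argument_group("overwrites")
    overwrites.add_argument("--overwrite_fasta_header",
                            help="Overwrite the config-defined FASTA header")
    return parser


# Hypothetical invocation; the file and pathogen names are placeholders
args = build_parser().parse_args(
    ["--files", "example.fasta", "--pathogen", "zika", "--skip_entrez"]
)
print(args.files, args.pathogen, args.skip_entrez)
```

With `nargs="*"`, both `--files` and `--accession_list` accept zero or more values, which matches the `[FILES [FILES ...]]` form shown in the listing.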

Adding new pathogens

To run Sacra on a pathogen that is not currently supported, or to change how Sacra behaves for a supported pathogen, create or edit a <pathogen_name>.py file in the sacra/configs directory.
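As a rough sketch of how a per-pathogen config could drive FASTA header parsing, consider the following. Only the sacra/configs/<pathogen_name>.py location and the idea of a config-defined FASTA header come from this document. The dictionary layout, the `fasta_headers` key, the `|` separator, the helper names, and the pathogen name `zika` are assumptions made for illustration and may not match the real config format.

```python
from pathlib import Path

# Directory named in the text above
CONFIG_DIR = Path("sacra") / "configs"


def config_path(pathogen: str) -> Path:
    """Return where the config for a pathogen would live: sacra/configs/<pathogen_name>.py."""
    return CONFIG_DIR / f"{pathogen}.py"


# Hypothetical contents of a config file; the real structure may differ
example_config = {
    "pathogen": "zika",
    "fasta_headers": ["strain", "accession", "collection_date", "country"],
}


def parse_fasta_header(header: str, fields: list, sep: str = "|") -> dict:
    """Split a FASTA header line into named fields, in the order the config lists them."""
    values = header.lstrip(">").strip().split(sep)
    if len(values) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, found {len(values)}: {header!r}"
        )
    return dict(zip(fields, values))


print(config_path("zika").as_posix())

# Made-up header, not a real sequence record
record = parse_fasta_header(
    ">strainA|ACC0001|2016-01-01|brazil", example_config["fasta_headers"]
)
print(record)
```

Keeping the field order in the config means an option like --overwrite_fasta_header only needs to swap in a different field list, leaving the parsing code untouched.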

Supported pathogens:

How To Run (testing)