therealcooperpark / hero

Highways Enumerated by Recombination Observations
MIT License

HERO

Highways Enumerated by Recombination Observations

HERO is a pipeline, written in Python, designed to parse and visualize highways of genome-wide homologous recombination between user-defined metadata groups using the output of the recombination detection tool fastGEAR. Currently, it works in 3 stages:

1) For each recombination event, HERO compares the sequence similarity between the recombined DNA sequence and a pool of potential donor genomes (determined by the clustering algorithm found in fastGEAR) to identify the most likely donor metadata-group (defined by the user).

2) HERO calculates which recombining metadata-group pairs (a donor and its recipient) have statistically high rates of recombination within the measured population. A standard highway is defined using a strict interquartile fence equal to Q3 + 3*IQR, where IQR is the interquartile range and Q3 is the third quartile of the number of events per pair.

3) HERO generates a number of files detailing specific information on recombination events and metadata-group pairs, as well as several files for visualizing the network of recombination using Circos.
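
The fence described in stage 2 can be illustrated with a short sketch. This is not HERO's own code: the pair counts are made up, the quartile method (`method="inclusive"`) is an assumption, and treating a pair as a highway only when its count is strictly greater than the fence is also an assumption.

```python
import statistics


def highway_fence(counts):
    """Return Q3 + 3*IQR for a list of per-pair event counts."""
    # Quartile conventions differ between tools; "inclusive" is an assumption.
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return q3 + 3 * (q3 - q1)


# Hypothetical event counts per (donor group, recipient group) pair.
pair_counts = {
    ("A", "B"): 4,
    ("B", "A"): 6,
    ("A", "C"): 5,
    ("C", "A"): 7,
    ("B", "C"): 60,
}

fence = highway_fence(list(pair_counts.values()))

# Assumption: a pair counts as a highway when it strictly exceeds the fence.
highways = {pair: n for pair, n in pair_counts.items() if n > fence}
```

With these example counts, Q1 is 5 and Q3 is 7, so the fence is 7 + 3*2 = 13 and only the B-to-C pair exceeds it.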

Sidekick

Because fastGEAR is most effective at predicting recombination on a gene-by-gene basis (as opposed to a concatenated gene alignment), a standard use of fastGEAR involves generating a pan-genome of the population using a program such as Roary and then running fastGEAR on each gene alignment. To accommodate this workflow, we also provide sidekick.py to pre-process the individual gene alignments generated by Roary (using the '-z' parameter) into files that can be used by HERO. Briefly, sidekick will:

1) Use the original GFF files provided to Roary to replace the gene-specific headers in the Roary gene alignments with proper genome IDs to unify fastGEAR results across genes.

2) Iteratively run fastGEAR on each modified gene alignment with optional multithreading for speed.

3) Prepare the primary input file necessary for HERO to process the new fastGEAR data.

INSTALLATION (TESTED ON UBUNTU 20.04.1, should work on most Linux Distributions)

HERO (and Sidekick) has the following dependencies:

Download command:

git clone https://github.com/therealcooperpark/hero.git

USAGE

Tutorial

Explore a real-world walkthrough of a typical HERO workflow, including pre-made output files at each step of the way: https://github.com/therealcooperpark/hero_example

hero.py

usage: hero.py --hero_table [table] --groups [groups_file] [options]

HERO - Highways Elucidated by Recombination Observations

optional arguments:
  -h, --help                  show this help message and exit
  --hero_table HERO_TABLE     HERO input table
  --groups GROUPS             Tab-deliminated file with genomes in 1st column and groups in 2nd
  -o OUTDIR, --outdir OUTDIR  Output directory [hero_results]
  -c CPUS, --cpus CPUS        CPUs to use [1]
  -l LENGTH, --length LENGTH  Minimum length required to process recomb event [0]
  -b BAYES, --bayes BAYES     Minimum bayes factor required to process recomb event [10]
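
Before a run, it can help to sanity-check the --groups file. The sketch below is not part of HERO; the function names `load_groups` and `group_sizes` are hypothetical. It assumes only what the help text states: a tab-delimited file with genome IDs in the first column and group labels in the second.

```python
import csv
from collections import Counter


def load_groups(path):
    """Map genome ID -> group from a two-column, tab-delimited file."""
    groups = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or not row[0].strip():
                continue  # skip blank lines
            if len(row) < 2 or not row[1].strip():
                raise ValueError(f"Missing group for genome {row[0]!r}")
            genome, group = row[0].strip(), row[1].strip()
            if genome in groups and groups[genome] != group:
                raise ValueError(f"Genome {genome!r} assigned to two groups")
            groups[genome] = group
    return groups


def group_sizes(groups):
    """Count how many genomes fall into each group."""
    return Counter(groups.values())
```

Calling `group_sizes(load_groups("groups.tsv"))` (with a hypothetical file name) gives a quick view of how many genomes each metadata group contains.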

The format for the --hero_table file is:

gene1 path/to/fasta  path/to/fastgear
gene2 path/to/fasta2 path/to/fastgear2
...

for every gene that you want to measure recombination from. path/to/fasta is the path to the single gene alignment file and path/to/fastgear is the path to a directory containing the output from the fastGEAR run for that alignment file.

This table will automatically be created as one output from sidekick.py.
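
If the table is written by hand instead, a quick check along these lines may catch broken paths early. This is a minimal sketch, not part of HERO: `check_hero_table` is a hypothetical helper, and because the column delimiter is not stated above, it splits on any whitespace (so paths containing spaces are not supported).

```python
from pathlib import Path


def check_hero_table(table_path):
    """Return a list of problems found in a three-column HERO input table."""
    problems = []
    seen = set()
    lines = Path(table_path).read_text().splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 3:
            problems.append(f"line {lineno}: expected 3 columns, found {len(fields)}")
            continue
        gene, fasta, fastgear_dir = fields
        if gene in seen:
            problems.append(f"line {lineno}: duplicate gene name {gene!r}")
        seen.add(gene)
        if not Path(fasta).is_file():
            problems.append(f"line {lineno}: alignment file not found: {fasta}")
        if not Path(fastgear_dir).is_dir():
            problems.append(f"line {lineno}: fastGEAR directory not found: {fastgear_dir}")
    return problems
```

An empty list means every row has three columns, a unique gene name, an existing alignment file, and an existing fastGEAR output directory.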

sidekick.py

usage: sidekick.py [options] gff_table

Use before HERO to convert Roary FASTA alignment headers to genome names

positional arguments:
  gff_table              Tab-delimited file of GFF file location and associated genome for renaming

optional arguments:
  -h, --help             show this help message and exit
  --alns ALNS            Filepath to Roary pan_genome_sequences directory (requires -z argument) [./pan_genome_sequences]
  --output OUTPUT        Output directory name [sidekick_genes]
  --cpus CPUS            Number of CPUS to use [1]

fastGEAR:
  --fastgear FASTGEAR    Filepath to "run_fastGEAR.sh" script provided by fastGEAR. Must be used with --mcr
  --mcr MCR              Filepath to MCR executable. Must be used if using --fastgear
  --fgout FGOUT          Output directory name for fastgear runs [fastgear_genes]
  --iters ITERS          Number of iterations [15]
  --bounds BOUNDS        Upper bound for number of clusters [10]
  --partition PARTITION  File containing a partition for strains [NA]
  --fg_output FG_OUTPUT  1=reduced output, 0=complete output

The format for the gff_table file is:

genome1_gff_filepath  genome_name1
genome2_gff_filepath  genome_name2
...

for every genome in the original Roary pan-genome.
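
When all GFF files sit in one directory, the gff_table can be generated rather than typed. The sketch below is an illustration only: `write_gff_table` is a hypothetical helper, the `.gff` extension is assumed, and using each file name (minus extension) as the genome name is an assumption that you should adapt to your own naming scheme.

```python
from pathlib import Path


def write_gff_table(gff_dir, table_path="gff_table.tsv"):
    """Write a tab-delimited table: GFF file path, then genome name."""
    gff_files = sorted(Path(gff_dir).glob("*.gff"))
    with open(table_path, "w") as out:
        for gff in gff_files:
            # Assumption: the genome name is the file name without ".gff".
            out.write(f"{gff.resolve()}\t{gff.stem}\n")
    return len(gff_files)
```

The function returns the number of rows written, which should match the number of genomes given to Roary.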

OUTPUT

summary_stats.txt

Basic information about amount of recombination detected.

recombination_events.txt

Tab-delimited table of each recombination event including donor group, recipient group, start/end position of event, gene name, and a list of recipient genomes with evidence for the event.

recombination_pairs.txt

Tab-delimited table of all unique metadata group pairs and the number of recombination events between them.

fragment_sizes.svg/txt

A histogram and table of the fragment size of each event.

gene_counts.svg/txt

A histogram and table of the number of recombination events per gene.

recipient_counts.svg/txt

A histogram and table of the number of recombination events per recipient genome.

circos.png/svg

PNG and SVG formatted circos networks visualizing the network of recombination measured by HERO. See below for details on interpreting the figure.

highway_circos.png/svg

PNG and SVG formatted circos networks highlighting highways of recombination detected by HERO.

circos.conf/circos_karyotype.txt/circos_links.txt/highway_circos.conf

Configuration files created by HERO to create the circos plots.

QUESTIONS

Please submit suggestions and bug reports to the Issue Tracker.

CITATION

If you use this program, please cite: Park C, Andam C. HERO Github https://github.com/therealcooperpark/hero