
THIS VERSION IS DEPRECATED. GO TO https://github.com/SchwartzLabURI/SISRS/

SISRS

SISRS: Site Identification from Short Read Sequences
Version 1.6.2
Copyright (c) 2013-2016 Rachel Schwartz Rachel.Schwartz@asu.edu
https://github.com/rachelss/SISRS
More information: Schwartz, R.S., K.M. Harkins, A.C. Stone, and R.A. Cartwright. 2015. A composite genome approach to identify phylogenetically informative data from next-generation sequencing. BMC Bioinformatics. 16:193. (http://www.biomedcentral.com/1471-2105/16/193/)

Talk from Evolution 2014 describing SISRS and its application:
https://www.youtube.com/watch?v=0OMPuWc-J2E&list=UUq2cZF2DnfvIUVg4tyRH5Ng

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.

Requirements

Input

Next-generation sequencing data, such as Illumina HiSeq reads. Data must be sorted into folders by taxon (e.g. species or genus). Paired reads in fastq format must be marked by _R1 and _R2 in their (otherwise identical) filenames. Both paired and unpaired read files must have a fastq file extension.
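The layout rules above can be checked before starting a run. The following is a minimal sketch, not part of SISRS: the function name check_taxon_folder is hypothetical, and it assumes the extension is literally ".fastq" and that any file without _R1 or _R2 in its name is an unpaired read file.

```python
from pathlib import Path
import tempfile


def check_taxon_folder(folder):
    """Return a list of pairing problems found in one taxon folder.

    Assumption: read files end in ".fastq"; files with neither _R1 nor _R2
    in the name are treated as unpaired and are not checked further.
    """
    names = sorted(p.name for p in Path(folder).iterdir() if p.is_file())
    fastq = [n for n in names if n.endswith(".fastq")]
    present = set(fastq)
    problems = []
    for name in fastq:
        if "_R1" in name:
            mate = name.replace("_R1", "_R2", 1)
        elif "_R2" in name:
            mate = name.replace("_R2", "_R1", 1)
        else:
            continue  # unpaired read file, nothing to match
        if mate not in present:
            problems.append(f"{name}: missing mate {mate}")
    return problems


# Small demonstration on a throwaway folder with one complete pair,
# one orphaned _R1 file, and one unpaired file.
demo = Path(tempfile.mkdtemp())
for filename in ["a_R1.fastq", "a_R2.fastq", "b_R1.fastq", "c.fastq"]:
    (demo / filename).touch()
print(check_taxon_folder(demo))
```

Running the check once per taxon folder catches a misnamed mate file before any reads are processed.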

Running SISRS

Usage:

sisrs command options

By default, SISRS assumes that

Commands:

sites: produce an alignment of sites from raw reads

loci: produce a set of aligned loci based on the most variable regions of the composite genome

Subcommands of sites:

subSample: run sisrs subsampling scheme, subsampling reads from all taxa to ~10X coverage across species, relative to user-specified genome size

buildContigs: given subsampled reads, run sisrs composite genome assembly with user-specified assembler

alignContigs: align reads to composite genome as single-ended, uniquely mapped

mapContigs: align composite genome reads to a reference genome (optional)

identifyFixedSites: find sites with no within-taxa variation

outputAlignment: output alignment file of sisrs sites

changeMissing: given alignment of sites (alignment.nex), output a file with only sites missing fewer than a specified number of samples per site
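To give a feel for the subSample target described above, here is a rough arithmetic sketch. It assumes that "~10X coverage across species" means the pooled reads from all taxa together cover the genome about 10 times, with each taxon contributing an equal share; that reading, the function names, and the 100 bp read length are assumptions for illustration, not a description of the SISRS implementation.

```python
import math


def bases_per_taxon(genome_size, n_taxa, total_coverage=10):
    """Bases to draw from each taxon so that all taxa together reach
    roughly total_coverage times the genome size (equal shares assumed)."""
    return total_coverage * genome_size / n_taxa


def reads_per_taxon(genome_size, n_taxa, read_length, total_coverage=10):
    """Number of reads of a fixed length needed to supply that many bases."""
    return math.ceil(bases_per_taxon(genome_size, n_taxa, total_coverage) / read_length)


# Genome size taken from the sample commands in this README; 10 taxa as in
# the test data; a read length of 100 bp is an assumed value.
print(bases_per_taxon(1745690, 10))
print(reads_per_taxon(1745690, 10, 100))
```

Under these assumptions, adding more taxa lowers the number of reads drawn from each one, since the coverage target applies to the pooled data.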

Option Flags:

Output

Nexus file with variable sites in a single alignment. Usable in most major phylogenetics software as a concatenated alignment with a setting for variable-sites-only.
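A sites alignment like this is often filtered by missing data, which is what the changeMissing subcommand does. The sketch below illustrates the idea on an in-memory alignment; it is not the SISRS code. The function name, the dict-of-strings representation, and the set of characters counted as missing ("N", "?", "-") are assumptions, and the strict "fewer than" comparison follows the wording of this README.

```python
def filter_sites_by_missing(alignment, max_missing, missing_chars="N?-"):
    """Keep only columns where fewer than max_missing samples are missing.

    alignment: dict mapping sample name to a sequence string; all sequences
    must have the same length.
    """
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")
    keep = [
        i
        for i in range(length)
        if sum(alignment[t][i] in missing_chars for t in taxa) < max_missing
    ]
    return {t: "".join(alignment[t][i] for i in keep) for t in taxa}


# Columns 2 and 3 are each missing in two of the three samples.
toy = {"A": "ACGT", "B": "AN?T", "C": "A-NT"}
print(filter_sites_by_missing(toy, max_missing=2))
```

Raising max_missing keeps more sites at the cost of a sparser alignment; lowering it does the reverse.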

Test Data

The folder test_data (https://github.com/rachelss/SISRS_test_data) contains simulated data for 10 species on the tree found in simtree.tre. Using 40 processors, this run took 9 minutes. Analysis of the alignment output by SISRS using RAxML produced the correct tree.

Sample commands

  1. Basic sisrs run: start with fastq files and produce an alignment of variable sites
    sisrs sites -g 1745690
  2. Basic sisrs run with modifications
    sisrs sites -g 1745690 -p 40 -m 4 -f /usr/test_data -z /usr/output_data -t .99 -a minia
  3. Run only sisrs read subsampling step
    sisrs subSample -g 1745690 -f /usr/test_data -c 0
  4. Produce an alignment of loci based on the most variable loci in your basic sisrs run. Note: this command will run sisrs sites if (and only if) it was not run previously.
    sisrs loci -g 1745690 -p 40 -l 2 -f /usr/test_data           # Will run sites first, then loci
    sisrs loci -g 1745690 -p 40 -l 2 -f /usr/SISRS_sites_ouput   # Will run loci from previous sites data
  5. Get loci from your fastq files given known loci.

    First, name your reference loci file ref_genes.fa and put it in your main folder.

    sisrs loci -p 40 -f /usr/test_data