This project provides a program with a command-line interface for detecting microsatellite instability by next-generation sequencing. This project is free for use by academic users. Please see the enclosed license agreement for details and conditions of use.
Citation:
^^^^^^^^^

Salipante SJ, Scroggins SM, Hampel HL, Turner EH, Pritchard CC. 2014. Microsatellite Instability Detection by Next Generation Sequencing. Clin Chem. 2014 Sep;60(9):1192-9. doi: 10.1373/clinchem.2014.223677. Epub 2014 Jun 30.
Authors:
^^^^^^^^

Dependencies:
^^^^^^^^^^^^^
Please ensure the dependencies are installed on the system. This pipeline is designed to be run inside its own virtual environment (virtualenv), which contains all required dependencies. The following commands will allow you to establish this virtualenv.
To install the software to a virtual environment in your current directory, run the following commands:
git clone https://bitbucket.org/uwlabmed/msings.git
cd msings
bash dev/bootstrap.sh
To use the virtualenv, including the programs (samtools, msi), run:
source msings-env/bin/activate
To install the software (including creating a virtualenv) to a path of your choosing, run the following commands:
git clone https://bitbucket.org/uwlabmed/msings.git
cd msings
bash dev/bootstrap.sh /desired/virtual/path
NOTE: Downloading the zipped file through this website will cause installation problems and is not supported. Please CLONE the repo.
The devel.sh script builds a local virtualenv and downloads test data (if run from UW):
git clone https://bitbucket.org/uwlabmed/msings.git
cd msings
bash dev/devel.sh
If run outside of the UW network, it will not download test data.
bam file : sample of interest aligned against the reference genome, provided in BAM format. An index is required.
ref_fasta : FASTA file the alignment was created with; in other words, your reference genome. It must be indexed for use with samtools (see below). Please note that both your reference genome and bed files MUST follow the convention that chromosomes are named numerically or with "X" or "Y". Other conventions (such as those bearing a "chr" prefix) are not supported.
msi_bed : MSI bed file (see example under "doc/mSINGS_TCGA.bed"), which specifies the locations of the microsatellite tracts of interest. To be compatible with the genomic coordinate system used by samtools mpileup, the coordinates of each microsatellite tract in the bed file should start at the position before the first base in the tract. NOTE: the file must be sorted numerically, must not have a header line (see below), must be 0-indexed, and must follow the same chromosome naming conventions as the reference genome.
msi_baseline : MSI baseline file (see example under "doc/mSINGS_TCGA.baseline"), which describes the average and standard deviation of the number of expected signal peaks at each locus, as calculated from an MSI negative population (blood samples or MSI negative tumors). Users generate this file with the msi create_baseline script (see below). IMPORTANT NOTE: Baseline statistics vary markedly from assay to assay and lab to lab. It is CRITICAL that you prepare a baseline file that is specific for your analytic process, and for which data have been generated using the same protocols.
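The bed file conventions above can be checked before a run. The following is a minimal sketch, assuming a tab-delimited bed file with chromosome, start and end in the first three columns. The helper name check_msi_bed is hypothetical and is not part of mSINGS; it only flags "chr"-style chromosome names, header lines, and intervals whose start is not smaller than their end.

```shell
#!/usr/bin/env bash
# check_msi_bed BED_FILE
# Hypothetical helper (not shipped with mSINGS). Assumes a tab-delimited
# bed file with chromosome, start and end in columns 1-3.
# Prints one message per problem and returns non-zero if any were found.
check_msi_bed() {
    local bed="$1"
    awk -F'\t' '
        # Chromosome names must be bare numbers, X or Y (no "chr" prefix).
        $1 !~ /^([0-9]+|X|Y)$/ {
            printf "line %d: unsupported chromosome name \"%s\"\n", NR, $1
            bad = 1
        }
        # Start and end must be integers; this also rejects a header line.
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d: start/end are not integers\n", NR
            bad = 1
            next
        }
        # The interval must not be empty or reversed.
        ($2 + 0) >= ($3 + 0) {
            printf "line %d: start must be smaller than end\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$bed"
}

# Example call:
#   check_msi_bed /path/to/CUSTOM_MSI_BED && echo "bed file looks consistent"
```

The check does not verify sort order or that coordinates match the reference genome; those still need to be confirmed separately.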
For each sample run, the following will be produced:
For the entire run, a "top level" output represented as a binary matrix of interpreted instability (1) or stability (0) at each locus is provided if the count_msi.py function is run. Loci with insufficient coverage for instability calling are left blank. Summary statistics and interpretation of results are provided.
This protocol will run the pipeline using the baseline file and microsatellite loci identified for TCGA exome data. Files specific for analysis of TCGA exome data are provided in the doc/ directory of this package.
source /path/to/your/msings-virtual-environment/bin/activate
REF_GENOME=/path/to/REF_GENOME;
multiplier=2.0
"multiplier" is the number of standard deviations from the baseline that is required to call instability
msi_min_threshold=0.2
"msi_min_threshold" is the maximum fraction of unstable sites allowed to call a specimen MSI negative
msi_max_threshold=0.2
"msi_max_threshold" is the minimum fraction of unstable sites allowed to call a specimen MSI positive
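To make the roles of these three parameters concrete, here is a small sketch of how they could be applied. It is an illustration under stated assumptions, not the mSINGS implementation: the function names are hypothetical, a locus is assumed to be unstable when its peak count exceeds the baseline mean plus multiplier times the standard deviation, and the handling of fractions that fall exactly on a threshold (and the "indeterminate" label for fractions between the two thresholds) is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the parameters; not mSINGS code.

# locus_is_unstable PEAKS BASELINE_MEAN BASELINE_SD MULTIPLIER
# Succeeds (exit 0) when PEAKS > BASELINE_MEAN + MULTIPLIER * BASELINE_SD.
locus_is_unstable() {
    awk -v p="$1" -v m="$2" -v s="$3" -v k="$4" \
        'BEGIN { exit !(p > m + k * s) }'
}

# msi_status UNSTABLE_LOCI SCORED_LOCI MIN_THRESHOLD MAX_THRESHOLD
# Prints POS, NEG or indeterminate for the fraction of unstable loci.
# Assumption: fraction >= MAX_THRESHOLD is positive, fraction < MIN_THRESHOLD
# is negative, anything in between is reported as indeterminate.
msi_status() {
    awk -v u="$1" -v n="$2" -v lo="$3" -v hi="$4" 'BEGIN {
        if (n == 0) { print "NA"; exit }      # no loci with enough coverage
        f = u / n
        if (f >= hi)     print "POS"
        else if (f < lo) print "NEG"
        else             print "indeterminate"
    }'
}

# Example calls:
#   locus_is_unstable 7 3.0 1.5 2.0 && echo "unstable"
#   msi_status 3 10 0.2 0.2
```

With both thresholds set to the same value, as in the defaults shown above, every sample is called either positive or negative; setting msi_min_threshold below msi_max_threshold leaves a band of fractions in between.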
/path/to/sampleA.bam
/path/to/sampleB.bam
/path/to/sampleC.bam
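Because every bam file needs an index, it can help to check the list before starting a run. The sketch below assumes the list holds one bam path per line and that the index sits next to each bam file as either sample.bam.bai or sample.bai. The helper name missing_bam_indexes is hypothetical and is not part of mSINGS.

```shell
#!/usr/bin/env bash
# missing_bam_indexes LIST_FILE
# Hypothetical helper (not shipped with mSINGS). Assumes one bam path per
# line and an index named sample.bam.bai or sample.bai beside each bam.
# Prints every listed bam path for which no index file was found.
missing_bam_indexes() {
    local list="$1" bam
    while IFS= read -r bam || [ -n "$bam" ]; do
        [ -z "$bam" ] && continue              # skip blank lines
        if [ ! -f "${bam}.bai" ] && [ ! -f "${bam%.bam}.bai" ]; then
            printf '%s\n' "$bam"
        fi
    done < "$list"
}

# Example call:
#   missing_bam_indexes PATH/TO/BAM_LIST
```

An empty result means an index was found for every listed file; any path that is printed would need to be indexed before it is analyzed.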
Default execution:
scripts/run_msings.sh PATH/TO/BAM_LIST PATH/TO/BEDFILE PATH/TO/REF_GENOME PATH/TO/MSI_BASELINE
If you already edited the run_msings.sh script to point to the reference files (either yours or the TCGA files in the doc/ folder), the script may be run as follows:
scripts/run_msings.sh PATH/TO/BAM_LIST
Files specific for analysis of TCGA exome data are provided in the doc/ directory of this package. To run mSINGS analysis using custom assays or custom targets, users are required to provide two custom files:
NOTE: loci PRESENT in the bed file that are ABSENT in the baseline file (created in step 8 below) will not be scored!
NOTE: In order to be compatible with the genomic coordinate system used by samtools mpileup, the coordinates of microsatellite tracts in the bed file should start at the position before the first base in the tract.
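One way to spot loci that would silently go unscored is to compare the two files directly. The sketch below rests on assumptions that should be checked against your own files: the baseline is taken to be tab-delimited with a locus key of the form chrom:start-end in its first column, and the bed file is taken to be tab-delimited with chromosome, start and end in columns 1 to 3. The helper name unscored_loci is hypothetical and is not part of mSINGS.

```shell
#!/usr/bin/env bash
# unscored_loci BED_FILE BASELINE_FILE
# Hypothetical helper (not shipped with mSINGS).
# ASSUMPTION: column 1 of the baseline holds keys shaped like chrom:start-end;
# adjust the key construction below if your baseline uses another layout.
# Prints each bed locus that has no matching key in the baseline.
unscored_loci() {
    local bed="$1" baseline="$2"
    awk -F'\t' '
        NR == FNR { seen[$1] = 1; next }       # first file: baseline keys
        { key = $1 ":" $2 "-" $3 }             # second file: bed intervals
        !(key in seen) { print key }
    ' "$baseline" "$bed"
}

# Example call:
#   unscored_loci /path/to/CUSTOM_MSI_BED /path/to/CUSTOM_MSI_BASELINE
```

Any locus printed here is present in the bed file but missing from the baseline, so it would not contribute to the result.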
The following instructions will allow users to set up analysis for their custom targets, to generate a custom baseline for those targets, and to run subsequent analysis. Recommendations for design of custom assays and custom targets are provided in the Recommendations_for_custom_assays.txt file packaged with the repository.
scripts/create_baseline.sh:
source /path/to/your/msings-virtual-environment/bin/activate
scripts/run_msings.sh:
source /path/to/your/msings-virtual-environment/bin/activate
samtools faidx ref_fasta
BEDFILE=/path/to/CUSTOM_MSI_BED; REF_GENOME=/path/to/REF_GENOME;
/path/to/sampleA.bam
/path/to/sampleB.bam
/path/to/sampleC.bam
Default execution:
scripts/create_baseline.sh PATH/TO/BAM_LIST PATH/TO/BEDFILE PATH/TO/REF_GENOME
If you already edited the create_baseline.sh script to point to your reference files, you can instead just run:
scripts/create_baseline.sh PATH/TO/BAM_LIST
RECOMMENDED: Now that the baseline file has been created, edit the baseline file to exclude loci which have standard deviations of zero.
NOTE: Loci are excluded from the baseline file if the number of samples is insufficient to calculate statistics.
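The recommended edit can also be scripted rather than done by hand. The following is a minimal sketch, assuming the baseline is a tab-delimited file whose first line is a header and whose standard deviation sits in column 2; both points are assumptions, so confirm them against your own baseline and pass a different column number if needed. The helper name drop_zero_sd is hypothetical and is not part of mSINGS.

```shell
#!/usr/bin/env bash
# drop_zero_sd BASELINE_FILE [SD_COLUMN]
# Hypothetical helper (not shipped with mSINGS).
# ASSUMPTIONS: tab-delimited file, first line is a header, and the standard
# deviation is in SD_COLUMN (defaults to 2).
# Writes the header plus every locus with a non-zero standard deviation.
drop_zero_sd() {
    local baseline="$1" col="${2:-2}"
    awk -F'\t' -v c="$col" '
        NR == 1 { print; next }                # keep the assumed header line
        ($c + 0) != 0 { print }                # keep loci with non-zero SD
    ' "$baseline"
}

# Example call (writes a new file and leaves the original untouched):
#   drop_zero_sd /path/to/CUSTOM_MSI_BASELINE > /path/to/CUSTOM_MSI_BASELINE.filtered
```

Writing to a new file keeps the unfiltered baseline available for comparison.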
The baseline construction process only needs to be done once per assay/target data set. Files may be saved and re-used for subsequent analyses.
BEDFILE=/path/to/CUSTOM_BEDFILE; MSI_BASELINE=/path/to/CUSTOM_MSI_BASELINE; REF_GENOME=/path/to/REF_GENOME;
Default execution:
scripts/run_msings.sh PATH/TO/BAM_LIST PATH/TO/BEDFILE PATH/TO/REF_GENOME PATH/TO/MSI_BASELINE
If you already edited the run_msings.sh script to point to your reference files:
scripts/run_msings.sh PATH/TO/BAM_LIST
Test to ensure proper installation of scripts:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

cd msings
source msings-env/bin/activate
./testall

Ran 9 tests in 0.068s
OK