gymreklab / STRDenovoTools

Toolkit for calling and analyzing de novo STR mutations
GNU General Public License v3.0
13 stars 4 forks source link

Build Status DOI

MonSTR

MonSTR is a toolkit for identifying and analyzing de novo tandem repeat mutations. The core code for calling de novos and parsing VCF files is modified from that originally written by Thomas Willems in the HipSTR repository.

Authors:

Table of contents:

  1. Installation instructions
  2. Basic usage
  3. Preparing your input VCF file
  4. Detailed filtering options
  5. Specifying mutation models
  6. Output file formats
  7. Additional scripts
  8. Citing MonSTR

Install

You can install MonSTR using the following commands:

git clone https://github.com/gymreklab/STRDenovoTools  
cd STRDenovoTools
mkdir build
cd build
cmake ..
make
sudo make install

You will need to have a C-compiler and cmake installed. Type MonSTR --help to check that MonSTR installed successfully.

Basic usage

MonSTR takes as input a VCF file generated by HipSTR or GangSTR and a pedigree file in .fam format describing family relationships. It outputs text files describing mutation probabilities at each locus in each trio.

Important The VCF files must contain genotype likelihoods unless running in naive mode. The VCFs must have been run on all samples in a family jointly so that their genotype likelihood fields contain info for the same set of alleles. See "Preparing your VCF file" below. Sample IDs in the pedigree file must match exactly to those in the VCF header.

Below shows a basic MonSTR command:

MonSTR \
    --strvcf test.vcf.gz \
    --fam family.fam \
    [--gangstr] \
    --out test

By default, a HipSTR VCF is expected. If using a GangSTR VCF, use the option --gangstr. VCFs must be sorted and indexed using tabix. This can be accomplished by:

bgzip file.vcf
tabix -p vcf file.vcf.gz

The following example command can be run using test data available in the TestData folder of this repository:

MonSTR \
  --strvcf TestData/Q3.filtered.vcf.gz \
  --fam TestData/TestQuadFam.ped \
  --out test \
  --gangstr

Type MonSTR --help, or the documentation below, to see more detailed information about which options are available.

Preparing your VCF file

By default, MonSTR computes a joint likelihood of all possible genotype configurations at each locus in trio. To do so, it must have detailed genotype likelihood information specifying the likelihood of each possible diploid genotype, and the set of possible genotypes considered must be identical in each family member. Therefore, when generating your input VCF file with HipSTR or GangSTR, you must:

The sample IDs in the VCF output file must match those in the .fam file and should not be repeated across families.

Note, GangSTR only estimates the number of repeat units. It will not output sequence differences, or differences from the reference allele consisting of a non-integer number of repeats. HipSTR can detect sequence differences, or non-integer changes in repeat copy number. To collapse alleles or the same length, or round alleles to the nearest repeat copy number, use the following options:

Filtering options

MonSTR provides the following filtering options:

General filters:

HipSTR-specific filters:

GangSTR-specific filters:

MonSTR will only process a trio of all samples (mother, family, and child) pass all specified filters.

Specifying custom mutation models

By default, MonSTR uses a statistical model to compute a posterior probabilty of mutation at each TR in each child ("Model-based calling" below). You may also use a naive method ("Naive calling") to identify mutations.

Model-based calling

MonSTR computes a a joint likelihood across parent and child genotypes for all possible genotype configurations compatible with either Mendelian inheritance or a mutation occurring. It then uses these likelihoods to compute the posterior probability of a mutation. Besides genotype likelihoods, which are computed by GangSTR or HipSTR, there are three additional components to the mutation model used:

Alternatively you can set per-locus parameters as described below.

In addition to providing global mutation parameters, you may also specify a file with per-locus parameters using the option --mutation-models <FILE>. This file should be whitespace delimited with a single line per locus and no header line. The file should have the following columns present in each line (in order): chromosome, TR start position, TR end position, repeat unit length (bp), log10 mutation rate, beta, geomp, central_allele. The start coordinate must match exactly to the VCF INFO:START field in the HipSTR VCF for each locus.

Naive calling

In addition to model-based mutation calling, MonSTR implements a naive mutation detection algorithm.

Chromosome X

If analyzing mutations on the X chromosome, specify the --chrX option. This assumes the entire VCF is for chromosome X. Sex information must be available for children. You can use either model-based or naive mutation calling on chromosome X.

If using --chrX, it is recommended that males have haploid calls. If you are using GangSTR, you can use --bam-samps and --samp-sex to perform sex-aware calling of sex chromosomes.

Output formats

MonSTR outputs files <OUTPREFIX>.all_mutations.tab and <OUTPREFIX>.locus_summary.tab (<OUTPREFIX> is specified by the --out option). The columns in these files are described below.

The following options also control which loci/mutations are output:

<OUTPREFIX>.all_mutations.tab

This file contains one line per TR per trio. It contains the following columns:

Column header Description
chrom The chromosome of the TR
pos The postion of the TR
period The length of the repeat unit (in bp)
prior The prior probability of mutation
family The family ID of the trio (taken from the .fam file)
child The sample ID of the child in the trio
phenotype The phenotype of the child, taken from the .fam file
posterior The posterior probability of mutation in this child at this TR locus. If using naive calling, this is always set to 1 for mutations, else 0. If using naive expansion detection, this is set to -1 for candidate expansions.
newallele The allele arising from a de novo mutation (number of repeat units). This field is only meaningful if there was a mutation inferred.
mutsize The size of the mutation (number of repeat units). This field is only meaningful if there was a mutation inferred.
inparents Set to 1 if the new allele is found in the parents, else 0. This field is only meaningful if there was a mutation inferred.
poocase Gives the inferred parent of origin of the mutation, if a mutation was inferred. 0: unknown; 1: Mendelian (no denovo); 2: New allele from father; 3: New allele from mother; 4: Unclear
isnew Set to 1 if the new alleles is not found in any control samples in the input VCF. Otherwise set to 0. This field is only meaningful if there was a mutation inferred.
case_count Count of the new allele in cases (phenotype=2). This field is only meaningful if there was a mutation inferred.
ctrl_count Count of the new allele in controls (phenotype=1). This field is only meaningful if there was a mutation inferred.
unk_count Count of the new allele in samples with unknown phenotype (phenotype=0). This field is only meaningful if there was a mutation inferred.
child_gt Genotype of the child sample.
mat_gt Genotype of the mother.
pat_gt Genotype of the father.
encl_child Number of enclosing reads of the new allele in the child. This field is only meaningful if there was a mutation inferred.
encl_mother Number of enclosing reads of the new allele in the mother. This field is only meaningful if there was a mutation inferred.
encl_father Number of enclosing reads of the new allele in the father. This field is only meaningful if there was a mutation inferred.
encl_parent Number of encloseing reads of the new allele in the parent the mutation came from (see poocase).
long_mother Indicates whether the longer vs. shorter allele was transmitted from the mother. This is only relevant if the mother is heterozygous. 0=unknown; 1=shorter; 2=longer.
long_father Indicates whether the longer vs. shorter allele was transmitted from the father. This is only relevant if the father is heterozygous. 0=unknown; 1=shorter; 2=longer.

<OUTPREFIX>.locus_summary.tab

This file contains a single line per TR. It contains the following columns:

Column header Description
chrom The chromosome of the TR
pos The postion of the TR
period The length of the repeat unit (in bp)
ref_allele_size The number of repeats in the reference genome
num_alleles_bylength The number of possible alleles at the locus, collapsing all alleles with the same length.
num_alleles_byseq The number of possible alleles at the locus. For GangSTR, this will always be the same as num_alleles_by_length since it only considers length differences.
het_by_seq Heterozygosity of the locus
total_children The total number of children (trios) analyzed at the locus.
total_mutations The total number of mutations called at the locus.
total_mutation_rate The estimated mutation rate at the locus, based on the number of observed mutations and transmissions.
affected_children The total number of affected children (trios) analyzed at the locus.
affected_mutations The total number of mutations called at the locus in affected samples.
affected_new_mutations The total number of mutations called at the locus in affected samples resulting in alleles unobserved in the rest of the cohort.
affected_mutation_rate The estimated mutation rate at the locus, based on the number of observed mutations and transmissions and considering only affected samples.
unaffected_children The total number of unaffected children (trios) analyzed at the locus.
unaffected_mutations The total number of mutations called at the locus in unaffected samples.
unaffected_new_mutations The total number of mutations called at the locus in unaffected samples resulting in alleles unobserved in the rest of the cohort.
unaffected_mutation_rate The estimated mutation rate at the locus, based on the number of observed mutations and transmissions and considering only unaffected samples.
p-value P-value based on a Fisher's exact test testing if mutations are enriched in affecteds vs. unaffecteds.
children_with_mutations Comma-separated list of children with mutations. Format of each child is famid:phenotype:newallele:controlcount,casecount,unknowncount.
aff_tdt Keep track of how often the long vs. short allele were transmitted from mother and father in affected children. Format is: counts for 0=unknown/1=short/2=long mother:father. This summarizes counts from long_mother and long_father in the all mutations file.
unaff_tdt Keep track of how often the long vs. short allele were transmitted from mother and father in unaffected children. Format is: counts for 0=unknown/1=short/2=long mother:father. This summarizes counts from long_mother and long_father in the all mutations file.

Additional scripts

See scripts/ for additional scripts:

Cite

If you use this method for your research, please cite: Mitra, et al. 2020 https://www.biorxiv.org/content/10.1101/2020.03.04.974170v1