skandlab / SMuRF

MIT License
21 stars 7 forks source link

SMuRF v3.0

By Skandlab

Genome Institute of Singapore, A*STAR

Check out the latest SMuRF version here


Introduction

SMuRF R package predicts a consensus set of somatic mutation calls using RandomForest machine learning. SMuRF generates a set of point mutations and insertions/deletions (indels) trained on the latest community-curated tumor whole genome sequencing data (Alioto et. al., 2015, Nat. Comms.). Our method is fast and accurate and analyses both whole-genome and whole-exome sequencing data from different cancer types.

For more information see our Bioinformatics paper: https://doi.org/10.1093/bioinformatics/btz018

Citation
Huang W, Guo YA, Chang MM and Skanderup AJ. Ensemble-Based Somatic Mutation Calling in Cancer Genomes. In: Boegel S, editor. Bioinformatics for Cancer Immunotherapy: Methods and Protocols. New York, NY: Springer US; 2020. p. 37-46.

Huang W, Guo YA, Muthukumar K, Baruah P, Chang MM and Skanderup AJ. SMuRF: Portable and accurate ensemble prediction of somatic mutations. Bioinformatics (Oxford, England). 2019:btz018-btz. doi:10.1093/bioinformatics/btz018.


Table of contents

Input from bcbio-nextgen pipeline
Input directly from VCF Callers (optional)
Test Dataset
Requirements
Installation
Parameters
Running SMuRF: Selecting the correct input vcfs
Running SMuRF: Detecting and changing genome build
Running SMuRF: Tweaking SMuRF score cut-off
Output format
Running on multiple samples



Input from bcbio-nextgen pipeline

Before running SMuRF, you require output data from the bcbio-nextgen pipeline that generates the VCF output for the variant callers: MuTect2, FreeBayes, VarDict, VarScan and the latest Strelka2. An additional caller Strelka2, has been added since SMuRF 2.0 and the information is documented on our wiki page.

SMuRF v1.6.4 is still available here: SMuRFv1.6.4
SMuRF v1.6.4 wiki page: readme file

Note that your vcf.gz files need to be tab-indexed (.tbi files required) for retrieving gene annotations in SMuRF. We would recommend the bcbio-nextgen pipeline for a better user experience. See Running SMuRF: Selecting the correct input vcfs for more information.

SMuRF requires the VCF output from each caller (.vcf.gz) to be placed in the same directory and files tagged with the caller (eg. sample1-mutect.vcf.gz, sample1-freebayes.vcf.gz, sample1-vardict.vcf.gz, sample1-varscan.vcf.gz)


Input directly from VCF Callers (optional)

For Users not running bcbio-nextgen pipeline: Alternatively, install and execute the individual callers.

Refer to the installation and instructions for each caller:
- VarDict
- VarScan
- MuTect2
- FreeBayes
- Strelka2


Test Dataset

In this vignette, we utilise a partial output dataset derived from the chronic lymphocytic leukemia (CLL) data downloaded from the European Genome-phenome Archive (EGA) under the accession number EGAS00001001539. The dataset for testing the package is provided in the SMuRF package.


Requirements

R 3.3 & 3.4 : bioconductor::VariantAnnotation

R >=3.5 : BiocManager::VariantAnnotation

h2o package : If h2o package takes some time to download/install (~350MB), try manually installing from their AWS page.


Installation


1. The latest version of the package is updated on Github https://github.com/skandlab/SMuRF

  1. You can install the current SMuRF directly from Github via the following R commands:
    #devtools is required
    install.packages("devtools")
    library(devtools)
    install_github("skandlab/SMuRF", subdir="smurf")


(Alternative option) SMuRF installation via downloading of the package from Github:

#Clone or download package from Github https://github.com/skandlab/SMuRF/tree/master/smurf and save to your local directory
install.packages("my/current/directory/smurf", repos = NULL, type = "source")


SMuRF concurrently predicts single somatic nucleotide variants (SNV) as well as small insertions and deletions (indels) and saves time by parsing the VCF files once.

_Missing packages will be installed the first time you run SMuRF._

library("smurf") #load SMuRF package

smurf() #check version and parameters

# "SMuRFv3.0 (16th Jan 2024)"
smurf(directory=NULL, mode=NULL, nthreads = -1,
                 annotation=F, output.dir=NULL,  parse.dir=NULL,
                 snv.cutoff = 'default', indel.cutoff = 'default',
                 build=NULL, change.build=F, find.build=F,
                 t.label=NULL, re.tabIndex=F,
                 check.packages=T, file.exclude=NULL)

myresults = smurf(mydir, 'combined', build='hg19') #save output into 'myresults' variable

back to top


Parameters

Arguments Description
directory Choose directory where the Variant Caller Format(VCF) files are located
output.dir Path to output directory (if saving files as .txt)
parse.dir Specify if changing SMuRF default cutoffs. Path to the location of existing snv-parse.txt and indel-parse.txt files generated by SMuRF
mode Choose "snv", "indel" or "combined" (snv+indel). "combined" provides a separate list of SNVs and indels.
annotation TRUE or FALSE (default). Provide gene annotations for each variant call.
nthreads Number of cores used for RandomForest prediction. Default (-1) for maximum number of cores. For 32-bit Windows, only 1 core is allowed (nthreads=1).
t.label (Optional) Provide the sample name for your tumour sample to ease the identification of the normal and tumour sample names in your vcf
file.exclude (Optional) Additional keywords in file directory names to be filtered.
build Specify your human genome build: build="hg19" or build="hg38"
change.build TRUE or FALSE (default). For conversion of your genomic coordinates
find.build TRUE or FALSE (default). Additional genome build check for the annotation step.
snv.cutoff Default SMuRF_score cutoff for the SNV model unless a number between 0 to 1 is stated
indel.cutoff Default SMuRF_score cutoff for the INDEL model unless a number between 0 to 1 is stated
re.tabIndex TRUE or FALSE (default). Set to TRUE to create tab-indexed (.tbi) files for each vcf
check.packages=T Developer mode

For more information on the parameters see R documentation:

help(smurf)


Examples:

library("smurf") #load SMuRF package

 myresults = smurf(directory="/path/to/directory..",
                   mode="snv", #snv only
                   output.dir="/path/to/output", #saving your output
                   build='hg19')

 #Include gene annotations for coding regions in output
 myresults = smurf(directory="/path/to/directory..",
                   mode="combined", #snv and indel predictions
                   annotation=T, #generate gene annotations
                   build='hg19')

back to top


Running SMuRF: Selecting the correct input vcfs


SMuRF requires 5 caller VCF (vcf.gz) files as input stated under the "directory" parameter. Provide a path to a directory containing all 5 caller VCF files. caller.vcf.gz (compressed) and caller.vcf are accepted formats.

The tab-indexed (.tbi) files for each caller are required for the parsing step. If the .tbi files are missing, specify using re.tabIndex=T on SMuRF to generate these files.

myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode ="snv", nthreads = 1, annotation = T, build = 'hg19',
                  re.tabIndex = T)  #generate .tbi files in directory
#"Generating .tbi files in directory..."
# Connection successful!

#If the vcf files are in different directories:

#Specify directories manually
 dir.list = list(mutect='/path1/to/mutect.vcf.gz',
                 freebayes='/path2/to/freebayes.vcf.gz',
                 vardict='/path3/to/vardict.vcf.gz',
                 varscan='/path4/to/varscan.vcf.gz',
                 strelka='/path5/to/strelka.vcf.gz')
 myresults = smurf(directory=dir.list, 
                   mode="combined", build='hg19')


In some cases, your input directory may contain other VCF files generated by bcbio. For example, germline VCF files, copy-number related files, older version VCFs. An exclusion file.exclude can be added to make sure that SMuRF selects the correct VCF files.

list.files(directory)
# sample1.mutect.vcf.gz
# sample1.mutect-germline.vcf.gz #to be excluded
# sample1.freebayes.vcf.gz
# sample1.vardict.vcf.gz
# sample1.varscan.vcf.gz
# sample1.varscan-version1.vcf.gz #to be excluded
# sample1.strelka.vcf.gz
# sample1.strelka-archive.vcf.gz #to be excluded

myresults = smurf(directory="/path/to/directory..", 
                  file.exclude = c("germline","version1","archive") #keywords in file name to be excluded
                  mode="snv",
                  output.dir="/path/to/output", build='hg19')


It is optional to indicate your normal and tumour sample labels. SMuRF detects your normal and tumour sample names in order to generate variant allele frequency information. If this information is missing in your VCF headers, SMuRF will terminate with an error. State your unique tumour sample label using t.label.

Possible normal/tumour sample labels:

sample1-N, sample1-T
sample1_normal, sample1_tumour
sample1.healthy, sample1.cancer

myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode ="combined", nthreads = 1, build = 'hg19',
                  t.label = 'tumour' #optional if labels were detected from vcf headers correctly
                  )

back to top


Running SMuRF: Detecting and changing genome build


The genome build for your sample must be specified ( build='hg19' or build='hg38' ).

hg19 also refers to the Genome Reference Consortium Human Build 37 (GRCh37)
hg38 also refers to the Genome Reference Consortium Human Build 38 (GRCh38)

The genome build stated in SMuRF will be cross-checked with the build used in your VCF files.

myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode ="combined", nthreads = 1, annotation = T, 
                  build = 'hg38' #wrong build stated
                  )
# "Genome build stated in SMuRF:"
# "hg38"
# "Ref genome used in vcf:"
# "file:///home/projects/13001264/softwares/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
# "Warning: build provided does not match ref genome used in vcf. SMuRF CDS annotation may not run properly if genome build is incorrect."
# "Final genome build used for analysis: hg38"
# 
# Warning message


If you are unsure of the genome build used in your analysis, specify find.build=T.

myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode ="combined", nthreads = 1, annotation = T, 
                  build = 'hg38', #wrong build stated
                  find.build = T, #if unsure of genome build
                  )
# "Genome build stated in SMuRF:"
# "hg38"
# "Ref genome used in vcf:"
# "file:///home/projects/13001264/softwares/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
# "Warning: build provided does not match ref genome used in vcf. SMuRF CDS annotation may not run properly if genome build is incorrect."
# "Changing build variable provided"
# "hg38 -> hg19"
# "Final genome build used for analysis: hg19"

# No errors


Samples from different batches may be aligned to a different genome reference build. In order to standardize your gene annotations and output, specify change.build for genome build conversion.

myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode ="combined", nthreads = 1, annotation = T, 
                  build = 'hg19',
                  change.build = T, #genome build conversion
                  )
# "Genome build stated in SMuRF:"
# "hg19"
# "Ref genome used in vcf:"
# "file:///home/projects/13001264/softwares/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
# "Final genome build used for analysis: hg19"

# "Compiling annotations"
# "Changing annotations from hg19 to hg38"

back to top


Running SMuRF: Tweaking SMuRF score cut-off


SMuRF v3.0 is fine-tuned to achieve the max f1 score in our test set.

Re-adjust the stringency of the prediction with a specific cut-off value. Use parameters snv.cutoff or indel.cutoff to adjust the thresholds (higher cut-off provide a smaller set of calls with better confidence).

To re-adjust the cut-off value of an existing SMuRF output, simply provide the parse.dir to the snv-parse and indel-parse files for re-processing.

#start with default cutoffs
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode="combined", 
                  snv.cutoff='default', indel.cutoff='default',
                  output.dir = 'C:/Users/admin/myresults') 

#modify cutoff from existing SMuRF parse files
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode="combined",
                  snv.cutoff=0.2, indel.cutoff=0.1, #specify new cutoffs
                  parse.dir = 'C:/Users/admin/myresults', #SMuRF path existing parse.txt files
                  output.dir = 'C:/Users/admin/myresults2' #new output) 

#Plot histogram
hist(as.numeric(myresults$smurf_indel$predicted_indel[,'SMuRF_score']), main = 'Re-adjusted predicted indels', xlab = 'SMuRF_score', col = 'grey50')

back to top


Output format

Output files available include:

  1. Parsed-raw file (parse)

  2. Predicted positive mutations (predicted)

  3. Predicted positive mutations with annotations (annotated)* #for smurf's "cdsannotation" function only

  4. Variant statistics (stats)

  5. Time taken (time)

#Viewing predicted output in R

myresults$smurf_snv$predicted_snv

myresults$smurf_indel$predicted_indel

#see column description below
Column Description
Chr Chromosome number
START_POS_REF/END_POS_REF Start and End nucleotide position of the somatic mutation
REF/ALT Consensus Ref and Alt nucleotide changes of the highest likelihood
REF_MFVdVs/ALT_MFVdVs Reference and Alternative nucleotide changes from each caller; Mutect2 (M), Freebayes (F), Vardict (Vd), Varscan (Vs) and Strelka2 (not abbreviated to preserve column name)
FILTER Pass (TRUE) or Reject (FALSE) [boolean] mutation calls from the individual callers
Sample_Name Sample name is extracted based on your labeled samples in the vcf files
Alt_Allele_Freq Mean Variant allele frequency calculated from the tumor reads of the callers
Depth ref/alt N/T Mean read depth from the N/T sample for ref/alt alleles
SMuRF_score SMuRF confidence score of the predicted mutation


myresults$smurf_indel$stats_indel

#             Passed_Calls
# Strelka2             466
# Mutect2              232
# FreeBayes            306
# VarDict              483
# VarScan             1273
# Atleast1            2431
# Atleast2             278
# Atleast3              43
# Atleast4               7
# All5                   1
# SMuRF_INDEL           88

myresults$smurf_snv$stats_snv

#           Passed_Calls
# Strelka2          1362
# Mutect2           1539
# FreeBayes          216
# VarDict            239
# VarScan           1734
# Atleast1          4017
# Atleast2           928
# Atleast3            60
# Atleast4            48
# All5                37
# SMuRF_SNV         1043


We added gene annotations using SnpEff (from bcbio) and SMuRF extracts the coding annotations from the canonical transcripts with the highest fucntional impact. Take note that your vcf.gz files should be tab-indexed (see Running SMuRF: re.tabIndex).

myresults = smurf(mydir, "cdsannotation") #runs SMuRF for SNV and indels + generate annotations

myresults$smurf_snv_annotation$annotated[order(myresults$smurf_snv_annotation$annotated$REGION)[1:2],]
#    Chr START_POS_REF END_POS_REF REF ALT   REF_MFVdVs   ALT_MFVdVs FILTER_Mutect2 FILTER_Freebayes FILTER_Vardict
# 52   1      77806132    77806132   G   A    G/G/G/G/G    A/A/A/A/A           TRUE             TRUE           TRUE
# 81   1     170961432   170961432   C   T C/NA/NA/NA/C T/NA/NA/NA/T           TRUE            FALSE          FALSE
#    FILTER_Varscan FILTER_Strelka2     Sample_Name Alt_Allele_Freq N_refDepth N_altDepth T_refDepth T_altDepth Allele
# 52           TRUE            TRUE icgc_cll_tumour           0.500         14          0         15         15      A
# 81          FALSE            TRUE icgc_cll_tumour           0.467         33          0         16         14      T
#          Annotation   Impact Gene_name         Gene_ID Feature_Type      Feature_ID Transcript_BioType  Rank    HGVS.c
# 52 missense_variant MODERATE       AK5 ENSG00000154027   transcript ENST00000354567     protein_coding  6/14  c.770G>A
# 81 missense_variant MODERATE     MROH9 ENSG00000117501   transcript ENST00000367759     protein_coding 12/22 c.1156C>T
#         HGVS.p  cDNA.pos   CDS.pos  AA.pos Distance REGION SMuRF_score
# 52 p.Arg257His 1033/3248  770/1689 257/562        .    CDS   0.9083840
# 81 p.Arg386Cys 1310/3165 1156/2586 386/861        .    CDS   0.8107475


Time taken for your run:

myresults$time.taken

<!-- Time difference of 20.52405 secs -->


The raw parsed output:

myresults$smurf_indel$parse_indel

myresults$smurf_snv$parse_snv


Indicate the output.dir to save the SMuRF output as tab-delimited .txt files in your targeted directory.

myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode="combined", 
                  output.dir = 'C:/Users/admin/myresults' #path to output directory
                  ) 

back to top


Running on multiple samples

Iterate over multiple samples by providing the list of directories of where your sample files are located.

project.dir = 'path/to/my/dir'
list.files(project.dir)
# sample_A
# sample_B
# sample_C
samples = c('sample_A', 'sample_B', 'sample_C') #sample dir where vcf files are located

for(i in 1:length(samples)) {
 smurf(directory=paste0(project.dir, '/', samples[i]),
        mode="combined", build='hg19', annotation = T,
        output.dir = paste0('C:/Users/admin/myresults/',samples[i]))
 } 

Running SMuRF on multiple samples on a cluster (parallel multi-core instance)

install.packages("foreach")
install.packages("doParallel")
install.packages("doSNOW")

library(foreach)
library(doParallel)
library(doSNOW)
library(smurf)

project.dir = 'path/to/my/dir'
samples = Sys.glob(paste0(project.dir,'/*'))

#setup parallel backend to use many processors
cores=detectCores()
cl <- makeCluster(cores[1]-1) #not to overload your computer
registerDoParallel(cl)

foreach(i=1:length(samples), .packages = 'smurf', .verbose = F) %dopar% {
print(i)
  smurf(directory = paste0(project.dir, '/', samples[i]),
      mode ="combined", nthreads = 1, build = 'hg19',
      output.dir = paste0('C:/Users/admin/myresults/',samples[i]))
)
}
stopCluster(cl)
h2o.shutdown()


For errors and bugs, please report on our Github page.

back to top