mskilab-org / Ppurple

Probabilistic purity ploidy estimation
4 stars 1 forks source link

Build Status codecov.io

Ppurple

Probabilistic purity ploidy estimation

Installation

  1. Install dependent packages and latest Bioconductor (if you haven't already)
source("https://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
  1. Install devtools from CRAN (if you don't have it already)
install.packages('devtools')
  1. Install mskilab R dependencies (gUtils and bamUtils)
devtools::install_github('mskilab/gUtils)
devtools::install_github('mskilab/bamUtils)
  1. Install Ppurple
devtools::install_github('mskilab/Ppurple')

Tutorial

Ppurple uses EM to infer purity and ploidy from total coverage and heterozygote counts, provided as data.frames, data.tables, or GRanges. It returns a data.table of ranked solutions, associated with a posterior probability.

Load Ppurple

library(Ppurple)

Load coverage, hets, and segs

> cov = fread(system.file("extdata", "coverage.csv", package = "Ppurple"))
> head(cov)
seqnamesstartendwidthstrandy
1 79401 79600 200 * 1.5395624
1 531201 531400 200 * 1.9518525
1 555401 555600 200 * 0.8810031
1 571601 571800 200 * 0.8270474
1 573601 573800 200 * 1.0702993
1 617801 618000 200 * 1.3891481
> hets = fread(system.file("extdata", "hets.csv", package = "Ppurple"))
> head(hets)
seqnamesstartendwidthstrandALTREFaltref
1 779322 7793221 * G A 36 37
1 998395 9983951 * G A 27 33
1 998501 9985011 * C G 21 29
1 115827711582771 * A G 29 23
1 116066511606651 * A G 27 24
1 120661912066191 * A C 1 52

Run Ppurple without precomputed segs

> pp = ppurple(cov = cov, hets = hets, verbose = TRUE)
Segments not provided so doing internal segmentation via DNAcopy
sending  92654  segments to DNAcopy
... ...
Ppurple EM iteration 3 :
        LL diff:58.2742695834022 tol: 1
Ppurple EM iteration 4 :
        LL diff:0.912960716173984 tol: 1

output is a data.table of solutions and their probabilities, showing the most likely solution in the first row

> pp[1,]
purityploidyprob
0.493.861

Run Ppurple with pre-computed segs

> segs = fread(system.file("extdata", "segs.csv", package = "Ppurple"))
> head(segs)
seqnamesstartendwidthstrandIDnum.markseg.mean
1 79401 6376801 6297401* Sample.1183 0.0683
1 6437801 8702401 2264601* Sample.1 76 -0.0967
1 8758201 9043001 284801* Sample.1 9 0.1808
1 90802012004680110966601* Sample.1347 -0.1132
1 2011320123387801 3274601* Sample.1102 0.0931
1 234212014815440124733201* Sample.1795 -0.1152
> pp = ppurple(cov = cov, hets = hets, segs = segs, verbose = TRUE)
Fitting initial grid of 11 purity and 21 ploidy combinations.
Hapseg iteration 1:
        LL diff: 1e+100 tol:1
Hapseg iteration 2:
        LL diff: 5.17692387802526e-06 tol:1
Running ppemgrid with 11 purities ranging from 0 to 1 and 21 ploidies ranging from 1 to 5 with rho of 1.00978885796089 het rho of 22.283605.
... ...
Ppurple EM iteration 3 :
        LL diff:58.6840663431212 tol: 1
Ppurple EM iteration 4 :
        LL diff:0.864216329413466 tol: 1
>  pp[1,]
purityploidyprob
0.493.861

Attributions

Marcin Imielinski - Assistant Professor, Weill Cornell Medicine; Core Member, New York Genome Center

Aditya Desphande - Tri-I CBM PhD candidate, Weill Cornell Medicine