tdhock / PeakSegFPOP-paper

Functional pruning optimal partioning algorithm for PeakSeg constrained segmentation model
1 stars 0 forks source link

This repository contains the source code for our papers describing log-linear time algorithms for solving the PeakSeg constrained changepoint detection problem.

** [[https://arxiv.org/abs/1703.03352][arXiv:1703.03352]] A log-linear segmentation algorithm for peak detection in genomic data

Updated pre-print: [[file:jmlr-paper.pdf]], [[file:jmlr-supplementary.pdf]]

The labeled data we used to compute peak detection accuracy are http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/

The fold IDs that we used in our four-fold cross-validation experiment to compute test AUC/accuracy are listed in http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/4foldcv-test-folds.csv

To recompute the figures from the arXiv paper, execute the code in the following R scripts.

To recompute other figures:

Note that the [[file:Makefile]] shows how to re-make some intermediate RData files based on the [[http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/][ChIP-seq benchmark data]].

** [[https://arxiv.org/abs/1810.00117][arXiv:1810.00117]] Generalized Functional Pruning Optimal Partitioning (GFPOP) for Constrained Changepoint Detection in Genomic Data

The [[https://github.com/tdhock/feature-learning-benchmark][benchmark data we used in this paper]] are much larger than the previous paper. To re-compile the figures and paper you can type "make jss-paper.pdf"

** TODOs

** 12 Aug 2022

[[file:2022-06-20-paris-time-complexity.tex]] makes [[file:2022-06-20-paris-time-complexity.pdf]] slides with summary of results from three recent papers for talk in Paris.

Updated: [[file:2022-09-28-tucson-slides.tex]] makes [[file:2022-09-28-tucson-slides.pdf]]

Title: Time complexity analysis of recently proposed algorithms for optimal changepoint detection

Slides: https://github.com/tdhock/PeakSegFPOP-paper/blob/master/2022-09-28-tucson-slides.pdf

Abstract: Several dynamic programming algorithms have been recently proposed for efficiently solving various problems related to optimal changepoint detection in large data sequences measured over space or time. In this talk we will discuss three recent papers which each provide a theoretical or empirical time complexity analysis. Our time complexity analysis shows that these linear and log-linear time algorithms can be used for analysis of huge data sequences, with ten million observations or more, which are now common in areas such as genomics.

** 24 July 2019 [[file:jss-figure-more-evals.R]] creates

[[file:jss-figure-more-evals.png]]

** 2 July 2019

[[file:2019-useR-slides.Rnw]] for useR conf.

** 22 July 2019 [[file:jss.more.evals.R]] attempts to answer the question from a reviewer: does the number of GFPOP evals depend on the max number of peaks?

** 18 Oct 2018

** 28 Mar 2017

Ideas about how to define a state graph for a solver of the Optimal Partitioning problem with affine constraints between adjacent segment means.

+BEGIN_SRC R

library(data.table)

Define an affine constraint function

g(from, to) = from.coeffrom + to.coefto + constant

to be used as g(from, to) <= 0,

e.g. from <= to is coded as 1from -1to + 0 <= 0.

affine.constraint <- function(from.coef, to.coef, constant){ data.table(from.coef, to.coef, constant) }

no.constraint <- affine.constraint(0, 0, 0) non.increasing <- affine.constraint(-1, 1, 0) non.decreasing <- affine.constraint(1, -1, 0)

state <- function(state.name){ structure(data.table(state.name), type="state") }

change <- function(from, to, constraint, penalty=NA){ structure(data.table(from, to, penalty, constraint), type="change") }

loss <- function(loss.name){ structure(data.table(loss.name), type="loss") }

start <- function(...){ structure(data.table(state.name=c(...)), type="start") }

end <- function(...){ structure(data.table(state.name=c(...)), type="end") }

unconstrained <- list( state("anything"), change("anything", "anything", no.constraint))

unconstrained.Gaussian <- c(unconstrained, list( loss("Gaussian")))

unconstrained.Poisson <- c(unconstrained, list( loss("Poisson")))

PeakSegFPOP <- list( loss("Poisson"), state("peak"), state("background"), start("background"), change("background", "peak", non.decreasing, penalty=0), change("peak", "background", non.increasing), end("background"))

PeakSegFPOP.start.or.end.up <- list( loss("Poisson"), state("peak"), state("background"), change("background", "peak", non.decreasing), change("peak", "background", penalty, non.increasing))

reduced.isotonic.regression <- list( state("anything"), change("anything", "anything", non.decreasing))

unimodal.regression <- list( state("can.change.up.or.down"), state("can.change.down"), change("can.change.up.or.down", "can.change.up.or.down", non.decreasing), change("can.change.up.or.down", "can.change.down", non.increasing), change("can.change.down", "can.change.down", non.increasing))

unimodal.at.least.one.up <- c(unimodal.regression, list( state("start"), start("start"), change("start", "can.change.up.or.down", non.decreasing)))

unimodal.at.least.one.up.and.down <- c(unimodal.at.least.one.up, list( end("can.change.down")))

checkModel <- function(model.list){ type.vec <- sapply(model.list, attr, "type") model.info <- sapply(unique(type.vec), function(type){ do.call(rbind, model.list[type.vec==type]) })

TODO error checking.

model.info

} checkModel(unimodal.at.least.one.up.and.down) checkModel(PeakSegFPOP)

TODO functions for plotting, solving.

GFPOP(model, data.vec, weight.vec, penalty)

+END_SRC

26 Jan 2017 Guillaume's group meeting presentation slides http://members.cbio.mines-paristech.fr/~thocking/HOCKING-PeakSegFPOP-pipeline-slides.pdf 10 Nov 2016

[[http://bl.ocks.org/tdhock/raw/9311ca39d643d127e04a088814c81ee1/][Data viz with smooth transitions, clarified titles]].

[[http://bl.ocks.org/tdhock/raw/7b595e74d059eb2e066d46a90c5b7724/][Revised interactive data viz]].

** 9 Nov 2016

[[http://bl.ocks.org/tdhock/raw/9a6ac163b8610314ed8e9751937ecea9/][Interactive data viz to explain supervised penalty learning for peaks]].

** 15 Aug 2016

Test accuracy and AUC data viz, explains why Segmentor gets such a high test accuracy (it has a low true positive and false positive rate) http://bl.ocks.org/tdhock/raw/886575874144c3b172ce6b7d7d770b9f/

** 10 Aug 2016

** 3 Aug 2016

[[file:figure-cDPA-PDPA-all.R]] visualizes the optimality and feasibility of the PDPA and cDPA models, and shows the interval counts in the PDPA [[http://bl.ocks.org/tdhock/raw/4582904f843cc60639fdfeb9651cac73/]]

** 12 May 2016

[[file:figure-cDPA-PDPA.R]] shows the difference between the cDPA and PDPA on real data: the cDPA recovers a sub-optimal solution that obeys the strict inequality peak constraint, and the PDPA recovers the optimal solution for the non-strict inequality peak constraint. http://bl.ocks.org/tdhock/raw/24aa6387901edab1577ce24f1e736ff3/

** 10 May 2016

** 4 May 2016