This repository contains the source code for our papers describing log-linear time algorithms for solving the PeakSeg constrained changepoint detection problem.

** [[https://arxiv.org/abs/1703.03352][arXiv:1703.03352]] A log-linear segmentation algorithm for peak detection in genomic data

Updated pre-print: [[file:jmlr-paper.pdf]], [[file:jmlr-supplementary.pdf]]

The labeled data we used to compute peak detection accuracy are http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/

The fold IDs that we used in our four-fold cross-validation experiment to compute test AUC/accuracy are listed in http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/4foldcv-test-folds.csv

To recompute the figures from the arXiv paper, execute the code in the following R scripts.

LaTeX source: [[file:arxiv-paper.tex]]
Figure 1: [[file:figure-compare-unconstrained.R]]
Figure 2: [[file:figure-2-min-envelope.R]]
Figure 3: [[file:figure-PDPA-intervals.R]] and [[file:figure-PDPA-timings.R]], =make figure-PDPA-interals.png= and =make figure-PDPA-timings.pdf=
Figure 4: [[file:figure-test-error-dots.R]], =make figure-test-error-dots.pdf=

To recompute other figures:

[[file:figure-data-models.R]]
[[file:figure-good-bad.R]]

Note that the [[file:Makefile]] shows how to re-make some intermediate RData files based on the [[http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/][ChIP-seq benchmark data]].

** [[https://arxiv.org/abs/1810.00117][arXiv:1810.00117]] Generalized Functional Pruning Optimal Partitioning (GFPOP) for Constrained Changepoint Detection in Genomic Data

The [[https://github.com/tdhock/feature-learning-benchmark][benchmark data we used in this paper]] are much larger than the previous paper. To re-compile the figures and paper you can type "make jss-paper.pdf"

Sweave/LaTeX source: [[file:jss-paper.Rnw]]
Figures 3 (concave G function we maximize to find the model with P peaks) and 12 (time complexity of computing sqrt N peaks): [[file:jss-figure-evaluations.R]]
Figure 4 (PeakSegFPOP can be used to compute more likely models than the default MACS2 model): [[file:jss-figure-more-likely-models.R]]
Figure 5 (up-down constrained changepoint model is robust to spatial correlation): [[file:jss-figure-spatial-correlation.R]]
Figure 6 (number of intervals is log N and time/space complexity is N log N): [[file:jss-figure-target-intervals-models.R]]
Figure 7 (disk-based storage is a constant factor slower than memory-based storage, about 2x): [[file:jss-figure-disk-memory-compare-speed.R]]
Figure 8 (number of DP iterations using OP is much lower than SN to compute a large number of peaks): [[file:jss-figure-variable-peaks.R]]
Figure 9 (number of GFPOP calls depends on maximum number of peaks): [[file:jss-figure-more-evals.R]]
Figure 10 (label error): [[file:jss-figure-label-error.R]]
Figure 11 (N genomic data have sqrt N peaks): [[file:jss-figure-data-peaks.R]]

** TODOs

add figure files, [[file:figure-good-bad.R]]
new slides showing better peak detection specific examples [[file:figure-min-train-error.R]]
big model labels in [[file:figure-test-error-dots.R]]

** 12 Aug 2022

[[file:2022-06-20-paris-time-complexity.tex]] makes [[file:2022-06-20-paris-time-complexity.pdf]] slides with summary of results from three recent papers for talk in Paris.

Updated: [[file:2022-09-28-tucson-slides.tex]] makes [[file:2022-09-28-tucson-slides.pdf]]

Title: Time complexity analysis of recently proposed algorithms for optimal changepoint detection

Slides: https://github.com/tdhock/PeakSegFPOP-paper/blob/master/2022-09-28-tucson-slides.pdf

Abstract: Several dynamic programming algorithms have been recently proposed for efficiently solving various problems related to optimal changepoint detection in large data sequences measured over space or time. In this talk we will discuss three recent papers which each provide a theoretical or empirical time complexity analysis. Our time complexity analysis shows that these linear and log-linear time algorithms can be used for analysis of huge data sequences, with ten million observations or more, which are now common in areas such as genomics.

** 24 July 2019 [[file:jss-figure-more-evals.R]] creates

[[file:jss-figure-more-evals.png]]

** 2 July 2019

[[file:2019-useR-slides.Rnw]] for useR conf.

** 22 July 2019 [[file:jss.more.evals.R]] attempts to answer the question from a reviewer: does the number of GFPOP evals depend on the max number of peaks?

** 18 Oct 2018

new scheme for [[file:PDPA.infeasible.R]]: for background segments with the same mean either before or after, join peaks before and after. Before this file was using the small peaks in these infeasible models. Contrast with [[file:PDPA.peaks.R]] which simply ignores infeasible models.
new file [[file:PDPA.infeasible.error.compare.R]] which compares the label error of the two PDPA models with the CDPA model. Theoretically the PDPA.infeasible model with new join scheme should have more true positives and false positives than the PDPA.peaks model which simply ignores infeasible models. This file also shows that it reduces the min train error to be similar with CDPA.

** 28 Mar 2017

[[file:PDPA.timings.small.R]] runs the PeakSegPDPA algorithm for each data set in the [[http://members.cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/][McGill ChIP-seq benchmark]], and only saves info for the model up to the last data point. In constrast, the original [[file:PDPA.timings.R]] creates much larger PDPA.model.RData files, since it stores cost/data/intervals for all data points.
[[file:PDPA.targets.R]] computes target intervals for the PeakSegPDPA models, and [[file:problem.features.R]] computes features. ** 7 Feb 2017

Ideas about how to define a state graph for a solver of the Optimal Partitioning problem with affine constraints between adjacent segment means.

+BEGIN_SRC R

library(data.table)

Define an affine constraint function

g(from, to) = from.coeffrom + to.coefto + constant

to be used as g(from, to) <= 0,

e.g. from <= to is coded as 1from -1to + 0 <= 0.

affine.constraint <- function(from.coef, to.coef, constant){ data.table(from.coef, to.coef, constant) }

no.constraint <- affine.constraint(0, 0, 0) non.increasing <- affine.constraint(-1, 1, 0) non.decreasing <- affine.constraint(1, -1, 0)

state <- function(state.name){ structure(data.table(state.name), type="state") }

change <- function(from, to, constraint, penalty=NA){ structure(data.table(from, to, penalty, constraint), type="change") }

loss <- function(loss.name){ structure(data.table(loss.name), type="loss") }

start <- function(...){ structure(data.table(state.name=c(...)), type="start") }

end <- function(...){ structure(data.table(state.name=c(...)), type="end") }

unconstrained <- list( state("anything"), change("anything", "anything", no.constraint))

unconstrained.Gaussian <- c(unconstrained, list( loss("Gaussian")))

unconstrained.Poisson <- c(unconstrained, list( loss("Poisson")))

PeakSegFPOP <- list( loss("Poisson"), state("peak"), state("background"), start("background"), change("background", "peak", non.decreasing, penalty=0), change("peak", "background", non.increasing), end("background"))

PeakSegFPOP.start.or.end.up <- list( loss("Poisson"), state("peak"), state("background"), change("background", "peak", non.decreasing), change("peak", "background", penalty, non.increasing))

reduced.isotonic.regression <- list( state("anything"), change("anything", "anything", non.decreasing))

unimodal.regression <- list( state("can.change.up.or.down"), state("can.change.down"), change("can.change.up.or.down", "can.change.up.or.down", non.decreasing), change("can.change.up.or.down", "can.change.down", non.increasing), change("can.change.down", "can.change.down", non.increasing))

unimodal.at.least.one.up <- c(unimodal.regression, list( state("start"), start("start"), change("start", "can.change.up.or.down", non.decreasing)))

unimodal.at.least.one.up.and.down <- c(unimodal.at.least.one.up, list( end("can.change.down")))

checkModel <- function(model.list){ type.vec <- sapply(model.list, attr, "type") model.info <- sapply(unique(type.vec), function(type){ do.call(rbind, model.list[type.vec==type]) })

TODO error checking.

model.info

} checkModel(unimodal.at.least.one.up.and.down) checkModel(PeakSegFPOP)

TODO functions for plotting, solving.

GFPOP(model, data.vec, weight.vec, penalty)

+END_SRC

26 Jan 2017 Guillaume's group meeting presentation slides http://members.cbio.mines-paristech.fr/~thocking/HOCKING-PeakSegFPOP-pipeline-slides.pdf 10 Nov 2016

[[http://bl.ocks.org/tdhock/raw/9311ca39d643d127e04a088814c81ee1/][Data viz with smooth transitions, clarified titles]].

[[http://bl.ocks.org/tdhock/raw/7b595e74d059eb2e066d46a90c5b7724/][Revised interactive data viz]].

** 9 Nov 2016

[[http://bl.ocks.org/tdhock/raw/9a6ac163b8610314ed8e9751937ecea9/][Interactive data viz to explain supervised penalty learning for peaks]].

** 15 Aug 2016

Test accuracy and AUC data viz, explains why Segmentor gets such a high test accuracy (it has a low true positive and false positive rate) http://bl.ocks.org/tdhock/raw/886575874144c3b172ce6b7d7d770b9f/

** 10 Aug 2016

[[http://cbio.ensmp.fr/~thocking/HOCKING-PeakSeg-functional-pruning-slides.pdf][Slides]] for group meeting presentation 11 Aug 2016.
http://bl.ocks.org/tdhock/raw/b796b4be10aa431575bb01ec16035b23/ shows min env in addition to min/less more computation.

** 3 Aug 2016

C++ algo implemented in [[https://github.com/tdhock/coseg][coseg]] package.
[[file:figure-PeakSegPDPA-demo.R]] created http://bl.ocks.org/tdhock/raw/8c5dd0af533e24a893e7c5232f9bc94c/ using average loss instead of total loss. ** 13 May 2016

[[file:figure-cDPA-PDPA-all.R]] visualizes the optimality and feasibility of the PDPA and cDPA models, and shows the interval counts in the PDPA [[http://bl.ocks.org/tdhock/raw/4582904f843cc60639fdfeb9651cac73/]]

** 12 May 2016

[[file:figure-cDPA-PDPA.R]] shows the difference between the cDPA and PDPA on real data: the cDPA recovers a sub-optimal solution that obeys the strict inequality peak constraint, and the PDPA recovers the optimal solution for the non-strict inequality peak constraint. http://bl.ocks.org/tdhock/raw/24aa6387901edab1577ce24f1e736ff3/

** 10 May 2016

[[file:figure-constrained-PDPA-normal-real.R]] makes http://cbio.ensmp.fr/~thocking/figure-constrained-PDPA-normal-real/ a data viz which shows the constrained algorithm up to 5 segments for a data set with 121 points.

** 4 May 2016

[[file:figure-constrained-PDPA-normal-panels.R]] implements the constrained PDPA algo with two kinds of min-less/min-more operators, inspired by two kinds of inequality constraints (strict and not). Visualization of running the algos up to 3 segments on 5 data sets with 4 data points each: [[http://bl.ocks.org/tdhock/raw/e924d180dda5d0cd1da8e8f556e741b7/]]
[[file:figure-unconstrained-PDPA-normal.R]] implements the unconstrained PDPA and visualizes the functional cost model and pruning [[http://cbio.ensmp.fr/~thocking/figure-unconstrained-PDPA-normal-big/]]

tdhock / PeakSegFPOP-paper

readme