lamtung16 / ML_ChangepointDetection


Paper Revision #10

Open lamtung16 opened 1 month ago

lamtung16 commented 1 month ago

@tdhock this is the outline of my new paper. Can you give me some feedback?

Learning Penalty Parameters for Optimal Partitioning via Automatic Feature Extraction

Abstract

Changepoint detection identifies significant shifts in data sequences and is important in fields such as finance, genomics, and medicine. The Optimal Partitioning (OPART) algorithm locates these changes within a sequence and uses a penalty parameter to control the number of detected changepoints. Existing methods manually extract statistical features from each sequence to form a feature vector, which a predictive model then uses to estimate the penalty value. This study introduces a novel approach that learns the penalty parameter directly from the sequences, using recurrent neural networks to automatically extract the features relevant to predicting the penalty.
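To illustrate how the penalty controls the number of detected changepoints, here is a minimal sketch of penalized optimal partitioning. It assumes a squared-error loss around each segment mean, which this outline does not specify, and the function name `optimal_partitioning` is hypothetical. It uses the plain O(n²) dynamic program, so it is only meant for short sequences.

```python
import numpy as np


def optimal_partitioning(signal, penalty):
    """Return changepoint positions minimizing
    (sum of per-segment squared errors) + penalty * (number of changepoints).

    Assumption: squared-error loss around each segment mean.
    A changepoint at position s means a new segment starts at index s.
    """
    x = np.asarray(signal, dtype=float)
    n = len(x)
    # Prefix sums let us compute any segment cost in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

    def seg_cost(i, j):
        # Squared error of x[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = np.full(n + 1, np.inf)
    best[0] = -penalty  # offsets the penalty added for the first segment
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):
            cost = best[s] + seg_cost(s, t) + penalty
            if cost < best[t]:
                best[t] = cost
                last[t] = s

    # Backtrack through the stored segment starts.
    changepoints = []
    t = n
    while t > 0:
        s = int(last[t])
        if s > 0:
            changepoints.append(s)
        t = s
    return sorted(changepoints)


signal = [0.0] * 5 + [10.0] * 5
print(optimal_partitioning(signal, penalty=1.0))     # small penalty keeps the change
print(optimal_partitioning(signal, penalty=1000.0))  # large penalty removes it
```

A larger penalty yields fewer changepoints, which is why predicting a good penalty value per sequence matters.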

Introduction

Novelty

Experiments


tdhock commented 1 month ago

great

lamtung16 commented 1 month ago

Hi @tdhock, here is an update on my study:

  1. So far, I implemented:

summary:

  1. I have been trying to run experiments on more datasets (17 sub-datasets in https://archive.ics.uci.edu/dataset/439/chipseq)
    • One problem: these sequences are very long (some are longer than 11 million), so training an RNN on them takes far too long to be practical.
    • Idea: compress the sequences (e.g., take the mean, the median, or a random element of each 1000-length segment, so the compressed sequence is about 1000 times shorter than the original), then apply RNN, LSTM, and GRU to the compressed sequences.
tdhock commented 4 weeks ago

yes the GRU results look good.

yes the sequences in UCI chipseq are very large.

yes you can try "compressing sequences" which I think would be the same as doing pooling with window size 1000 right?
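A minimal sketch of that pooling idea, assuming NumPy and non-overlapping windows. The function name `pool_sequence` and the choice to reduce a shorter trailing window on its own are assumptions for illustration, not something settled in this thread.

```python
import numpy as np


def pool_sequence(sequence, window=1000, reducer=np.mean):
    """Compress a 1-D sequence by reducing each non-overlapping window to one value.

    reducer can be np.mean or np.median (any function accepting `axis`).
    Assumption: a shorter trailing window, if any, is reduced on its own.
    """
    x = np.asarray(sequence, dtype=float)
    n_full = len(x) // window
    if n_full > 0:
        # Reshape the full windows into rows and reduce each row.
        pooled = reducer(x[: n_full * window].reshape(n_full, window), axis=1)
    else:
        pooled = np.empty(0)
    remainder = x[n_full * window:]
    if remainder.size > 0:
        pooled = np.append(pooled, reducer(remainder, axis=0))
    return pooled


x = np.arange(2500)
print(pool_sequence(x, window=1000))                     # mean pooling
print(pool_sequence(x, window=1000, reducer=np.median))  # median pooling
```

With window size 1000, a sequence of about 11.5 million values shrinks to about 11.5 thousand, which is a much more manageable length for RNN, LSTM, or GRU training.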

lamtung16 commented 3 weeks ago

@tdhock some updates today:

| Dataset | Min Seq Length | Max Seq Length | Min Value | Max Value | Mean Variance | Non-Inf Min Low Limit | Non-Inf Max Upper Limit |
|---|---|---|---|---|---|---|---|
| cancer (1) | 39 | 43628 | -6.41 | 0.075 | 0.063 | -5.75 | 6.9 |
| detailed (1) | 25 | 5937 | -7.67 | 9.87 | 0.029 | -4.97 | 6.19 |
| systematic (1) | 66 | 5937 | -7.67 | 9.87 | 0.027 | -4.84 | 6.19 |
| chipseq (17) | 275 | 11499958 | 0 | 31488 | 11270.19 | 5.44 | 20.09 |

Achievement

Problem