kylessmith / linear_segment

Segmentation of linear data
GNU General Public License v2.0

Suspected off-by-one indexing issue with labeling interval endpoints #1

Closed mikewojnowicz closed 1 month ago

mikewojnowicz commented 2 months ago

Hi,

Let me define a changepoint as the index at which a new segment starts. Below I create a sequence x whose changepoints are at indices 0, 10, and 30 (using Python's zero-based indexing). I then fit CBS to it using your package.

import numpy as np 
from linear_segment import segment

T = 50
x = np.zeros(T)
x[10:30] = 1.0
labels = np.repeat("a", T)   # "a" is a dummy label
segments = segment(x, labels, method="cbs")
print(segments)

The return value is

LabeledIntervalArray
   (0-9, a)
   (9-30, a)
   (30-50, a)

Presumably there is a mistake in how the segment starts and ends are labeled? Based on both the ground truth and internal consistency, I would have expected the result to be

LabeledIntervalArray
   (0-9, a)
   (10-29, a)
   (30-50, a)

Is there perhaps a bug? I don't know how to code in C, so I didn't check the underlying source code.

mikewojnowicz commented 2 months ago

I was considering whether I could write a function to correct the returned values post hoc. However, I don't think that is possible. For example, running the following code

import numpy as np
from linear_segment import segment

T = 50
x = np.zeros(T)
x[10:20] = 1.0
x[30:40] = 1.0

labels = np.repeat("a", T)   # "a" is a dummy label
segments = segment(x, labels, method="cbs")
print(segments)

gives

LabeledIntervalArray
   (0-9, a)
   (9-19, a)
   (19-30, a)
   (30-40, a)
   (40-50, a)

rather than the expected

LabeledIntervalArray
   (0-9, a)
   (10-19, a)
   (20-29, a)
   (30-39, a)
   (40-50, a)

so the true changepoint seems to be sometimes given by the returned left endpoint, and other times by that value plus 1.
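
For reference, the ground-truth changepoints under the definition above (index of each new segment start) can be recovered directly from x with NumPy. This is only a minimal sketch for piecewise-constant, noise-free data like the example; the helper name `true_changepoints` is made up for illustration and is not part of linear_segment.

```python
import numpy as np

def true_changepoints(x):
    """Return the indices at which a new segment starts (index 0 is always included)."""
    x = np.asarray(x)
    # np.diff(x)[i] == x[i + 1] - x[i], so a nonzero entry at i means
    # a new segment starts at index i + 1.
    starts = np.flatnonzero(np.diff(x) != 0) + 1
    return np.concatenate(([0], starts))

T = 50
x = np.zeros(T)
x[10:20] = 1.0
x[30:40] = 1.0

cps = true_changepoints(x)
print(cps)  # [ 0 10 20 30 40]
```

Comparing these indices against the returned left endpoints makes the inconsistency easy to see: some of the reported starts match exactly and others are off by one.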

kylessmith commented 1 month ago

Thanks for pointing this out! Sorry for the late response. I have just pushed an update that will hopefully fix this (v1.2.0).

from linear_segment import segment
import numpy as np

# Create data
np.random.seed(10)
T = 50
x = np.zeros(T)
x[10:20] = 1.0
x[30:40] = 1.0

labels = np.repeat("a", T)  # "a" is a dummy label

# Calculate segments
segments = segment(x, labels, method="cbs", shuffles=200, p=0.05)
print(segments)

LabeledIntervalArray
   (0-10, a)
   (10-20, a)
   (20-30, a)
   (30-40, a)
   (40-50, a)

The endpoint is non-inclusive. This is to better match the slice indexing of the original array:

x[0:10]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
x[10:20]
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
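
One way to sanity-check half-open endpoints like these is to confirm that the intervals tile the array with no gaps or overlaps and that each slice is constant. The sketch below is an assumption-laden illustration: the intervals are copied by hand from the printed output as plain `(start, end)` tuples, and it does not touch the `LabeledIntervalArray` object or any other linear_segment API.

```python
import numpy as np

T = 50
x = np.zeros(T)
x[10:20] = 1.0
x[30:40] = 1.0

# Hypothetical: segments copied by hand from the printed output as
# (start, end) tuples, with `end` exclusive. Not read from the library.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]

# Each segment's end equals the next segment's start: no gaps, no overlaps.
contiguous = all(a[1] == b[0] for a, b in zip(intervals, intervals[1:]))

# Slicing with the half-open bounds yields segments that are each constant.
constant = all(bool(np.all(x[s:e] == x[s])) for s, e in intervals)

# Segment lengths add up to the length of the original array.
total = sum(e - s for s, e in intervals)

print(contiguous, constant, total == T)
```

With inclusive endpoints the same checks would need a `+ 1` in several places, which is where off-by-one mistakes tend to creep in.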

Let me know if the fix works.

mikewojnowicz commented 1 month ago

I pip installed v1.2.0 and the fix works on my end.