hildensia / bayesian_changepoint_detection

Methods to get the probability of a changepoint in a time series.
MIT License

Scaling of Data #34

Open stefan37 opened 2 years ago

stefan37 commented 2 years ago

Hi, I've noticed that the scaling of the data can affect the result, but I am not sure why it would, and I can't find any reason for it in the code or references. Below are the CP probabilities for the same data with and without a constant factor; they are somewhat different.

Are there some assumptions about the input data I am missing? Thanks

[Two plots: changepoint probabilities for the original and the rescaled data]

hildensia commented 2 years ago

The Student-t likelihood scales with the squared mean distance, which is non-linear w.r.t. data scaling.

https://github.com/hildensia/bayesian_changepoint_detection/blob/2dd95f5c1d028116899a842ccb3baa173f9d5be9/bayesian_changepoint_detection/offline_likelihoods.py#L138

Intuitively that also makes sense: after rescaling, the difference between your generative models changes, so the probability of them being the same or different should change too.
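To make the non-invariance concrete, here is a minimal sketch (not the library's code) of the standard conjugate result the Student-t likelihood corresponds to: the Gaussian marginal likelihood under a Normal-Inverse-Gamma prior with fixed hyperparameters. Because `beta0` fixes an absolute variance scale, the evidence for a split versus no split changes when the data is multiplied by a constant. All function names and hyperparameter defaults below are assumptions for illustration.

```python
import numpy as np
from math import lgamma, log, pi

def log_marginal(x, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Log marginal likelihood of a Gaussian segment under a
    Normal-Inverse-Gamma prior with fixed hyperparameters.
    beta0 sets an absolute variance scale, which is why the
    result is not invariant to rescaling the data."""
    n = len(x)
    xbar = x.mean()
    kappan = kappa0 + n
    alphan = alpha0 + n / 2.0
    betan = (beta0 + 0.5 * ((x - xbar) ** 2).sum()
             + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappan))
    return (-0.5 * n * log(2 * pi)
            + lgamma(alphan) - lgamma(alpha0)
            + alpha0 * log(beta0) - alphan * log(betan)
            + 0.5 * (log(kappa0) - log(kappan)))

def log_bayes_factor(x, split):
    # evidence for "two segments split at `split`" vs "one segment"
    return log_marginal(x[:split]) + log_marginal(x[split:]) - log_marginal(x)

rng = np.random.default_rng(0)
# a clear mean shift at index 50
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])

print(log_bayes_factor(x, 50))        # strong support for a change
print(log_bayes_factor(100 * x, 50))  # a different value: not scale-invariant
```

Both Bayes factors still favour the split here, but their magnitudes differ, which is why the changepoint probabilities in the plots above come out somewhat different.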

stefan37 commented 2 years ago

Thanks for the quick reply. What confuses me is that the scale is often arbitrary: there can be multiple ways to make some data dimensionless, yet they could yield vastly different results. My assumption so far was that I should always normalize over the entire time series. Is there some prior used in calculating the Student-t likelihood that I should keep in mind when scaling my data, or any other way to decide the scale?

hildensia commented 2 years ago

Good question. I believe that mean-centering your data is probably a good idea. But w.r.t. scaling I have to think a bit more. It probably has to do with an implicit prior somewhere, but I cannot pinpoint it right now.
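A pragmatic workaround in line with this suggestion is to standardize the whole series before running the detection, which removes both the mean and any constant factor. A sketch; `standardize` is a hypothetical preprocessing helper, not part of the library's API:

```python
import numpy as np

def standardize(x):
    """Mean-center and scale to unit variance over the full series.
    For any positive constant c, standardize(c * x) == standardize(x),
    so an arbitrary choice of units no longer affects the likelihoods."""
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, 200)

# the constant factor drops out entirely
print(np.allclose(standardize(x), standardize(37.5 * x)))  # True
```

This makes the result independent of the choice of units, though it does not answer which absolute scale best matches the implicit prior.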