probcomp / trcrpm

Temporally-reweighted Chinese restaurant process mixture models for multivariate time series
Apache License 2.0

trcrpm #5

Closed nick-torenvliet closed 4 years ago

nick-torenvliet commented 4 years ago

I'm working with trcrpm on an industrial data set.

I am having trouble getting it to run on datasets much larger than those included in the tutorials.

For instance, running the basic tutorial on a time series much longer than 200 points fails with assertion errors like these:

[ ] 1.43%python2.7: cpp_code/src/State.cpp:267: double State::remove_feature(int, const std::vector<double>&): Assertion `abs(other_data_logp_delta - data_logp_delta) < 1E-6' failed.
[= ] 3.73%python2.7: cpp_code/src/State.cpp:267: double State::remove_feature(int, const std::vector<double>&): Assertion `abs(other_data_logp_delta - data_logp_delta) < 1E-6' failed.
[======= ] 24.24%python2.7: cpp_code/src/State.cpp:267: double State::remove_feature(int, const std::vector<double>&): Assertion `abs(other_data_logp_delta - data_logp_delta) < 1E-6' failed.

I am wondering if you have any tips to get this working a little more smoothly.

Thanks-

nick-torenvliet commented 4 years ago

I ended up getting the code to run reasonably well using the Dockerfile provided.

I am wondering, is there an example available with data updates - as in streaming time series data incoming?

fsaad commented 4 years ago

Hi @nick-torenvliet, thanks for the ticket.

For data updates, you can use incorporate. The index of the frame usually specifies the new timepoints, but can also include values at previous timepoints which were NaN. If you let me know the pattern of data you are trying to incorporate then I can suggest the corresponding invocations to the API.
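As a rough illustration of the two update patterns mentioned above (new timepoints versus filling in earlier NaN values), the sketch below sorts an incoming batch into those two groups before it would be handed to incorporate. It uses plain dictionaries instead of a pandas frame so that it runs on its own. `split_update`, `seen`, and `batch` are hypothetical names, and the usage described in the final comment is an assumption, not something checked against the library.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def split_update(seen, batch):
    """Split an incoming batch into new timepoints and backfills.

    seen  : dict mapping timepoint -> {variable: value} already given to the model
    batch : dict with the same layout holding the incoming observations
    Returns (new_rows, backfills). Cells that would overwrite an existing
    non-missing value are skipped.
    """
    new_rows, backfills = {}, {}
    for t, row in sorted(batch.items()):
        if t not in seen:
            # Timepoint never seen before: the whole row is new.
            new_rows[t] = dict(row)
            continue
        for var, value in row.items():
            if is_missing(value):
                continue
            # Only accept values for cells that were previously missing.
            if is_missing(seen[t].get(var)):
                backfills.setdefault(t, {})[var] = value
    return new_rows, backfills

# Hypothetical example data: two sensors, timepoint 1 has a missing temperature.
seen = {
    0: {"temp": 20.1, "flow": 3.2},
    1: {"temp": float("nan"), "flow": 3.3},
}
batch = {
    1: {"temp": 20.4, "flow": 9.9},  # temp fills a NaN; flow is already observed
    2: {"temp": 20.6, "flow": 3.1},  # entirely new timepoint
}
new_rows, backfills = split_update(seen, batch)
print(new_rows)
print(backfills)

# Assumed usage (not run here): build a DataFrame indexed by timepoint from
# new_rows and backfills, then pass it to the model's incorporate method.
```

Keeping the two groups separate makes it easy to log how much of each batch is genuinely new data and how much is late-arriving readings.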

nick-torenvliet commented 4 years ago

Hi @fsaad -

Let me describe my problem/data to you... I've got an industrial time series: thousands of sensors and devices delivering pressures, temperatures, rates of flow, valve positions (% open), etc., every two seconds. I have years of data to experiment with.

The system is mechanical and deterministic, with lots of causal dependencies... so real, objective "clusters" of behaviour exist, and actions have reactions that a human operator can predict and reason about, which leads me to believe there is a lot of structure for the math to find.

The data is basically stationary but with seasonality - ambient atmospheric conditions move the system means up and down as time progresses through the seasons.

We want to find anomalies in the data due to equipment malfunction. These happen infrequently, so labelled datasets for supervised learning are not practically available. Anomalies are also unpredictable, so no number of rules can adequately describe them. And given the number of systems, we don't have the time to "tune" anomaly detection.

The intention is to capture/flag anomalies by paying attention to the rate of new table assignments from the CRP.

So I have an incoming stream of data - say the temperature of a water flow - I want to pass that data into a model update and then look at the model properties to assess whether the data seems normal or anomalous.
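The idea of watching how often incoming points open a new table can be sketched without trcrpm at all. The toy monitor below is not the library's inference and does not use the temporally-reweighted prior: it is a greedy, one-dimensional CRP-style assignment with a fixed Gaussian likelihood, where an existing table k is weighted by its count n_k and a new table by alpha. `ToyCRPMonitor` and all of its parameters are hypothetical, and the sensor values are made up.

```python
import math
from collections import deque

def gaussian_logpdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

class ToyCRPMonitor(object):
    """Greedy 1-D CRP-style clustering with a rolling new-table rate.

    Toy stand-in only: fixed within-table std, greedy assignment, no resampling.
    """

    def __init__(self, alpha=1.0, table_std=1.0, base_mean=0.0, base_std=50.0, window=50):
        self.alpha = alpha
        self.table_std = table_std
        self.base_mean = base_mean
        self.base_std = base_std
        self.counts = []                    # number of points at each table
        self.sums = []                      # sum of the points at each table
        self.recent = deque(maxlen=window)  # 1 if a point opened a new table, else 0

    def update(self, x):
        """Assign x to its highest-scoring table; return the rolling new-table rate."""
        # Existing table k: log(n_k) + log-likelihood under the table's running mean.
        scores = [
            math.log(n) + gaussian_logpdf(x, s / n, self.table_std)
            for n, s in zip(self.counts, self.sums)
        ]
        # New table: log(alpha) + log-likelihood under a broad base distribution.
        # The shared CRP normaliser (n + alpha) is dropped; it does not change the argmax.
        scores.append(math.log(self.alpha) + gaussian_logpdf(x, self.base_mean, self.base_std))
        k = max(range(len(scores)), key=scores.__getitem__)
        is_new = k == len(self.counts)
        if is_new:
            self.counts.append(0)
            self.sums.append(0.0)
        self.counts[k] += 1
        self.sums[k] += x
        self.recent.append(1 if is_new else 0)
        return sum(self.recent) / float(len(self.recent))

# Made-up stream: a steady temperature around 20, then a run of unusual readings.
normal = [20.0 + 0.3 * math.sin(i) for i in range(200)]
faulty = [35.0, 50.0, 65.0, 80.0, 95.0]

monitor = ToyCRPMonitor()
rates = [monitor.update(x) for x in normal + faulty]
print("rate before fault:", rates[len(normal) - 1])
print("rate after fault:", rates[-1])
```

A jump in the rolling rate is the signal one would threshold on. In a real model the assignments come from posterior inference, so the rate would be an average over samples or chains instead of a single greedy pass.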

If you have any example code for, say, (semi-)real-time updating with financial series... that could be helpful. Also any papers or sources you know of would be great. BTW your published lit has been very helpful... so thanks for that.

Let me know if you want to take the conversation off-line,

Thanks!

nick-torenvliet commented 4 years ago

Hi @fsaad,

I'm wondering if you have an example that shows how to pull out the clustering information from a model, as was done with the Gapminder GDP dataset.
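To make the question concrete, here is a minimal sketch of one way clusters could be read off once a matrix of pairwise dependence probabilities between series is in hand. How to obtain that matrix from a fitted model is not shown and is exactly the open part of the question. `clusters_from_dependence`, the threshold, and the sensor names are hypothetical, and the numbers are made up.

```python
def clusters_from_dependence(names, dep_prob, threshold=0.5):
    """Group variables whose pairwise dependence probability exceeds threshold.

    names    : list of variable names
    dep_prob : square list of lists, dep_prob[i][j] in [0, 1]
    Returns a sorted list of sorted name lists, one per connected component
    of the thresholded dependence graph.
    """
    parent = list(range(len(names)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose dependence probability is above the threshold.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if dep_prob[i][j] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(sorted(group) for group in groups.values())

# Made-up example: two pumps move together, a valve tracks ambient temperature.
names = ["pump_a", "pump_b", "valve_c", "ambient"]
dep_prob = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
clusters = clusters_from_dependence(names, dep_prob)
print(clusters)
```

Connected components are the simplest choice here; hierarchical clustering on the same matrix would give a softer grouping.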

Thanks,

Nick