matrix-profile-foundation / matrixprofile

A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.
https://matrixprofile.org
Apache License 2.0

updated mpx formula #39

Closed kavj closed 3 years ago

kavj commented 4 years ago

This removes the use of the twisted factorization a b - c d = (1/2) ((a + c) (b - d) + (a - c) (b + d)) from the difference formulas. While that form requires slightly fewer memory accesses in the case of a self-join, it seems to fail in cases containing missing data. Further, the reduction step can sometimes make it difficult to tell where the result first diverges to a meaningful degree.
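For readers following along, a form of this identity that holds term by term is a b - c d = (1/2) ((a + c) (b - d) + (a - c) (b + d)). The sketch below only checks that algebra numerically on random finite inputs. The function names are made up for illustration and are not taken from the library, and the check says nothing about how either form behaves near missing data.

```python
import math
import random


def product_difference_direct(a, b, c, d):
    """Evaluate a*b - c*d directly."""
    return a * b - c * d


def product_difference_factored(a, b, c, d):
    """Evaluate the same quantity via the factored form
    (1/2) * ((a + c) * (b - d) + (a - c) * (b + d))."""
    return 0.5 * ((a + c) * (b - d) + (a - c) * (b + d))


def forms_agree(trials=1000, seed=0):
    """Compare both forms on random finite inputs (hypothetical helper)."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c, d = (rng.uniform(-10.0, 10.0) for _ in range(4))
        direct = product_difference_direct(a, b, c, d)
        factored = product_difference_factored(a, b, c, d)
        if not math.isclose(direct, factored, rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True
```

On well-scaled finite inputs the two agree to rounding error. The concern in this PR is about what happens once values underflow or are missing, which this check does not exercise.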

I will probably suggest a strategy for restarting calculations that border missing data regions at a later time. Separately, I haven't observed complete failure with this version. I suspect the other formulation sometimes added the product of an underflowing value and a large value, which is problematic here.

codecov[bot] commented 4 years ago

Codecov Report

Merging #39 into master will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #39   +/-   ##
=======================================
  Coverage   91.69%   91.69%           
=======================================
  Files          29       29           
  Lines        2155     2155           
=======================================
  Hits         1976     1976           
  Misses        179      179           

peterdhansen commented 4 years ago

I might suggest squashing commits and rebasing wrt master.

Which I can help with if needed.

kavj commented 4 years ago

> I might suggest squashing commits and rebasing wrt master.

I think I messed up the rebase last time. I would have preferred to put this on an experimental branch or in an experimental code section, but it's not clear that there is a place for that which would still ensure the tests are run. The change is meant to remove a source of observed cancellation issues.

A follow-up to this will add detection of bad data regions. That allows the problem to be sub-tiled across normal regions, where all data windows in a region admit a normalized representation.
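As a rough illustration of what splitting into normal regions could look like, here is a minimal sketch. It assumes a window "admits a normalized representation" when every value in it is finite and the window is not constant; that reading, the function name `normalizable_runs`, and the simple O(n * w) scan are all assumptions for illustration, not the planned implementation.

```python
import math


def normalizable_runs(ts, w):
    """Return half-open (start, stop) ranges of window start indices.

    Every length-w window starting inside a returned range contains only
    finite values and is not constant, so it can be mean-centered and
    scaled to unit norm.
    """
    runs = []
    run_start = None
    n_windows = len(ts) - w + 1
    for i in range(n_windows):
        window = ts[i:i + w]
        # The min/max test only runs when every value is finite.
        ok = all(math.isfinite(x) for x in window) and max(window) != min(window)
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            runs.append((run_start, i))
            run_start = None
    if run_start is not None:
        runs.append((run_start, n_windows))
    return runs
```

Each returned range could then be treated as an independent tile, with the difference equations restarted at the beginning of the tile rather than carried across the bad region.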

I may also suggest an update to the mean and inverse norm calculations to further avoid propagating any rounding across windows. It's quite difficult to obtain optimal reliability with look-ahead methods.
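One way to keep rounding from carrying over between windows is to compute each window's statistics from scratch instead of updating them from the previous window. The sketch below does that with `math.fsum`; it is an assumption about the general idea rather than the update being proposed, it costs O(n * w), and the name `window_mean_and_inv_norm` is hypothetical.

```python
import math


def window_mean_and_inv_norm(ts, w):
    """Compute the mean and inverse centered norm of each length-w window.

    Each window is processed independently, so no rounding error is
    carried from one window to the next.
    """
    means = []
    inv_norms = []
    for i in range(len(ts) - w + 1):
        window = ts[i:i + w]
        mu = math.fsum(window) / w
        norm = math.sqrt(math.fsum((x - mu) ** 2 for x in window))
        means.append(mu)
        # Constant windows and windows containing NaN have no usable norm.
        inv_norms.append(1.0 / norm if norm > 0 else math.nan)
    return means, inv_norms
```

For example, `window_mean_and_inv_norm([1.0, 2.0, 3.0, 4.0], 2)` gives means of 1.5, 2.5 and 3.5. A rolling update would be cheaper, at the cost of letting error from one window leak into the next.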

tylerwmarrs commented 3 years ago

@kavj Have you taken the time to implement the "ab-join" and parallel logic? Where does this code stand overall?

kavj commented 3 years ago

> @kavj Have you taken the time to implement the "ab-join" and parallel logic? Where does this code stand overall?

This was on the master branch. I'm not sure that's the place for it.

Ideally this would be factored out to work with this section and the streaming section.