matrix-profile-foundation / matrixprofile

A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.
https://matrixprofile.org
Apache License 2.0
360 stars 62 forks source link

Handling missing data #46

Closed LuYueee closed 4 years ago

LuYueee commented 4 years ago

Integrating the data imputation into algorithms.

tylerwmarrs commented 4 years ago

We need to have a discussion about how this fits into the API. This is all code that I initially developed and asked Frankie @frankiecancino to work on. Let's discuss this via discord or on Tuesday evening.

tylerwmarrs commented 4 years ago

Will you please add tests for every function in preprocess? A user could make use of impute directly requiring its own validation etc.

tylerwmarrs commented 4 years ago

We need to discuss additional arguments for compute and analyze functions. Compute probably needs to accept a dict of preprocessing arguments or None to avoid preprocessing. This will also carry over in analyze. The annotation vector tests are failing because we are forcing the preprocessing to run through compute every time which is fine to do by default, however there should be a toggle to not run preprocessing procedure.

LuYueee commented 4 years ago

We need to discuss additional arguments for compute and analyze functions. Compute probably needs to accept a dict of preprocessing arguments or None to avoid preprocessing. This will also carry over in analyze. The annotation vector tests are failing because we are forcing the preprocessing to run through compute every time which is fine to do by default, however there should be a toggle to not run preprocessing procedure.

Ok. I'll try to add arguments to the compute and analyze functions.

tylerwmarrs commented 4 years ago

Here are some additional suggestions to make the code a little cleaner and provide expected results to existing users:

  1. To make the code less redundant, the preprocess argument parsing can be placed in a function within the preprocess module. Something like the following:
# in preprocess.py
def validate_preprocess_kwargs(kwargs):
    # test the arguments and raise any errors
    return valid_args

# in compute.py and analyze.py
...
preprocess_kwargs = validate_preprocess_kwargs(preprocess_kwargs)
...
  1. In the preprocess unit tests test the validation function. Ensure that None does nothing and adjusting each argument does what you expect.
  2. Add test in compute and analyze tests to test the functionality respectively.
  3. Rename the argument in compute and analyze from preprocessing_args to preprocessing_kwargs. There is a subtle difference in meaning between args and kwargs. Args tend to mean "ordered arguments" while kwargs mean "keyword arguments" specifically in Python.
  4. To keep existing users from having differing answers on package upgrade (ie 1.1.4 to 1.1.5), we need to set the default for preprocessing_kwargs to None. It is a pretty significant change and assumption to make about an individual's data. Take for example how the annotation vector unit tests are written and are currently failing because of forcing the imputation and constant value handling.
tylerwmarrs commented 4 years ago

Another option for imputation could be to use the current window instead of only backward and forward. This can be a little tricky depending on how many values are missing in the window. We could probably have some logic to test what proportion of data is missing and default to a backward or forward approach when it makes sense. For example, use the current window when 50% or more of the current window has values. I'm not sure about the proportion that makes sense here. It could be an additional argument or two? What do you think?

In my original implementation that you took over, I used the current window for imputation by default with the intention of using backward or forward filling only when some proportion of data was missing in the current window.

LuYueee commented 4 years ago

Another option for imputation could be to use the current window instead of only backward and forward. This can be a little tricky depending on how many values are missing in the window. We could probably have some logic to test what proportion of data is missing and default to a backward or forward approach when it makes sense. For example, use the current window when 50% or more of the current window has values. I'm not sure about the proportion that makes sense here. It could be an additional argument or two? What do you think?

In my original implementation that you took over, I used the current window for imputation by default with the intention of using backward or forward filling only when some proportion of data was missing in the current window.

I suppose I used the current window for imputation. The 'forward' and 'backward' parameters are just to set the sliding direction of the current window. Besides, when 75% of the current window has value, it will trigger the data imputation procedure. (75% because window = 4 by default; the proportion is 80% when window = 5)

The only difference between this and your original implementation is that I update 'nan_infs' every time the ‘temp' array is filled with value(s). I noticed that the original implementation would frequently discard the values calculated in the previous loop, so I added one line of code (line 197/line 212) to the original implementation to make sure that these calculated values are used for the final result and avoid redundant calculations. I think this simple method would reduce the computation load as well as the frequency of register value updates, and more importantly, it can naturally take care of the "all missing" condition. It seems to me that if we take extra steps to deal with the "all missing" condition, we may need additional data structures or iterations or nested loops, which might increase the overhead of CPU/memory.

Before I submitted this PR, I discussed this method with Andrew and Frankie and showed them the final output for the preprocess function. However, if you think this method may not be acceptable, we can discuss it tomorrow.

codecov[bot] commented 4 years ago

Codecov Report

Merging #46 into master will increase coverage by 0.39%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #46      +/-   ##
==========================================
+ Coverage   91.79%   92.18%   +0.39%     
==========================================
  Files          29       30       +1     
  Lines        2205     2316     +111     
==========================================
+ Hits         2024     2135     +111     
  Misses        181      181              
Impacted Files Coverage Δ
matrixprofile/analyze.py 96.82% <100.00%> (+0.27%) :arrow_up:
matrixprofile/compute.py 93.02% <100.00%> (+0.91%) :arrow_up:
matrixprofile/preprocess.py 100.00% <100.00%> (ø)
matrixprofile/visualize.py 96.37% <100.00%> (ø)
matrixprofile/algorithms/snippets.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 40cdceb...b1a9704. Read the comment docs.

tylerwmarrs commented 4 years ago

Yes, I understand the forward and backward approach for the sliding window. My comment in the original implementation was suggesting cases where the entire window is empty. In this case, you could use the previous window's values or the next window's values. This can be pretty tricky in the long run and should probably be avoided. The approach you have is a pretty good starting point. I'll take some time to suggest additional changes.