Closed LuYueee closed 4 years ago
We need to have a discussion about how this fits into the API. This is all code that I initially developed and asked Frankie @frankiecancino to work on. Let's discuss this via discord or on Tuesday evening.
Will you please add tests for every function in preprocess? A user could make use of impute directly requiring its own validation etc.
We need to discuss additional arguments for compute and analyze functions. Compute probably needs to accept a dict of preprocessing arguments or None to avoid preprocessing. This will also carry over in analyze. The annotation vector tests are failing because we are forcing the preprocessing to run through compute every time which is fine to do by default, however there should be a toggle to not run preprocessing procedure.
We need to discuss additional arguments for compute and analyze functions. Compute probably needs to accept a dict of preprocessing arguments or None to avoid preprocessing. This will also carry over in analyze. The annotation vector tests are failing because we are forcing the preprocessing to run through compute every time which is fine to do by default, however there should be a toggle to not run preprocessing procedure.
Ok. I'll try to add arguments to the compute and analyze functions.
Here are some additional suggestions to make the code a little cleaner and provide expected results to existing users:
# in preprocess.py
def validate_preprocess_kwargs(kwargs):
# test the arguments and raise any errors
return valid_args
# in compute.py and analyze.py
...
preprocess_kwargs = validate_preprocess_kwargs(preprocess_kwargs)
...
Another option for imputation could be to use the current window instead of only backward and forward. This can be a little tricky depending on how many values are missing in the window. We could probably have some logic to test what proportion of data is missing and default to a backward or forward approach when it makes sense. For example, use the current window when 50% or more of the current window has values. I'm not sure about the proportion that makes sense here. It could be an additional argument or two? What do you think?
In my original implementation that you took over, I used the current window for imputation by default with the intention of using backward or forward filling only when some proportion of data was missing in the current window.
Another option for imputation could be to use the current window instead of only backward and forward. This can be a little tricky depending on how many values are missing in the window. We could probably have some logic to test what proportion of data is missing and default to a backward or forward approach when it makes sense. For example, use the current window when 50% or more of the current window has values. I'm not sure about the proportion that makes sense here. It could be an additional argument or two? What do you think?
In my original implementation that you took over, I used the current window for imputation by default with the intention of using backward or forward filling only when some proportion of data was missing in the current window.
I suppose I used the current window for imputation. The 'forward' and 'backward' parameters are just to set the sliding direction of the current window. Besides, when 75% of the current window has value, it will trigger the data imputation procedure. (75% because window = 4 by default; the proportion is 80% when window = 5)
The only difference between this and your original implementation is that I update 'nan_infs' every time the ‘temp' array is filled with value(s). I noticed that the original implementation would frequently discard the values calculated in the previous loop, so I added one line of code (line 197/line 212) to the original implementation to make sure that these calculated values are used for the final result and avoid redundant calculations. I think this simple method would reduce the computation load as well as the frequency of register value updates, and more importantly, it can naturally take care of the "all missing" condition. It seems to me that if we take extra steps to deal with the "all missing" condition, we may need additional data structures or iterations or nested loops, which might increase the overhead of CPU/memory.
Before I submitted this PR, I discussed this method with Andrew and Frankie and showed them the final output for the preprocess function. However, if you think this method may not be acceptable, we can discuss it tomorrow.
Merging #46 into master will increase coverage by
0.39%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## master #46 +/- ##
==========================================
+ Coverage 91.79% 92.18% +0.39%
==========================================
Files 29 30 +1
Lines 2205 2316 +111
==========================================
+ Hits 2024 2135 +111
Misses 181 181
Impacted Files | Coverage Δ | |
---|---|---|
matrixprofile/analyze.py | 96.82% <100.00%> (+0.27%) |
:arrow_up: |
matrixprofile/compute.py | 93.02% <100.00%> (+0.91%) |
:arrow_up: |
matrixprofile/preprocess.py | 100.00% <100.00%> (ø) |
|
matrixprofile/visualize.py | 96.37% <100.00%> (ø) |
|
matrixprofile/algorithms/snippets.py | 100.00% <0.00%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 40cdceb...b1a9704. Read the comment docs.
Yes, I understand the forward and backward approach for the sliding window. My comment in the original implementation was suggesting cases where the entire window is empty. In this case, you could use the previous window's values or the next window's values. This can be pretty tricky in the long run and should probably be avoided. The approach you have is a pretty good starting point. I'll take some time to suggest additional changes.
Integrating the data imputation into algorithms.