phausamann / sklearn-xarray

Metadata-aware machine learning.
http://bit.do/sklearn-xarray
BSD 3-Clause "New" or "Revised" License

Transformer providing a rolling window #40

Closed kmsquire closed 6 years ago

kmsquire commented 6 years ago

I have a dataset which has features and targets indexed by time.

I would like to provide overlapping (possibly subsampled) windows of the features as feature input to an ML algorithm.

I can certainly construct this by hand, but I'm wondering how to provide this windowed input without copying data, possibly via a Transformer.

Is this possible within the existing list of transformers? This isn't clear to me. If it is not possible, how easy would it be to add a transformer to handle this?

Edit: I guess I can try to wrap array.rolling, although it's still unclear to me (so far) how to provide this to a scikit-learn fit function.
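
For reference, the kind of windowing described above can be sketched with plain NumPy rather than with any sklearn-xarray transformer. This is only an illustration under assumptions: it requires NumPy 1.20 or later for `sliding_window_view`, the data and the names `X`, `y`, `window_len` and `step` are made up, and the targets are aligned with the last sample of each window. The windows themselves are a view on the original array, but flattening them into the 2-D shape that a scikit-learn `fit` expects does make a copy.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical data: 100 time steps, 10 features, one target per time step.
X = np.arange(1000, dtype=float).reshape(100, 10)
y = np.arange(100)

window_len, step = 3, 1  # illustrative values

# Overlapping windows along the time axis. The window axis is appended last,
# and the result is a read-only view on X (no data is copied here).
windows = sliding_window_view(X, window_len, axis=0)[::step]

# scikit-learn estimators expect 2-D input, so each window is flattened.
# This reshape cannot be expressed as a view and therefore copies the data.
X_flat = windows.reshape(len(windows), -1)

# Trailing alignment: each window is paired with the target of its last sample.
y_trailing = y[window_len - 1::step]
```

With these shapes, `X_flat` and `y_trailing` have the same number of rows and could be passed to an estimator's `fit` method.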

phausamann commented 6 years ago

There's the Segmenter in the preprocessing module that does exactly that, although it only works along one dimension so far. It should be easy to extend to more dimensions, though; I based my implementation (utils.segment_array) on this StackOverflow question.

Also, it should work without copying data when the return_view parameter of the transformer is set to true. However, this isn't working yet, and I have yet to investigate the reason.

If these limitations don't bother you, the transformer should do exactly what you want. Otherwise, at least the copying part is a priority for me that I want to fix soon, but feel free to look into it as well.

kmsquire commented 6 years ago

Great, thanks! It's close enough for now that I can work with it. Returning a view would be great, of course (although the default pip install version doesn't have that parameter yet).

One other thing I would like to do is shift the sample indices to center on (or trail) the index around which I'm grouping.

For example, if I set new_len=3, step=1 for a 100x10 DataArray (as from load_dummy_dataarray()), I'd like the resulting sample indices to go from 1 to 98 (or sometimes 2 to 99), so that they can be matched up with the corresponding y values.

pandas.DataFrame.rolling, for example, has a center keyword that accomplishes the first of these.

(Edit: Of course, I can set this manually for now.)
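
Setting the indices manually might look roughly like the sketch below. It only computes the index labels with NumPy and does not touch the Segmenter itself; `new_len` and `step` mirror the parameters mentioned above, while `n_samples`, `starts`, `centered` and `trailing` are hypothetical names chosen for illustration.

```python
import numpy as np

n_samples, new_len, step = 100, 3, 1

# Index of the first sample in each window: 0, 1, ..., 97 for these values.
starts = np.arange(0, n_samples - new_len + 1, step)

# Centered labels point at the middle sample of each window (1 to 98 here).
centered = starts + new_len // 2

# Trailing labels point at the last sample of each window (2 to 99 here).
trailing = starts + new_len - 1
```

Either array could then be assigned as the sample coordinate of the segmented data and used to select the matching y values.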

Just as a point of reference, one more thing that I sometimes want to do is subsample the new dimension (which is done easily enough with slicing and which does currently return a view).
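
As a small illustration of the slicing point, here is a sketch on a plain NumPy array. The array `segmented` is a hypothetical stand-in for segmented data, and placing the window dimension last is an assumption for this example, not necessarily the layout that sklearn-xarray produces.

```python
import numpy as np

# Hypothetical segmented array: 98 windows, 10 features, window length 6.
segmented = np.arange(98 * 10 * 6, dtype=float).reshape(98, 10, 6)

# Keep every second element along the window dimension.
# Basic slicing returns a view, so no data is copied.
subsampled = segmented[..., ::2]

# True when the two arrays share the same underlying buffer.
is_view = np.shares_memory(subsampled, segmented)
```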

If I get the chance, I'll try to submit a pull request with an example for the docs.