phausamann / sklearn-xarray

Metadata-aware machine learning.
http://bit.do/sklearn-xarray
BSD 3-Clause "New" or "Revised" License

Multidimensional sample #53

Closed. mmann1123 closed this 3 years ago

mmann1123 commented 3 years ago

Adds the ability to handle multidimensional samples by pre-stacking along those dims. Also tries to clarify the documentation (for the args I actually think I understand).

NOTES: I think this works well for DataArrays, but I'm not sure about Datasets since I don't know what they should look like. Passes tests, etc.

References issue #52.

phausamann commented 3 years ago

Thanks @mmann1123, although I would probably go another route with this. The docstring clarifications are definitely necessary, and I'd probably add a paragraph to the user guide explaining what sample and feature dimensions refer to.

In any case, the Featurizer is meant for cases where you have a single sample dimension and everything else is considered a feature dimension. For example, if you have raster data with dimensions (time, x, y, band) and you want to predict something across the whole image for each timestep, then time would be the sample dim and the rest would be your feature dims.
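
To make that concrete, here is a minimal sketch of the same reshaping in plain xarray. It does not use the Featurizer itself; the array sizes and the name `feature` for the stacked dimension are just assumptions for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical raster: 4 timesteps, 3 x 2 pixels, 5 bands.
raster = xr.DataArray(
    np.random.default_rng(0).random((4, 3, 2, 5)),
    dims=("time", "x", "y", "band"),
)

# Keep "time" as the sample dimension and collapse all remaining
# dimensions into one stacked "feature" dimension.
featurized = raster.stack(feature=("x", "y", "band")).transpose("time", "feature")

print(featurized.dims)   # ('time', 'feature')
print(featurized.shape)  # (4, 30)
```

Each row is then one timestep, and each of the 30 columns is one (x, y, band) combination.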

In your case, it's the other way around: there's only one feature dim (band) and the rest are sample dims. I would probably add a dedicated transformer (Samplerizer?) for that, or maybe create more general Stacker/Unstacker transformers.
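
Roughly, the stack/unstack round trip for that case could look like the sketch below. It uses only plain xarray calls, not an actual Stacker or Samplerizer; the array sizes and the name `sample` for the stacked dimension are assumptions.

```python
import numpy as np
import xarray as xr

# Hypothetical raster: 4 timesteps, 3 x 2 pixels, 5 bands.
raster = xr.DataArray(
    np.random.default_rng(0).random((4, 3, 2, 5)),
    dims=("time", "x", "y", "band"),
)

# "band" is the only feature dimension; stack every other dimension
# into a single "sample" dimension to get a (n_samples, n_features) array.
stacked = raster.stack(sample=("time", "x", "y")).transpose("sample", "band")
print(stacked.shape)  # (24, 5)

# Unstacking restores the original dimensions, so per-sample results
# can be mapped back onto the (time, x, y) grid.
restored = stacked.unstack("sample").transpose("time", "x", "y", "band")
print(restored.shape)  # (4, 3, 2, 5)
```

Here every pixel at every timestep is one sample with 5 band values as features, which is the 2D layout sklearn estimators expect.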

If you want to give that a try go ahead, otherwise I can also take care of it.

mmann1123 commented 3 years ago

Hey @phausamann, I can give it a try, although I have to say I'm still relatively new to Python, so I can't make any promises.

phausamann commented 3 years ago

Looks pretty good already. You should add a test in tests/test_preprocessing.py though, especially since you've come across a case where it doesn't work in a pipeline. You can probably use test_featurize as a guide for writing such a test. I'll also leave a couple of more specific remarks as review comments.

phausamann commented 3 years ago

I've added a slightly modified Stacker transformer to the develop branch (#54), so I'll be closing this PR if that's okay. I've also added a new section to the docs that explains sample and feature dimensions and how to pre-process xarray data such that it has the dimensionality that sklearn expects.