Closed mmann1123 closed 3 years ago
Thanks @mmann1123, although I would probably go another route with this. The docstring clarifications are definitely necessary, and I'd probably add a paragraph to the user guide explaining what sample and feature dimensions refer to.
In any case, the Featurizer is used when you have a single sample dimension and everything else is considered a feature dimension, e.g. if you have raster data with dimensions (time, x, y, band) and you want to predict something across the whole image for each timestep then time would be the sample dim and the rest would be your feature dims.
In your case, it's the other way around, there's only one feature dim (band) and the rest are sample dims. I would probably add a dedicated transformer (Samplerizer?) for that, or maybe create more general Stacker/Unstacker transformers.
If you want to give that a try go ahead, otherwise I can also take care of it.
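To make the two layouts concrete, here is a minimal sketch using plain xarray rather than any sklearn-xarray transformer. The array sizes and the stacked dimension names `feature` and `sample` are assumptions chosen for illustration only.

```python
import numpy as np
import xarray as xr

# Toy raster: 3 timesteps, 4 x 5 pixels, 2 bands (sizes are arbitrary).
da = xr.DataArray(
    np.random.default_rng(0).random((3, 4, 5, 2)),
    dims=("time", "x", "y", "band"),
)

# Case 1: a single sample dim (time); everything else is stacked into features.
per_timestep = da.stack(feature=("x", "y", "band"))

# Case 2: a single feature dim (band); everything else is stacked into samples.
per_pixel = da.stack(sample=("time", "x", "y")).transpose("sample", "band")

print(per_timestep.dims, per_timestep.shape)
print(per_pixel.dims, per_pixel.shape)
```

Either result is a 2D array with samples along the first axis and features along the second, which is the layout sklearn estimators expect.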
Hey @phausamann, I can give it a try, although I have to say I am still relatively new to Python, so I can't make any promises.
Looks pretty good already. You should add a test in tests/test_preprocessing.py though, especially since you've come across a case where it doesn't work in a pipeline. You can probably use test_featurize as a guide for writing such a test. I'll also leave a couple of more specific remarks as review comments.
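As a rough starting point, such a test could first pin down the expected round trip with plain xarray. The function name `test_stack_roundtrip` and the array sizes are hypothetical, and the `stack`/`unstack` calls stand in for the transformer from this PR, whose exact interface is not shown here.

```python
import numpy as np
import xarray as xr


def test_stack_roundtrip():
    # Hypothetical test: stacking sample dims and unstacking again
    # should reproduce the original array.
    da = xr.DataArray(
        np.arange(3 * 4 * 5 * 2, dtype=float).reshape(3, 4, 5, 2),
        dims=("time", "x", "y", "band"),
    )

    # Collapse all non-band dims into one sample dim: (sample, band).
    stacked = da.stack(sample=("time", "x", "y")).transpose("sample", "band")
    assert stacked.shape == (60, 2)

    # Undo the stacking and restore the original dimension order.
    restored = stacked.unstack("sample").transpose("time", "x", "y", "band")
    np.testing.assert_array_equal(restored.values, da.values)
```

A pipeline variant of the same check would be the natural next step, since that is where the failure was observed.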
I've added a slightly modified Stacker transformer to the develop branch (#54), so I'll be closing this PR if that's okay. I've also added a new section to the docs that explains sample and feature dimensions and how to pre-process xarray data such that it has the dimensionality that sklearn expects.
Adding the ability to handle multidimensional samples by pre-stacking along those dims. Also trying to clarify the documentation (for the arg I actually think I understand).
NOTES: I think this is working well for DataArrays, but I'm not sure about Datasets since I don't know what they should look like. Passes tests, etc.
referencing issue #52
Make sure you have:
doc/content/whatsnew.rst