pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Suggestion: interpolation of non-numerical data #3763

Open scottcanoe opened 4 years ago

scottcanoe commented 4 years ago

I'd like to suggest an improvement to enable a resampling mechanism for non-numerical data. In my use case, I have time series data, where each timepoint is associated with a measured variable (e.g., fluorescence) as well as a label indicating the stimulus being presented (e.g., "A"). However, if and when I need to upsample my data, the string-valued stimulus information is lost, and it's imperative that the stimulus information is still present when working on the resampled data.

My solution to this problem has been to map the labels to integers, use nearest-neighbor interpolation on the integer-valued representation, and finally map the integers back to labels. (I'm willing to bet there's a name for this technique, but I wasn't able to find it by googling around for it.)
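A minimal sketch of that workaround, assuming plain NumPy arrays rather than xarray objects. The helper `nearest_index` and the sample values are hypothetical and only illustrate the encode, nearest-neighbor, decode steps.

```python
import numpy as np


def nearest_index(t, new_t):
    """Index of the sample in sorted `t` closest to each value in `new_t`.

    Hypothetical helper; ties go to the earlier sample.
    """
    right = np.clip(np.searchsorted(t, new_t), 1, len(t) - 1)
    left = right - 1
    pick_left = (new_t - t[left]) <= (t[right] - new_t)
    return np.where(pick_left, left, right)


# Made-up data: four timepoints, each with a stimulus label.
t = np.array([0.0, 1.0, 2.0, 3.0])
labels = np.array(["A", "A", "B", "B"])

# 1. Map labels to integer codes.
uniques, codes = np.unique(labels, return_inverse=True)

# 2. Nearest-neighbor "interpolation" of the codes onto a finer time axis.
new_t = np.linspace(0.0, 3.0, 7)
new_codes = codes[nearest_index(t, new_t)]

# 3. Map the integer codes back to labels.
new_labels = uniques[new_codes]
```

Because only nearest-neighbor lookup is used, the integer codes never get averaged, so the arbitrary ordering of the codes does not affect the result.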

I'm new to xarray, but so far as I can tell this functionality is not provided. More specifically, calling DataArray.interp on a string-valued array results in a type error (<builtins.TypeError: interp only works for a numeric type array. Given <U1.>).

Finally, I'd like to applaud you for your work on xarray. I only wish I had found it sooner!

crusaderky commented 4 years ago

Hi Scott,

I can't think of a generic situation where text labels have a numerical weight that is hardcoded to their position in the alphabet, e.g. mean("A", "C") = "B". What one typically does is map the labels (any string) to their (arbitrary) weights, interpolate the weights, and then do a nearest-neighbour interpolation (or floor or ceil, depending on the preference) back to the label. That is what you described, with the special caveat that your weights are the ASCII codes of your labels.
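As a rough sketch of the weight-based approach, assuming NumPy and a made-up weight table (the labels "low", "mid", "high" and their weights are hypothetical): the weights are interpolated linearly and each result is then snapped to the label whose weight is closest.

```python
import numpy as np

# Hypothetical label-to-weight mapping; the weights are arbitrary by design.
weights = {"low": 0.0, "mid": 1.0, "high": 5.0}
names = np.array(list(weights))
values = np.array(list(weights.values()))

# Made-up labeled time series.
t = np.array([0.0, 1.0, 2.0])
labels = np.array(["low", "mid", "high"])
new_t = np.array([0.0, 0.25, 1.0, 1.25, 1.9, 2.0])

# Map labels to weights and interpolate the weights linearly.
w = np.array([weights[label] for label in labels])
w_interp = np.interp(new_t, t, w)

# Snap each interpolated weight to the nearest known weight, then to its label.
nearest = np.abs(w_interp[:, None] - values[None, :]).argmin(axis=1)
new_labels = names[nearest]
```

Swapping the `argmin` step for a floor or ceil lookup would give the other variants mentioned above; which one is appropriate depends on what the weights mean.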

DancingQuanta commented 4 years ago

This sounds like string encoding, a data science technique that actually covers a number of different methods.

scottcanoe commented 4 years ago

Hi all, thanks for the replies. Just to clarify, I'm suggesting that one (or more) of these categorical interpolation techniques be incorporated into the internals of xarray so that any categorical arrays present in the dataset (properly aligned to a given dimension, of course) are interpolated automatically. As it stands, resampling such "mixed" datasets requires manually partitioning the numerical arrays from the categorical arrays and handling their interpolation separately. What makes xarray so appealing to me is how much it replaces of the laborious, error-prone, and not-so-extensible coding I've had to do in order to maintain relationships between various objects. It just seems to me like there is an opportunity here to push more into the background.
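To illustrate the manual partitioning described above, here is a small sketch that assumes the dataset is just a dict of NumPy arrays sharing one time axis. The function `resample_mixed` is hypothetical and is not part of xarray; it only shows the kind of dtype-based dispatch that currently has to be written by hand.

```python
import numpy as np


def resample_mixed(variables, t, new_t):
    """Resample a dict of 1-D arrays from time axis `t` onto `new_t`.

    Hypothetical helper: numeric arrays are interpolated linearly,
    everything else takes the value of the nearest original sample.
    """
    resampled = {}
    for name, values in variables.items():
        values = np.asarray(values)
        if values.dtype.kind in "iuf":  # integer, unsigned, or float
            resampled[name] = np.interp(new_t, t, values)
        else:
            # Nearest original timepoint; ties go to the earlier sample.
            idx = np.abs(new_t[:, None] - t[None, :]).argmin(axis=1)
            resampled[name] = values[idx]
    return resampled


# Made-up data: one measured variable and one stimulus label per timepoint.
t = np.array([0.0, 1.0, 2.0])
data = {
    "fluorescence": np.array([1.0, 3.0, 2.0]),
    "stimulus": np.array(["A", "B", "B"]),
}
result = resample_mixed(data, t, np.array([0.0, 0.5, 1.0, 1.5, 2.0]))
```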

Forgive me if I'm mistaken or if this view is naive or possibly just a bad idea. I've only been working with xarray for a couple of days. Thanks again.

DancingQuanta commented 4 years ago

To convince the xarray developers to help, I suggest providing example data, showing what you have tried with your string-encoding solution, and describing applications for the method. You should also check out pandas, which xarray extends and which is more widely used than xarray. Hopefully someone has had a similar problem with pandas, and you can write here how to apply their solutions.

shoyer commented 4 years ago

Could you share a small example of what you'd like to do, ideally on synthetic data?