holoviz-topics / EarthML

Tools for working with machine learning in earth science
https://earthml.holoviz.org
BSD 3-Clause "New" or "Revised" License

Tools for reshaping multidimensional arrays for use with sklearn #1

Closed by jbednar 5 years ago

jbednar commented 6 years ago

In the EarthML project, we need to apply machine-learning tools like sklearn to multidimensional array data, such as xarray objects, that doesn't fit naturally into sklearn's two-dimensional (samples × features) input format. Of course, arrays can be flattened before running the algorithm and then reshaped back afterwards, as in Tom Augspurger's spectral clustering example.
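
For concreteness, here is a minimal sketch of that flatten, fit, reshape round trip. It is not taken from Tom's notebook: the random `cube` array, its `(y, x, band)` layout, and the use of `KMeans` as a stand-in estimator are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical image-like cube with dimensions (y, x, band).
cube = rng.normal(size=(4, 5, 3))
ny, nx, nband = cube.shape

# Flatten the two spatial dimensions so that each pixel becomes one
# sample (row) and each band becomes one feature (column).
samples = cube.reshape(ny * nx, nband)

# KMeans is only a stand-in here; any estimator taking a 2D array would do.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_flat = model.fit_predict(samples)

# Reshape the per-pixel labels back onto the original (y, x) grid.
labels = labels_flat.reshape(ny, nx)
```

Note that nothing in `labels` records which axis is `y` and which is `x`, or what coordinates they carry; that bookkeeping is left entirely to the user.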

However, doing so is awkward and error-prone, and it is likely to lose metadata such as lat, lon coordinates. This is especially true for more complicated multidimensional arrays, where data has to be selected along a range of some dimensions or sliced out of the array, and then restored to that range and slice afterwards for analysis and visualization. It is presumably even more painful and error-prone if the dimensionality changes as a result of any of the sklearn operations (e.g. PCA).
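
To make the bookkeeping burden concrete, here is a rough sketch of doing this by hand with xarray and PCA. The array `da`, its `time`/`lat`/`lon` dimensions, the coordinate values, and the new `component` dimension name are all made up for illustration; this is not how any of the libraries mentioned below do it.

```python
import numpy as np
import xarray as xr
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical (time, lat, lon) array with labeled coordinates.
da = xr.DataArray(
    rng.normal(size=(6, 4, 5)),
    dims=("time", "lat", "lon"),
    coords={
        "time": np.arange(6),
        "lat": np.linspace(-10.0, 10.0, 4),
        "lon": np.linspace(100.0, 120.0, 5),
    },
)

# Stack lat/lon into one "pixel" dimension (lat varies slowest) and put it
# first, giving the (samples, features) layout that sklearn expects.
stacked = da.stack(pixel=("lat", "lon")).transpose("pixel", "time")

# PCA replaces the 6 time steps with 2 components, so the output no longer
# has the same dimensions as the input.
pca = PCA(n_components=2)
scores = pca.fit_transform(stacked.values)  # shape (n_pixels, 2)

# Rebuild a labeled array by hand: restore the lat/lon grid and attach the
# original coordinates, with "component" standing in for "time".
reduced = xr.DataArray(
    scores.reshape(da.sizes["lat"], da.sizes["lon"], 2),
    dims=("lat", "lon", "component"),
    coords={
        "lat": da["lat"].values,
        "lon": da["lon"].values,
        "component": [0, 1],
    },
)
```

The last step works only because the reshape order matches the order used when stacking; getting that wrong silently scrambles the grid, which is the kind of error a dedicated tool should rule out.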

Existing libraries to deal with these issues take one of two approaches:

  1. making an xarray object aware of sklearn functionality: xarray_filters
  2. making sklearn able to ingest xarray objects: phausamann/sklearn-xarray, nbren12/sklearn-xarray

These two approaches have very different implications, and it would be good if we could tease those out explicitly here before moving forward with a particular choice. They are also at very different stages of development: phausamann/sklearn-xarray seems further along than the rest, but I don't know how well any of them handles the end stages of this process (reshaping back to the original shape or to some appropriately reduced version of it). @jlstevens, any thoughts?

jlstevens commented 6 years ago

As you say, I think there are different implications depending on the approach chosen and it is important to think through those implications carefully.

Starting with Tom's notebook and any other simple example I can find, I will report back in this issue once I have an opinion about how the two approaches differ, considering xarray_filters relative to the other two libraries.

mrocklin commented 6 years ago

I encourage people here to reach out to the authors of those packages. They may be able to help guide you towards correct use and inform you about why they made the decisions that they made. I also encourage you to reach out to people who have used these packages to hear what they like and dislike about them. That might help to ensure broader impact.

jbednar commented 5 years ago

Addressed with xarray examples.