holoviz-topics / EarthML

Tools for working with machine learning in earth science
https://earthml.holoviz.org
BSD 3-Clause "New" or "Revised" License

modified example notebooks #4

Closed ebo closed 6 years ago

ebo commented 6 years ago

I was working on a branch of pyviz-topics/EarthML and something was not looking right, so I forked the repository and figured that I would send pull requests from here. I can change this workflow if you really want, but I figure that is safer.

I modified a couple of older examples which transposed the input data so that rasterio, xarray, dask-ml and HoloViews process and display the data in a reasonable way. I wanted to post the current state of the code before leaving for the day.

I also started setting up the end-to-end example of replicating a study of lake volume change that was recently published in Nature Geoscience. This is only a start, but the intention is to replicate the initial image processing of the study.

jlstevens commented 6 years ago

... so I forked the repository and figured that I would send pull requests from here. I can change this workflow if you really want, but I figure that is safer.

Seems like a good workflow to me!

I'm having a look through the notebooks now. Thanks for clearing landsat_spectral_clustering_xa.ipynb in the PR: I have removed the commit from master to avoid bloating the repo (and I have a copy of the notebook with output locally for reference).

mrocklin commented 6 years ago

Thanks @ebo !

I just ran through things. Here are my initial observations. Hopefully these are the same problems that you want fixed :)

  1. We're loading a modest array into memory completely. My understanding is that arr might become much, much larger in the future.
  2. Despite the relatively small size of this array, computing the spectral clustering seems to take around 10GB of memory, which is far more than we would like. Presumably as we scale out arr this number will continue well beyond these limits, which is concerning.

Is this correct? Are there other concerns here that I'm missing?
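
One common way to avoid the first problem is to walk over the raster in row blocks instead of loading arr whole. The following is only a minimal sketch of that idea in plain Python, not the code from the notebooks; `iter_row_blocks` and the sizes used are hypothetical and chosen for illustration. In practice each slice could drive a windowed read or map onto a dask chunk.

```python
from typing import Iterator


def iter_row_blocks(n_rows: int, block_rows: int) -> Iterator[slice]:
    """Yield slices covering rows [0, n_rows) in blocks of at most block_rows."""
    if n_rows < 0 or block_rows <= 0:
        raise ValueError("n_rows must be >= 0 and block_rows must be > 0")
    for start in range(0, n_rows, block_rows):
        # The last block is clipped so it never runs past the image.
        yield slice(start, min(start + block_rows, n_rows))


# Hypothetical sizes: a 10-row image processed 4 rows at a time.
blocks = list(iter_row_blocks(10, 4))
print(blocks)
```

Only one block needs to be resident at a time, so peak memory depends on the block size rather than on the full image.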

ebo commented 6 years ago

I have so many different little test notebooks lying around from attempts to figure this out that I cannot keep them all straight, so I will answer your question in general.

I have been handed images as large as 34,000 x 245,000 pixels with eight 16-bit bands that I will need to process. There is no way that I will be able to read them completely into memory and process them on any of the VMs I realistically have access to. In broad strokes these images have the same structure as Landsat images, and I was trying to come up with publicly distributable examples that we can all work through. In addition, to compare dask/xarray/rasterio and friends to previous versions of the machine learning code, I need to limit the memory footprint to 4GB of RAM and 1 or 2 threads/core (regardless of the actual size of the VM). So whatever the size of the example, assume that any example other than a unit/regression test will have to scale to images 10 to 100 times the size.
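
To put those numbers in perspective, the arithmetic can be sketched as below. The image dimensions and the 4GB limit come from the comment above; reading 4GB as 4 GiB and reserving only half of it for raw pixel data are assumptions made purely for illustration.

```python
# Dimensions quoted above; a 16-bit sample takes 2 bytes.
ROWS, COLS, BANDS, BYTES_PER_SAMPLE = 34_000, 245_000, 8, 2

# Assumptions: the 4GB limit is read as 4 GiB, and only half of it is
# reserved for raw pixels (the rest is left for intermediate results).
BUDGET_BYTES = 4 * 1024**3
PIXEL_SHARE = 0.5  # hypothetical safety factor

total_bytes = ROWS * COLS * BANDS * BYTES_PER_SAMPLE  # whole image, uncompressed
row_bytes = COLS * BANDS * BYTES_PER_SAMPLE           # one full-width row, all bands
rows_per_block = int(BUDGET_BYTES * PIXEL_SHARE) // row_bytes
n_blocks = -(-ROWS // rows_per_block)                 # ceiling division

print(f"full image: {total_bytes / 1024**3:.1f} GiB")
print(f"rows per block: {rows_per_block}, blocks needed: {n_blocks}")
```

Under these assumptions the uncompressed image is roughly 133 GB, about 33 times the memory budget, so it has to be processed in dozens of full-width row blocks (or smaller tiles).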

Also, I did not realize that I was the one who needed to confirm the pull request. I have several new examples merged now.