radiantearth / geo-ml-model-catalog

Geospatial ML Model Catalog Spec
Apache License 2.0

How to capture randomly sampled training data? #13

Closed: duckontheweb closed this issue 3 years ago

duckontheweb commented 3 years ago

From the original Google Doc comments:

One of our training strategies is to randomly sample patches/labels from a large AOI. This will not give an exactly reproducible input data set (or label set).

How can we capture these kinds of randomly sampled training datasets in the Model Training Fragment?

HamedAlemo commented 3 years ago

This is a very good use case. Doesn't the ML AOI Extension address this? I mean, if one samples the training dataset randomly, they can record the result in the catalog after the sampling is done, right? I might be missing something here. @calebrob6 can you expand on this maybe?

calebrob6 commented 3 years ago

To expand on "One of our training strategies is to randomly sample patches/labels from a large AOI. This will not give an exactly reproducible input data set (or label set)":

I'm imagining a situation where you have a large AOI of size (H x W x C) with a corresponding label mask. One way to train a model with this dataset is to randomly sample a set of M patches, each of size (256 x 256 x C), in a pre-processing step, then train a model on those patches. Here you might deliberately over-sample, i.e. pick M >> (H × W) / (256 × 256), so that you don't "waste" any labels. In this case it seems wrong to actually render the pre-processed patches out to disk, as they will take up more space than your original data.

Another way to train a model is to simply do online random sampling of patches over the AOI (e.g. in your model's dataloader). This works particularly well with COG-formatted data because then you don't even need to have your very large image dataset on the same machine as your compute resources (e.g. we've found you can keep a V100 GPU busy using this pytorch Dataset implementation if your data lives in the same Azure region as the VM you are using for training).
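For concreteness, here is a minimal sketch of the online-random-sampling idea, not the implementation referenced above: a PyTorch `Dataset` that draws random 256 x 256 windows from a (possibly remote, COG-formatted) image and label raster via `rasterio`. The class and parameter names, URLs, and the fixed epoch length are assumptions for illustration only.

```python
import numpy as np
import rasterio
import torch
from rasterio.windows import Window
from torch.utils.data import Dataset


class RandomPatchDataset(Dataset):
    """Draws random (patch_size x patch_size) windows from a large raster pair."""

    def __init__(self, image_url, label_url, patch_size=256, length=10000):
        self.image_url = image_url    # e.g. an HTTP/Azure Blob URL to a COG
        self.label_url = label_url    # matching label mask raster
        self.patch_size = patch_size
        self.length = length          # M, the number of patches per "epoch"

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open lazily so each dataloader worker holds its own file handles.
        with rasterio.open(self.image_url) as img, rasterio.open(self.label_url) as lbl:
            col = np.random.randint(0, img.width - self.patch_size)
            row = np.random.randint(0, img.height - self.patch_size)
            window = Window(col, row, self.patch_size, self.patch_size)
            x = img.read(window=window).astype("float32")   # (C, 256, 256)
            y = lbl.read(1, window=window).astype("int64")   # (256, 256)
        # Returning the window offsets makes it easy to log them for reproducibility.
        return torch.from_numpy(x), torch.from_numpy(y), (col, row)
```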

Saving the coordinates of each patch you sample in the "online random sampling" step seems reasonable. You could easily extend the above dataloader to pull patches from a list of coordinates instead of sampling randomly.
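Building on the hypothetical sketch above (same imports), the deterministic variant might look like the following: replace the random draw with a pre-recorded list of (col, row) offsets so the exact same patch set can be reproduced later. Again, names are placeholders rather than part of any existing implementation.

```python
class CoordinatePatchDataset(RandomPatchDataset):
    """Replays patches from saved (col, row) offsets instead of sampling randomly."""

    def __init__(self, image_url, label_url, coords, patch_size=256):
        super().__init__(image_url, label_url, patch_size, length=len(coords))
        self.coords = coords  # list of (col, row) offsets recorded during training

    def __getitem__(self, idx):
        col, row = self.coords[idx]
        with rasterio.open(self.image_url) as img, rasterio.open(self.label_url) as lbl:
            window = Window(col, row, self.patch_size, self.patch_size)
            x = img.read(window=window).astype("float32")
            y = lbl.read(1, window=window).astype("int64")
        return torch.from_numpy(x), torch.from_numpy(y), (col, row)
```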

HamedAlemo commented 3 years ago

Thanks @calebrob6. I think this is a really good fit for the ML AOI Extension of STAC. Whether the sampling is random or not, the coordinates can be stored in the metadata so one can reproduce the same results.
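As a hedged, non-normative sketch of what that could look like, each sampled patch might be recorded as a STAC Item carrying the ML AOI extension's split field. The `ml-aoi:split` property comes from that extension; the schema URL/version, the `sampled_window` property, and all IDs and coordinates below are illustrative assumptions, not defined by any spec.

```python
# One STAC Item per sampled training chip (illustrative only).
patch_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        # Placeholder version; use whatever schema the ML AOI extension currently publishes.
        "https://stac-extensions.github.io/ml-aoi/v0.1.0/schema.json"
    ],
    "id": "chip-000123",
    # Footprint / bbox of the sampled 256 x 256 patch (placeholder coordinates).
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-71.1, 42.3], [-71.0, 42.3], [-71.0, 42.4],
                         [-71.1, 42.4], [-71.1, 42.3]]],
    },
    "bbox": [-71.1, 42.3, -71.0, 42.4],
    "properties": {
        "datetime": "2021-01-01T00:00:00Z",
        "ml-aoi:split": "train",         # train / test / validate
        "sampled_window": [1024, 2048],  # hypothetical: (col, row) pixel offset used for the patch
    },
    "assets": {},
    "links": [],
}
```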

duckontheweb commented 3 years ago

It sounds like there's consensus that using the ML AOI Extension to describe the train/test/validation split should cover this case reasonably well, so I'm going to close this.

If more specific problems come up in an implementation, we can raise those in their own issues.