microsoft / torchgeo

TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data
https://www.osgeo.org/projects/torchgeo/
MIT License

weighted geo sampler #757

Open Geethen opened 2 years ago

Geethen commented 2 years ago

Summary

For imbalanced datasets, the current samplers may not help address the imbalance between samples.

I am currently trying to get rid of samples whose mask contains only 0-metre heights (water regions).

Rationale

No response

Implementation

No response

Alternatives

No response

Additional information

No response

adamjstewart commented 2 years ago

I like the idea, but how would you implement it? Unlike NonGeoDatasets, GeoDatasets will recursively search for files on disk, so you can't just pass in a list of weights. You could compute those weights, but how would you make a single class that is generic enough to allow users to do this?

calebrob6 commented 2 years ago

You could get a list of filenames from RasterDataset's index, compute weights, then pass those to the sampler. I'll note this is a good reason why RasterDatasets should be able to be instantiated from a list of filenames.
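A minimal sketch of the "compute weights, then pass those to the sampler" idea, in plain Python rather than the torchgeo API. The file names, the per-file non-zero fractions, and the helpers `file_weights` and `draw_files` are all hypothetical; how the statistics are obtained from the dataset index is left open here.

```python
import random


def file_weights(nonzero_fraction, floor=0.05):
    """Map each file to a sampling weight.

    nonzero_fraction: dict of filename -> fraction of mask pixels that are non-zero.
    floor: minimum weight, so all-zero (water-only) files become rare, not excluded.
    """
    return {name: max(frac, floor) for name, frac in nonzero_fraction.items()}


def draw_files(weights, k, seed=None):
    """Draw k filenames with replacement, with probability proportional to weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


# Hypothetical per-file statistics (not taken from any real dataset).
stats = {"tile_a.tif": 0.0, "tile_b.tif": 0.4, "tile_c.tif": 0.9}
weights = file_weights(stats)
sample = draw_files(weights, k=1000, seed=0)
```

The floor keeps some water-only tiles in the draw instead of removing them outright; setting it to 0 would exclude them entirely.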


adamjstewart commented 2 years ago

You could get a list of filenames from RasterDataset's index, compute weights, then pass those to the sampler.

This feels a bit fragile. For example, if your dataset is an IntersectionDataset or UnionDataset, you now need to be more careful because each "hit" could be both image and label, or from a different dataset entirely. But yes, this could work.

I'll note this is a good reason why RasterDatasets should be able to be instantiated from a list of filenames.

Should be easier to support a list of filenames for instantiation when we move to TorchData.

calebrob6 commented 2 years ago

Should be easier to support a list of filenames for instantiation when we move to TorchData.

I don't understand why this is particularly hard now; I guess I need to try it.


adamjstewart commented 2 years ago

It's not hard to support without TorchData, but it becomes easier to support with TorchData because the user can construct their own data loading pipeline with a set of common operations. So they can choose whether they want to specify a list of files, or recursively search a directory, or use a STAC API, or whatever. I also still need to investigate TorchData. I'm hoping it doesn't put all of the work on the user.

isaaccorley commented 2 years ago

This seems like 2 separate problems.

  1. Dealing with sampling from imbalanced datasets
  2. You are trying to remove areas where a value in a mask is zero. Could a possible solution be to create another mask Raster Dataset where values aren't 0 in the original mask and then take the intersection of these?

Geethen commented 2 years ago

This seems like 2 separate problems.

  1. Dealing with sampling from imbalanced datasets

This is the broader problem. One way I would approach it is to generate a grid based on user-specified criteria (pixel width, pixel height and nSamples), then compute the percentage cover of each label value per grid cell (patch), and lastly filter out any patches that do not meet the weight criteria specified by the user. For example, in my case, any cell with at most 50% cover of zero is allowed. I could quickly and easily do this in Earth Engine but have no idea how to go about it in Python, so for now I will implement this in GEE to preprocess the data I use. In a regression problem like mine, only the zero value is problematic, so the problem is somewhat simpler than in multi-class classification.
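The grid-and-filter idea can be sketched in plain Python roughly as follows. This is only an illustration on a small in-memory mask, not torchgeo code: `zero_fraction`, `keep_patches` and the `max_zero_fraction` parameter are hypothetical names, the grid is non-overlapping, and the nSamples criterion is left out.

```python
def zero_fraction(mask, row, col, height, width):
    """Fraction of pixels equal to 0 in the patch whose top-left corner is (row, col)."""
    patch = [mask[r][col:col + width] for r in range(row, row + height)]
    zeros = sum(value == 0 for line in patch for value in line)
    return zeros / (height * width)


def keep_patches(mask, height, width, max_zero_fraction=0.5):
    """Tile the mask into non-overlapping patches and keep those whose
    zero cover is at most max_zero_fraction."""
    rows, cols = len(mask), len(mask[0])
    kept = []
    for row in range(0, rows - height + 1, height):
        for col in range(0, cols - width + 1, width):
            if zero_fraction(mask, row, col, height, width) <= max_zero_fraction:
                kept.append((row, col))
    return kept


# Toy 4x4 height mask; 0 stands for water.
mask = [
    [0, 0, 5, 7],
    [0, 0, 3, 0],
    [0, 2, 0, 0],
    [4, 6, 0, 0],
]
kept = keep_patches(mask, height=2, width=2)
print(kept)  # [(0, 2), (2, 0)]
```

Patches that are partly zero survive the filter, so the model still sees some zero labels; only the cells dominated by water are dropped.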

  2. You are trying to remove areas where a value in a mask is zero. Could a possible solution be to create another mask Raster Dataset where values aren't 0 in the original mask and then take the intersection of these?

It is beneficial to have some zero labels to learn from. Also, I do not think torchgeo supports irregular polygons for intersection datasets, only bounding boxes.

adamjstewart commented 3 weeks ago

FYI, we are planning on working on this for our time series efforts. All samplers will allow users to pass in weights, not just a single WeightedGeoSampler.