microsoft / satclip

PyTorch implementation of SatCLIP
MIT License

Non-uniform distribution of S100 dataset #11

Open PlekhanovaElena opened 2 months ago

PlekhanovaElena commented 2 months ago

Hi there,

While exploring the pre-training data, I noticed an issue with the S100 dataset that I think can be fixed easily. I visualized the point distribution here:

[Figure: s100_points_distribution]

So, basically the problems are:

  1. oversampling in Greenland (probably due to the Sentinel-2 path, which has a much higher visit frequency near the poles)
  2. undersampling in the tropics and at 50-70°N latitude

The problem arises from 1) the Sentinel-2 path and 2) filtering out dates with high cloud cover, which affects the tropics a lot.

I was thinking about a solution for uniform sampling and realized that the first step of creating S100 is to pick a Sentinel-2 tile, and tiles are distributed approximately uniformly. So forcing the algorithm to pick approximately the same number of images per Sentinel-2 tile should fix it. My easy fix suggestion is to sample uniformly by tile name (the tiles have the attribute 's2:mgrs_tile') like this:

    df['weight'] = 1. / df.groupby('s2:mgrs_tile')['s2:mgrs_tile'].transform('count')
    sampledf = df.sample(100000, weights=df.weight)
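To illustrate the effect of this weighting, here is a minimal, self-contained sketch on synthetic metadata. The helper name `sample_balanced_by_tile`, the tile IDs, and the row counts are made up for illustration; only the column name 's2:mgrs_tile' and the inverse-count weighting come from the suggestion above.

```python
import pandas as pd


def sample_balanced_by_tile(df, n, tile_col="s2:mgrs_tile", seed=0):
    """Sample n rows, weighting each row by 1 / (number of rows in its tile).

    Every tile then carries the same total weight, so tiles are drawn
    roughly equally often regardless of how many images each one has.
    """
    counts = df.groupby(tile_col)[tile_col].transform("count")
    weights = 1.0 / counts
    # Sampling is without replacement (the pandas default).
    return df.sample(n=n, weights=weights, random_state=seed)


# Hypothetical metadata: one heavily oversampled tile and two sparse tiles.
df = pd.DataFrame(
    {"s2:mgrs_tile": ["27XVA"] * 9000 + ["18NXL"] * 500 + ["32TMT"] * 500}
)

balanced = sample_balanced_by_tile(df, n=300)
naive = df.sample(n=300, random_state=0)

print("balanced:\n", balanced["s2:mgrs_tile"].value_counts())
print("naive:\n", naive["s2:mgrs_tile"].value_counts())
```

In this toy setup the naive sample is dominated by the oversampled tile, while the weighted sample draws roughly a third of its rows from each tile. Because sampling is without replacement, a tile with fewer rows than its fair share of the sample cannot be fully balanced, so the result is only approximately uniform.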

I know that the SatCLIP model trained on S100 is only a prototype and a proof of concept, but in case you want to run experiments with more uniformly distributed pre-training data, this seems quite easy to fix :)

Kind regards, Elena

konstantinklemmer commented 2 months ago

Fantastic, thanks for this analysis @PlekhanovaElena! I will link it in the main repository.