Non-uniform distribution of S100 dataset

Hi there,

While exploring the pre-training data, I noticed an issue with the S100 dataset that I think can be fixed easily. I visualized it here

So, basically the problems are:

oversampling in Greenland (probably due to the Sentinel-2 path, which has a much higher visit frequency near the poles)
undersampling in the tropics and at 50-70°N latitude

The problem arises from 1) the Sentinel-2 path and 2) filtering out the dates with high cloud cover, which affects the tropics a lot.
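For anyone who wants to check this on their own copy of the index, here is a rough sketch of counting samples per latitude band. The function name `counts_per_lat_band` and the example latitudes are made up for illustration and are not taken from S100. Note that equal-width latitude bands cover unequal areas, so the raw counts are only a rough indicator.

```python
import numpy as np
import pandas as pd

def counts_per_lat_band(lats, band_deg=10):
    """Count samples in latitude bands of width band_deg degrees."""
    # Band edges from -90 to 90 inclusive, e.g. -90, -80, ..., 90.
    edges = np.arange(-90, 90 + band_deg, band_deg)
    bands = pd.cut(pd.Series(lats), bins=edges, include_lowest=True)
    # sort=False keeps the bands in south-to-north order.
    return bands.value_counts(sort=False)

# Hypothetical latitudes, NOT real S100 values.
lats = [72.5, 74.1, 76.3, 71.0, 5.2, 48.9, 61.0]
counts = counts_per_lat_band(lats)
print(counts[counts > 0])
```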

I was thinking about how to get uniform sampling and realized that the first step of creating S100 is to pick a Sentinel-2 tile, and tiles are distributed approximately uniformly. So forcing the algorithm to pick approximately the same number of pictures per Sentinel-2 tile should fix it. My easy fix suggestion is to sample uniformly by tile name (the tiles have the attribute 's2:mgrs_tile') like this:

    df['weight'] = 1. / df.groupby('s2:mgrs_tile')['s2:mgrs_tile'].transform('count')
    sampledf = df.sample(100000, weights=df['weight'])
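To show the effect of the per-tile weighting, here is a small self-contained sketch on synthetic data. The helper name `sample_uniform_by_tile`, the tile names, and the row counts are all made up for illustration; only the 's2:mgrs_tile' column name comes from the snippet above. Each row gets weight 1 / (rows in its tile), so every tile has the same total weight no matter how many images it contains.

```python
import pandas as pd

def sample_uniform_by_tile(df, n, tile_col="s2:mgrs_tile", seed=0):
    """Draw n rows so that each tile has roughly equal total probability."""
    # Weights within one tile sum to 1, regardless of the tile's size.
    # pandas normalizes the weights to sum to 1 overall.
    weights = 1.0 / df.groupby(tile_col)[tile_col].transform("count")
    return df.sample(n=n, weights=weights, random_state=seed)

# Synthetic index (hypothetical tile names and counts):
# one tile has 50x more images than another.
tiles = ["27XVB"] * 5000 + ["18NUG"] * 100 + ["32UNU"] * 1000
toy = pd.DataFrame({"s2:mgrs_tile": tiles, "id": range(len(tiles))})

naive = toy.sample(n=150, random_state=0)       # follows the raw image counts
balanced = sample_uniform_by_tile(toy, n=150)   # roughly even across tiles

print(naive["s2:mgrs_tile"].value_counts())
print(balanced["s2:mgrs_tile"].value_counts())
```

One caveat: sampling is without replacement, so a tile with very few images cannot contribute more rows than it has, and the result is only approximately uniform when the requested sample is large relative to the small tiles.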

I know that the SatCLIP model trained on S100 is only a prototype and a proof of concept, but in case you want to run experiments with more uniformly distributed pre-training data, this seems quite easy to fix :)

Kind regards, Elena
