Closed jonasbarth closed 1 year ago
I was thinking about the preparation of the dataset, and I came up with this: an np array of size $b \cdot C + m$, where:
Thanks, that is super detailed and useful! These are the 15 channels that we have. source
Following the last commit, I obtained the following graphs:
China percentage for each slice of the trainset:
Average Glacier content for each of the slices in trainset:
Now:
Nice visualisations 😮!
Nice work Matteo! I would also suggest going for the threshold of 0.4 to have a balanced distribution.
After what @martinezvelascojavier pointed out, 0.4 probably isn't the best choice. I'm continuing the processing (with the numpy masks this time), and the results are very different.
In the last commit I started the processing of masks, but with some unexpected results:
The more I look into the labels, the more confused I get too haha. It looks like channel 1 and channel 3 are almost the same? Even though channel 1 supposedly shows clean ice whereas channel 3 supposedly shows whether the pixel is in HKH or China.
If you look at the image below, channel 3 (purple = in HKH) is basically the inverse of channel 1 (purple = glacier), but it doesn't make sense. Why would all the glacier parts be in China? Maybe the README is actually wrong?
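One way to check the "channel 3 is the inverse of channel 1" impression on more than a handful of images is to measure how often two mask channels agree pixel by pixel. This is only a sketch: it assumes the masks are arrays of shape `(H, W, 3)` with non-zero meaning "set", it uses 0-based indices (so "channel 1" and "channel 3" above become `0` and `2`), and `channel_agreement` is a made-up helper name, not something from the repo.

```python
import numpy as np

def channel_agreement(mask, a, b):
    """Return (fraction of pixels where a == b, fraction where a == NOT b).

    Assumes `mask` has shape (H, W, C) and that any non-zero value means "set".
    """
    x = mask[..., a] > 0
    y = mask[..., b] > 0
    same = float(np.mean(x == y))
    return same, 1.0 - same

# Synthetic mask standing in for a real one (hypothetical data).
demo = np.zeros((4, 4, 3), dtype=np.uint8)
demo[:2, :, 0] = 1   # top half marked in channel 0
demo[2:, :, 2] = 1   # bottom half marked in channel 2
same, inverse = channel_agreement(demo, 0, 2)
print(f"identical: {same:.2f}, inverted: {inverse:.2f}")
```

An `inverse` value close to 1 across many masks would back up the "inverse" reading, while a `same` value close to 1 would mean the two channels carry the same information.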
code
I only looked at 4-5 images, but it seems to me that there is almost no debris-covered glacier? This is also what you observe @Mamiglia when you say that the 1st and 2nd are very similar to the 3rd? The 2nd channel probably only adds very little information because there is not a lot of debris covered glacier 🤔
Edit: In the paper it also says that there is only a small amount of debris covered glaciers in the data:
Debris-covered glaciers are more similar to the background, often leading to false negatives. Debris-covered glaciers are also much rarer.
Maybe we should just keep the first two and ignore the 3rd?
Yeah I say we should totally ditch the third.
Btw, what the README says is that the third channel is a "mask of whether the pixel belongs to the HKH region (glaciers outside the HKH region, e.g., those in China, were not annotated)", which isn't entirely clear to me.
Based on the mask in the first and second channels I have created the labels for the image. I have stored them in the text file which is now present in the dataset folder. They are indexed in the same format (kind of) as we did while creating the processed meta_data.
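For reference, a rough sketch of how a per-image label could be derived from the first two mask channels. It is not necessarily what the committed script does: the `(H, W, C)` layout, the binary glacier / no-glacier label, the function names and the default threshold (0.4 was suggested earlier in the thread and then questioned) are all assumptions.

```python
import numpy as np

def glacier_fraction(mask):
    """Fraction of pixels marked as clean OR debris-covered glacier.

    Assumes shape (H, W, C) with channel 0 = clean ice, channel 1 = debris.
    """
    glacier = (mask[..., 0] > 0) | (mask[..., 1] > 0)
    return float(glacier.mean())

def label_patch(mask, threshold=0.4):
    """Binary label: 1 if the glacier fraction reaches the threshold, else 0."""
    return int(glacier_fraction(mask) >= threshold)

# Synthetic example: half of the pixels are clean ice.
demo_mask = np.zeros((4, 4, 3), dtype=np.uint8)
demo_mask[:2, :, 0] = 1
print(glacier_fraction(demo_mask), label_patch(demo_mask))
```

Keeping the threshold as a parameter makes it cheap to regenerate the labels if we settle on a value other than 0.4.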
CHANNEL 0: CLEAN GLACIER
CHANNEL 1: DEBRIS GLACIER
CHANNEL 2: COMPLEMENT OF THE INTERSECTION OF CHANNEL 0 AND CHANNEL 1
PRE-PROCESSING: TRANSFORMED DATA: HISTOGRAMS,
LINEAR CLASSIFIER TO USE:
DEVELOP CNN
Apparently, for some reason, there are some completely out-of-scale values in the matrices:
This is the 1st channel of the splits
As you can see there are some extreme values, but when removed the distribution is quite normal
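A minimal sketch of how the "remove the extremes, then look at the distribution" step could be reproduced. The percentile cut-offs (1 and 99) and the function name are arbitrary assumptions rather than the values behind the plots, and non-finite entries (NaN, inf) are dropped first so they do not distort the statistics.

```python
import numpy as np

def robust_summary(channel, lo=1.0, hi=99.0):
    """Summarise one channel before and after trimming extreme values."""
    values = channel[np.isfinite(channel)]           # drop NaN / inf
    p_lo, p_hi = np.percentile(values, [lo, hi])     # trimming bounds
    kept = values[(values >= p_lo) & (values <= p_hi)]
    return {
        "n_nonfinite": int(channel.size - values.size),
        "raw_min": float(values.min()),
        "raw_max": float(values.max()),
        "trimmed_min": float(kept.min()),
        "trimmed_max": float(kept.max()),
        "trimmed_mean": float(kept.mean()),
    }

# Synthetic channel: roughly normal values plus two outliers and one NaN.
rng = np.random.default_rng(0)
demo_channel = rng.normal(0.0, 1.0, size=1000)
demo_channel[0], demo_channel[1], demo_channel[2] = 1e6, -1e6, np.nan
summary = robust_summary(demo_channel)
print(summary)
```

Comparing `raw_min`/`raw_max` with the trimmed values per channel shows quickly which channels are affected.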
The values in the different spectral channels are already quantised to 256 levels (8-bit), so I think we do not have to normalise them in the pre-processing stage. If normalisation is required in the modelling stage (depending on the algorithm's requirements), then you can go for it. I think the stats reported for the 1st channel earlier may be due to the presence of NaN values.
Summary of values for each channel
I find it quite strange that we have both integer values and negative ones. Maybe we mixed up data from both sources?
Ok, for example: og_dataset/splits/dev/slice_3_img_056.npy contains non-integer values that look like they were normalized.
Instead, og_dataset/splits/dev/slice_3_img_065.npy contains integer values only.
Also, as far as I noticed, the only images that contain non-integer values are those with > 10% of ice, meaning that the slices in the splits folder probably had normalized values, while those in the patches did not.
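To check this across every file rather than by eye, something like the sketch below could be run on each array. The helper name and the returned keys are made up; the demo arrays are synthetic stand-ins for what `np.load(path)` would return for the `.npy` files.

```python
import numpy as np

def value_profile(arr):
    """Report whether an array looks like raw integer data or normalized floats."""
    finite = arr[np.isfinite(arr)]                  # ignore NaN / inf
    return {
        "integer_valued": bool(np.all(finite == np.round(finite))),
        "has_negative": bool(np.any(finite < 0)),
        "min": float(finite.min()),
        "max": float(finite.max()),
    }

# Synthetic stand-ins for two loaded slices (hypothetical values).
raw_like = np.array([[0.0, 12.0, 255.0], [7.0, 128.0, 64.0]])
normalized_like = np.array([[-0.31, 0.52, 1.20], [0.04, -1.10, 0.77]])
raw_profile = value_profile(raw_like)
norm_profile = value_profile(normalized_like)
print(raw_profile)
print(norm_profile)
```

Looping this over both folders and grouping by `integer_valued` should show whether the split really follows the > 10% ice pattern.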
Things to do
check how many image patches contain "too much" China. China does not exist on the images.
check if the parts of the image that are labelled as China (they are 0 in the 3rd channel of the mask), are also empty in the original image.
Data
Some useful information about the data we are using, taken from the paper. Information about Landsat 7.