jonasbarth commented 1 year ago

Things to do

~~check how many image patches contain "too much" China. China does not exist on the images~~.
~~check if the parts of the image that are labelled as China (they are 0 in the 3rd channel of the mask), are also empty in the original image.~~
Create a single binary label for each patch. In the mask, a 1 in channels 1 and 2 corresponds to glacier. We need to choose the percentage of glacier pixels above which we consider a patch to be a glacier.

Data

Some useful information about the data we are using, taken from the paper. Information about Landsat 7.

Mamiglia commented 1 year ago

I was thinking about the preparation of the dataset, and I came up with this:

Preprocessing

Masks:
- sum debris and clean ice (ignore third layer)
- get a percentage of ice for each image (save it somewhere)
- plot a histogram of the ice percentage per image, analyze it and try to pick a good threshold for positives and negatives
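
A minimal sketch of the mask steps above, assuming each mask is a NumPy array of shape `(H, W, 3)` whose first two channels (0-based indices 0 and 1) hold clean ice and debris. The function names and the default threshold are illustrative only. The two layers are combined with a logical OR rather than a plain sum, so a pixel marked in both layers is not counted twice.

```python
import numpy as np

def ice_fraction(mask: np.ndarray) -> float:
    """Fraction of pixels marked as glacier in either of the first two channels.

    Assumption: mask has shape (H, W, 3); the third channel is ignored.
    """
    glacier = (mask[..., 0] > 0) | (mask[..., 1] > 0)
    return float(glacier.mean())

def binary_label(mask: np.ndarray, threshold: float = 0.4) -> int:
    """1 if the patch holds at least `threshold` glacier, else 0.

    The default threshold is a placeholder, not a recommendation.
    """
    return int(ice_fraction(mask) >= threshold)
```

The per-image fractions returned by `ice_fraction` are the values to save and plot in the histogram.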
Splits:
- identify which channel is what
  - discard channels if they are completely useless
- compute histogram for each of channels (save it)
- compute histogram of gradient (for each of the channels?), save it
  - histogram of gradient by intensity?
  - by angle?
- pick an appropriate number for the bins for each histogram
  - write a function that applies such bins for each of the histograms in such a way that we can easily pick and change this hyperparameter later
- also paste maybe some metadata from the JSON into it
- paste everything into an np array of size $b \cdot C + m$, where:
  - $b :=$ number of bins,
  - $C :=$ number of channels
  - $m :=$ number of metadata features added
- be sure to write down everything about the structure of the final np arrays, such that everything is clear about each of the elements of the array
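
One possible shape for the feature builder described above, as a sketch. It assumes each slice is an array of shape `(H, W, C)` with values in a known range (0 to 255 here, which is an assumption), and that metadata arrives as a flat sequence of numbers. `histogram_features` is a hypothetical name, and gradient histograms are left out.

```python
import numpy as np

def histogram_features(patch, n_bins=16, value_range=(0, 255), metadata=()):
    """Build a 1-D feature vector of length n_bins * C + len(metadata).

    Layout: [hist(channel 0), hist(channel 1), ..., hist(channel C-1), metadata].
    Each channel histogram is normalised to sum to 1 (if the channel has any
    finite values inside value_range).
    """
    n_channels = patch.shape[-1]
    parts = []
    for c in range(n_channels):
        values = patch[..., c].ravel().astype(float)
        values = values[np.isfinite(values)]  # skip NaN / inf pixels
        hist, _ = np.histogram(values, bins=n_bins, range=value_range)
        total = hist.sum()
        parts.append(hist / total if total > 0 else hist.astype(float))
    parts.append(np.asarray(metadata, dtype=float))
    return np.concatenate(parts)
```

Because `n_bins` is a parameter, the bin count can be changed later without touching the rest of the pipeline.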

jonasbarth commented 1 year ago

Thanks, that is super detailed and useful! These are the 15 channels that we have. source

LE7 B1 (blue)
LE7 B2 (green)
LE7 B3 (red)
LE7 B4 (near infrared)
LE7 B5 (shortwave infrared 1)
LE7 B6_VCID_1 (low-gain thermal infrared)
LE7 B6_VCID_2 (high-gain thermal infrared)
LE7 B7 (shortwave infrared 2)
LE7 B8 (panchromatic)
LE7 BQA (quality bitmask)
NDVI (vegetation index)
NDSI (snow index)
NDWI (water index)
SRTM 90 elevation
SRTM 90 slope

Mamiglia commented 1 year ago

Following last commit I obtained the following graphs:

China percentage for each slice of the trainset:

Average Glacier content for each of the slices in trainset:

Mamiglia commented 1 year ago

Now:

Is China still problematic, given that most of the values are under 10%?
What's the optimal value for the threshold?
- 0.4, the median, so that we obtain a 50/50 ratio of glaciers vs non-glaciers
- 0.3, a value that says "yes, there's a glacier here": 25/75 ratio
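
Instead of reading the threshold off the plot, it can be derived from the class ratio we want. This is a sketch that assumes the per-slice ice fractions are available as a 1-D array. `threshold_for_positive_share` is a hypothetical helper.

```python
import numpy as np

def threshold_for_positive_share(ice_fractions, positive_share):
    """Threshold t such that roughly `positive_share` of the slices
    have an ice fraction >= t (ties can shift the share slightly)."""
    fractions = np.asarray(ice_fractions, dtype=float)
    return float(np.quantile(fractions, 1.0 - positive_share))
```

With `positive_share=0.5` this returns the median, which is the balanced 50/50 option.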

jonasbarth commented 1 year ago

Nice visualisations 😮!

I think having < 10% of China should be fine. It means we will only have a small portion of the image that is basically blacked out. But @nem-42098 should confirm that 😄
I think with the 50/50 ratio we would have balanced classes which is usually good for classification, so my first intuition is to go for that.

nem-42098 commented 1 year ago

Nice work Matteo! I would also suggest going for the threshold of 0.4 to have a balanced distribution.

So my task basically remains to create a new mask based on our threshold value; later we could merge those with the files in the drive link that @Mamiglia shared.
- After that, I would like to look at a possible CNN architecture (simple and accurate enough) that we can use.
- Somebody might start looking at ways to extract features based on the edge detection process. One idea was suggested by @Mamiglia.

Mamiglia commented 1 year ago

After what @martinezvelascojavier pointed out, 0.4 probably isn't the best choice. I'm continuing the processing (but with the numpy masks this time) and the results are very different.

Mamiglia commented 1 year ago

In the last commit I started the processing of masks, but with some unexpected results:

Apparently some of my files are corrupted, as I can't load around 10% of them.
The combination of the 1st and 2nd layers of the matrices is very similar to the mean of the third layer. The way I had interpreted them, this shouldn't happen, so maybe they don't mean what we thought?
The new distribution of mask_mean is the following: So maybe we should pick 0.2 as the threshold.
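
To pin down which files fail to load, a small scan like the following could help. It is a sketch that assumes the masks are `.npy` files in a single folder. `find_unloadable` is a hypothetical helper, and the broad `except` is deliberate because `np.load` raises different exception types depending on how a file is damaged.

```python
from pathlib import Path

import numpy as np

def find_unloadable(folder):
    """Return (file name, error) pairs for every .npy file that np.load rejects."""
    bad = []
    for path in sorted(Path(folder).glob("*.npy")):
        try:
            np.load(path)
        except Exception as err:  # ValueError, EOFError, OSError, ...
            bad.append((path.name, repr(err)))
    return bad
```

Dividing `len(find_unloadable(folder))` by the number of `.npy` files gives the corrupted share, which can be compared across machines to see whether the problem is in the download or in the source files.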

jonasbarth commented 1 year ago

The more I look into the labels, the more confused I get too haha. It looks like channel 1 and channel 3 are almost the same? Even though channel 1 supposedly shows clean ice whereas channel 3 supposedly shows whether the pixel is in HKH or China.

If you look at the image below, channel 3 (purple = in HKH) is basically the inverse of channel 1 (purple = glacier), but it doesn't make sense. Why would all the glacier parts be in China? Maybe the README is actually wrong? code

I only looked at 4-5 images, but it seems to me that there is almost no debris-covered glacier? This is also what you observe @Mamiglia when you say that the 1st and 2nd are very similar to the 3rd? The 2nd channel probably only adds very little information because there is not a lot of debris covered glacier 🤔

Edit: In the paper it also says that there is only a small amount of debris covered glaciers in the data:

Debris-covered glaciers are more similar to the background, often leading to false negatives. Debris-covered glaciers are also much rarer.

Question:

Maybe we should just keep the first two and ignore the 3rd?

Mamiglia commented 1 year ago

Yeah I say we should totally ditch the third.

Btw what the README says is that the third channel is a

mask of whether the pixel belongs to the HKH region (glaciers outside the HKH region, e.g., those in China, were not annotated)

Which isn't entirely clear to me

nem-42098 commented 1 year ago

Based on the mask in the first and second channels I have created the labels for the image. I have stored them in the text file which is now present in the dataset folder. They are indexed in the same format (kind of) as we did while creating the processed meta_data.

martinezvelascojavier commented 1 year ago

CHANNEL 0: CLEAN GLACIER
CHANNEL 1: DEBRIS GLACIER
CHANNEL 2: COMPLEMENT OF THE INTERSECTION OF CHANNEL 0 AND CHANNEL 1
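
This reading of channel 2 can be checked directly on the masks. The sketch below assumes masks of shape `(H, W, 3)` where non-zero means "set". `agreement_with_hypothesis` is a hypothetical name, and the function only measures how often channel 2 equals NOT (channel 0 AND channel 1).

```python
import numpy as np

def agreement_with_hypothesis(mask: np.ndarray) -> float:
    """Share of pixels where channel 2 equals the complement of
    (channel 0 AND channel 1)."""
    ch0 = mask[..., 0] > 0
    ch1 = mask[..., 1] > 0
    ch2 = mask[..., 2] > 0
    expected = ~(ch0 & ch1)
    return float((ch2 == expected).mean())
```

A value close to 1.0 across many masks would support the hypothesis; clearly lower values would suggest channel 2 encodes something else.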

martinezvelascojavier commented 1 year ago

**PRE-PROCESSING**: TRANSFORMED DATA: HISTOGRAMS,

DIFFERENT PREPROCESSING DEPENDING ON THE CLASSIFIER THAT IS GOING TO BE USED

**BAND SELECTION** BY USING KNN, RANDOM FOREST, FEATURES THAT PROVIDE BEST PERFORMANCE, DIGITIZE PIXEL VALUES

**HISTOGRAMS OF COLORS**: CREATE A FUNCTION WITH Nº BINS AS PARAMETER. FOR EVERY ONE OF THE 15 BANDS WE WILL NEED TO FIND THE MOST SUITABLE Nº BINS. REGARDING THE GRADIENTS

**LINEAR CLASSIFIER TO USE:**

KNN PERFORMANCE
LOGISTIC REGRESSION
NAIVE BAYES
GAUSSIAN DISCRIMINANT ANALYSIS
NEURAL NETWORKS

**DEVELOP CNN**

Mamiglia commented 1 year ago

Apparently, for some reason, there are some completely out-of-scale values in the matrices:

This is the 1st channel of the splits:

As you can see, there are some extreme values, but once they are removed the distribution looks quite normal.
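
One way to make "removing the extreme values" reproducible is a percentile cut, sketched below. The 1st and 99th percentile bounds are placeholder choices, not values taken from the data, and `drop_extremes` is a hypothetical helper. Non-finite entries are dropped first so they do not distort the percentiles.

```python
import numpy as np

def drop_extremes(values, lower_pct=1.0, upper_pct=99.0):
    """Return the finite values that lie between the given percentiles."""
    flat = np.asarray(values, dtype=float).ravel()
    flat = flat[np.isfinite(flat)]  # discard NaN and +/-inf
    low, high = np.percentile(flat, [lower_pct, upper_pct])
    return flat[(flat >= low) & (flat <= high)]
```

Plotting the histogram of `drop_extremes(channel)` next to the raw one shows how much of the odd shape comes from a handful of outliers.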

jonasbarth commented 1 year ago

Channels to use for all classifiers:

RGB
6-8
11-13
PCA (on all 15 channels)
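
A sketch of how these channel groups could be pulled out of a slice. It assumes the numbers above are 1-based positions in the 15-channel list posted earlier in the thread and that slices have shape `(H, W, 15)` with finite values. All names are illustrative. The PCA here is a plain NumPy SVD fitted on one slice for brevity; in practice it would more likely be fitted on pixels pooled from the training set.

```python
import numpy as np

# 0-based indices, assuming the 1-based numbering used in the thread.
RGB_CHANNELS = [0, 1, 2]          # B1 blue, B2 green, B3 red
CHANNELS_6_TO_8 = [5, 6, 7]       # B6_VCID_1, B6_VCID_2, B7
CHANNELS_11_TO_13 = [10, 11, 12]  # NDVI, NDSI, NDWI

def select_channels(patch, channels):
    """Keep only the listed channels of an (H, W, C) array."""
    return patch[..., channels]

def pca_channels(patch, n_components=3):
    """Project the pixels of one (H, W, C) slice onto its first principal
    components, returning an (H, W, n_components) array."""
    h, w, c = patch.shape
    pixels = patch.reshape(-1, c).astype(float)
    pixels -= pixels.mean(axis=0)  # centre each channel
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    return (pixels @ vt[:n_components].T).reshape(h, w, n_components)
```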

nem-42098 commented 1 year ago

So the values in the different spectral channels that we have are already quantised to 256 levels (8-bit), so I think we do not have to normalise the values in the pre-processing stage. If normalisation is required in the modelling stage (depending on the algorithm's requirements), then you can go for it. I think the stats provided on the 1st channel previously may be due to the presence of NaN values.

Mamiglia commented 1 year ago

Summary of values for each channel:

I find it quite strange that we have both integer values and negative ones; maybe we mixed up data from both sources?

Mamiglia commented 1 year ago

Ok, for example: og_dataset/splits/dev/slice_3_img_056.npy contains non-integer values that look like they were normalized.

Instead, og_dataset/splits/dev/slice_3_img_065.npy contains integer values only.

Also, as far as I noticed, the only images that contain non-integer values are those with > 10% of ice, meaning that the slices in the splits folder probably had normalized values, while those in the patches did not.
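
To check this across the whole folder rather than file by file, each array can be summarised with a few flags. This is a sketch, and `describe_values` is a hypothetical helper. It only reports what the values look like and says nothing about which source a file came from.

```python
import numpy as np

def describe_values(arr):
    """Summarise whether an array looks like raw integer data or normalised data."""
    arr = np.asarray(arr, dtype=float)
    finite = arr[np.isfinite(arr)]
    return {
        "all_integer": bool(np.all(finite == np.round(finite))),
        "has_negative": bool((finite < 0).any()),
        "has_nan": bool(np.isnan(arr).any()),
    }
```

Running it on every slice and grouping the results by ice percentage would confirm or refute the > 10% pattern.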

jonasbarth / fds-2022-final-project

Preparing data #4
