jonasbarth / fds-2022-final-project

Final Project for Fundamentals of Data Science 2022.

Preparing data #4

Closed jonasbarth closed 1 year ago

jonasbarth commented 1 year ago

Things to do

Data

Some useful information about the data we are using, taken from the paper. Information about Landsat 7.

Mamiglia commented 1 year ago

I was thinking about the preparation of the dataset, and I came up with this:

Preprocessing

jonasbarth commented 1 year ago

Thanks, that is super detailed and useful! These are the 15 channels that we have. source

  1. LE7 B1 (blue)
  2. LE7 B2 (green)
  3. LE7 B3 (red)
  4. LE7 B4 (near infrared)
  5. LE7 B5 (shortwave infrared 1)
  6. LE7 B6_VCID_1 (low-gain thermal infrared)
  7. LE7 B6_VCID_2 (high-gain thermal infrared)
  8. LE7 B7 (shortwave infrared 2)
  9. LE7 B8 (panchromatic)
  10. LE7 BQA (quality bitmask)
  11. NDVI (vegetation index)
  12. NDSI (snow index)
  13. NDWI (water index)
  14. SRTM 90 elevation
  15. SRTM 90 slope
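
To pick channels by name rather than by position, something like the following could work. It is only a sketch: it assumes the arrays are stored channels-last with shape (H, W, 15) and that the channel order matches the list above, and `CHANNELS` and `select_channels` are hypothetical names, not something from the repository.

```python
import numpy as np

# Channel order as listed above. That the stored arrays follow this order
# (0-based, channels-last) is an assumption.
CHANNELS = [
    "B1", "B2", "B3", "B4", "B5", "B6_VCID_1", "B6_VCID_2", "B7", "B8",
    "BQA", "NDVI", "NDSI", "NDWI", "elevation", "slope",
]


def select_channels(image, names):
    """Return a copy of `image` (H, W, 15) containing only the named channels."""
    idx = [CHANNELS.index(name) for name in names]
    return image[..., idx]


# Synthetic example: channel k is filled with the value k.
demo = np.broadcast_to(np.arange(15, dtype=np.float32), (4, 4, 15))
rgb = select_channels(demo, ["B3", "B2", "B1"])
print(rgb.shape)            # (4, 4, 3)
print(rgb[0, 0].tolist())   # [2.0, 1.0, 0.0]
```

Selecting by name keeps the classifier code readable if we later decide to drop or reorder channels.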

Mamiglia commented 1 year ago

Following last commit I obtained the following graphs:

China percentage for each slice of the trainset: image

Average Glacier content for each of the slices in trainset: image

Mamiglia commented 1 year ago

Now:

jonasbarth commented 1 year ago

Nice visualisations 😮!

nem-42098 commented 1 year ago

Nice work Matteo! I would also suggest going for the threshold of 0.4 to have a balanced distribution.

Mamiglia commented 1 year ago

After what @martinezvelascojavier pointed out, 0.4 probably isn't the best choice. I'm continuing the processing, this time with the numpy masks, and the results are very different.

Mamiglia commented 1 year ago

In the last commit I started the processing of masks, but with some unexpected results:

jonasbarth commented 1 year ago

The more I look into the labels, the more confused I get too haha. It looks like channel 1 and channel 3 are almost the same? Even though channel 1 supposedly shows clean ice whereas channel 3 supposedly shows whether the pixel is in HKH or China.

If you look at the image below, channel 3 (purple = in HKH) is basically the inverse of channel 1 (purple = glacier), but it doesn't make sense. Why would all the glacier parts be in China? Maybe the README is actually wrong? image code

I only looked at 4-5 images, but it seems to me that there is almost no debris-covered glacier? Is this also what you observe, @Mamiglia, when you say that the 1st and 2nd are very similar to the 3rd? The 2nd channel probably adds very little information because there is not a lot of debris-covered glacier 🤔

Edit: The paper also says that there is only a small amount of debris-covered glacier in the data:

Debris-covered glaciers are more similar to the background, often leading to false negatives. Debris-covered glaciers are also much rarer.

Question:

Maybe we should just keep the first two and ignore the 3rd?

Mamiglia commented 1 year ago

Yeah I say we should totally ditch the third.

Btw what the README says is that the third channel is a

mask of whether the pixel belongs to the HKH region (glaciers outside the HKH region, e.g., those in China, were not annotated)

Which isn't entirely clear to me

nem-42098 commented 1 year ago

Based on the masks in the first and second channels, I have created the labels for the images. I have stored them in a text file that is now in the dataset folder. They are indexed in (roughly) the same format we used when creating the processed meta_data.
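
For reference, one way such labels could be derived is sketched below. This is not necessarily how the text file was produced: it assumes a channels-last mask of shape (H, W, 3), treats a pixel as glacier if either of the first two channels is set, and applies the 0.4 threshold discussed earlier to the glacier fraction. `glacier_fraction` and `label_from_mask` are hypothetical helper names.

```python
import numpy as np


def glacier_fraction(mask):
    """Fraction of pixels flagged as glacier in either of the first two channels."""
    glacier = (mask[..., 0] > 0) | (mask[..., 1] > 0)
    return float(glacier.mean())


def label_from_mask(mask, threshold=0.4):
    """Return 1 if the glacier fraction exceeds `threshold`, else 0.

    Applying the threshold to the glacier fraction is an assumption.
    """
    return int(glacier_fraction(mask) > threshold)


# Synthetic 10x10 mask: 50 clean-ice pixels and 10 debris pixels.
demo_mask = np.zeros((10, 10, 3), dtype=np.uint8)
demo_mask[:5, :, 0] = 1
demo_mask[5, :, 1] = 1
print(glacier_fraction(demo_mask))  # 0.6
print(label_from_mask(demo_mask))   # 1
```

Keeping the threshold as a parameter makes it easy to compare how balanced the classes are for different cut-offs.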

martinezvelascojavier commented 1 year ago

CHANNEL 0: CLEAN GLACIER
CHANNEL 1: DEBRIS GLACIER
CHANNEL 2: COMPLEMENT OF THE INTERSECTION OF CHANNEL 0 AND CHANNEL 1
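
This reading of the third channel can be checked directly on a mask. The sketch below assumes a channels-last mask of shape (H, W, 3) with non-zero meaning "set", and compares the third channel against both the complement of the intersection and the complement of the union of the first two (the two differ only where exactly one of them is set). `describe_third_channel` is a hypothetical helper, and the demo mask is synthetic.

```python
import numpy as np


def describe_third_channel(mask):
    """Report which set relation the third channel satisfies, if any."""
    c0 = mask[..., 0] > 0
    c1 = mask[..., 1] > 0
    c2 = mask[..., 2] > 0
    return {
        "complement_of_intersection": bool(np.array_equal(c2, ~(c0 & c1))),
        "complement_of_union": bool(np.array_equal(c2, ~(c0 | c1))),
    }


# Synthetic mask whose third channel is built as the complement of the union.
demo = np.zeros((2, 2, 3), dtype=np.uint8)
demo[0, 0, 0] = 1  # clean ice only
demo[0, 1, 1] = 1  # debris only
demo[..., 2] = 1 - np.maximum(demo[..., 0], demo[..., 1])
print(describe_third_channel(demo))
# {'complement_of_intersection': False, 'complement_of_union': True}
```

Running this over a handful of real masks would show which relation, if either, actually holds in the dataset.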

martinezvelascojavier commented 1 year ago

PRE-PROCESSING: TRANSFORMED DATA: HISTOGRAMS,

LINEAR CLASSIFIER TO USE:

DEVELOP CNN

Mamiglia commented 1 year ago

Apparently, for some reason, there are some completely out-of-scale values in the matrices:

image

This is the 1st channel of the splits

image

As you can see, there are some extreme values, but once they are removed the distribution looks quite normal.
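
A minimal sketch of how the extreme values could be dropped before plotting is shown below. The 1st/99th percentile cut-offs are an assumption, not the values used for the plots above, `trim_extremes` is a hypothetical helper, and the data is synthetic.

```python
import numpy as np


def trim_extremes(values, low=1.0, high=99.0):
    """Return the values lying between the given percentiles (NaNs are dropped)."""
    values = np.asarray(values, dtype=np.float64).ravel()
    lo, hi = np.nanpercentile(values, [low, high])
    keep = (values >= lo) & (values <= hi)  # NaN compares False, so it is excluded
    return values[keep]


# Synthetic channel: roughly normal values plus a few out-of-scale entries.
rng = np.random.default_rng(0)
channel = rng.normal(loc=100.0, scale=10.0, size=10_000)
channel[:5] = 1e6
trimmed = trim_extremes(channel)
print(channel.max(), trimmed.max())
```

Percentile-based trimming avoids hard-coding a valid range, which helps as long as we are not sure what scale each channel is supposed to be on.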

jonasbarth commented 1 year ago

Channels to use for all classifiers:

nem-42098 commented 1 year ago

image

So the values in the different spectral channels are already quantised to 256 levels (8-bit), so I think we do not have to normalise them in the pre-processing stage. If normalisation is required in the modelling stage (depending on the algorithm's requirements), then you can go for it. I think the stats reported for the 1st channel earlier may be due to the presence of NaN values.
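
To see whether NaNs are what distorts the statistics, a NaN-aware summary per channel could look like the sketch below. It assumes a channels-last array of shape (H, W, C); `channel_summary` is a hypothetical helper and the demo array is synthetic.

```python
import numpy as np


def channel_summary(image):
    """Per-channel NaN count, minimum and maximum, ignoring NaNs."""
    flat = image.reshape(-1, image.shape[-1]).astype(np.float64)
    return {
        "nan_count": np.isnan(flat).sum(axis=0),
        "min": np.nanmin(flat, axis=0),
        "max": np.nanmax(flat, axis=0),
    }


# Synthetic (1, 2, 2) image: the second channel contains one NaN.
demo = np.array([[[0.0, 10.0], [255.0, np.nan]]])
summary = channel_summary(demo)
print(summary["nan_count"].tolist())  # [0, 1]
print(summary["min"].tolist())        # [0.0, 10.0]
print(summary["max"].tolist())        # [255.0, 10.0]
```

If the minimum and maximum stay within 0 to 255 once NaNs are ignored, the 8-bit assumption holds for that channel.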

Mamiglia commented 1 year ago

Summary of values for each channel: image

I find it quite strange that we have both integer values and negative ones. Maybe we mixed up data from both sources?

Mamiglia commented 1 year ago

Ok, for example: og_dataset/splits/dev/slice_3_img_056.npy contains non-integer values that look like they were normalized.

Instead, og_dataset/splits/dev/slice_3_img_065.npy contains integer values only.

Also, as far as I can tell, the only images that contain non-integer values are those with > 10% ice, which probably means that the slices in the splits folder had normalized values while those in the patches did not.
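
A quick check for this could look like the sketch below. `is_integer_valued` is a hypothetical helper, and the two demo arrays are synthetic stand-ins for a raw and a normalized slice; for the real files one would pass in the result of `np.load(path)`.

```python
import numpy as np


def is_integer_valued(array):
    """True if every finite value in `array` is a whole number (NaN/inf are ignored)."""
    array = np.asarray(array, dtype=np.float64)
    finite = array[np.isfinite(array)]
    return bool(np.all(finite == np.round(finite)))


raw_like = np.array([[0.0, 17.0], [255.0, 3.0]])
normalized_like = np.array([[-0.52, 0.13], [1.7, 0.0]])
print(is_integer_valued(raw_like))         # True
print(is_integer_valued(normalized_like))  # False
```

Running this over every .npy file in the splits folder and comparing the result against the ice percentage would confirm or rule out the > 10% pattern.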