jrforrest / aop_nlcd_classification

Automated re-classification of NLCD with reduced NEON AOP data

Quick visualization to compare classes in our data set #1

Open jrforrest opened 6 years ago

jrforrest commented 6 years ago

Trying to dig into why our models are so inaccurate so far, I figured some quick visualizations of our data might give us a little insight into what we can optimize. I suggested histograms, but after some consideration I think we care less about the value distribution within each class than about the actual values of each feature. For the sake of viewing many classes alongside each other, I'm just using the mean and median value of each feature for a given class to give us a quick, high-level look at how the features compare to each other.
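
For reference, the gist of how I'm generating these is along these lines (a minimal sketch; `samples.csv` and the `class` column name are stand-ins for whatever our actual CSV layout ends up being):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical layout: one row per sampled pixel, a "class" column holding the
# NLCD class label, and the remaining columns holding the feature values.
df = pd.read_csv("samples.csv")
feature_cols = [c for c in df.columns if c != "class"]

# One line per class: the mean value of each feature (swap .mean() for
# .median() to get the median plots).
fig, ax = plt.subplots(figsize=(12, 6))
for nlcd_class, group in df.groupby("class"):
    ax.plot(feature_cols, group[feature_cols].mean(), label=str(nlcd_class))

ax.set_xlabel("feature")
ax.set_ylabel("mean value")
ax.legend(title="NLCD class")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("unscaled_mean.png")
```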

I'm really uncertain whether this is a suitable approach to better understanding how our classes compare, but the visualizations produced here seem to indicate a lack of distinct shapes in the data for our various classes. Here, I've plotted the mean and median for the entire available data set, both before and after scaling.

[Attached plots: scaled_median, scaled_mean, unscaled_median]

Let me know if you guys have any insights on this, or want some additional visualizations (they'll be easy to generate now that the plumbing work is done). I figured one heatmap per class, with value on the Y axis and feature name on the X axis, may be another good high-level breakdown of how different our classes actually are in the data.
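
Something like this is roughly what I have in mind for those per-class heatmaps (same hypothetical CSV layout as the sketch above):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("samples.csv")  # same hypothetical layout as above
feature_cols = [c for c in df.columns if c != "class"]

for nlcd_class, group in df.groupby("class"):
    # For each feature, bin its values and count pixels per bin, giving a
    # heatmap with feature index on the X axis and value bins on the Y axis.
    values = group[feature_cols].values
    bins = np.linspace(values.min(), values.max(), 50)
    counts = np.column_stack(
        [np.histogram(group[col], bins=bins)[0] for col in feature_cols]
    )
    plt.figure(figsize=(12, 6))
    plt.imshow(counts, aspect="auto", origin="lower",
               extent=[0, len(feature_cols), bins[0], bins[-1]])
    plt.title(f"class {nlcd_class}")
    plt.xlabel("feature index")
    plt.ylabel("value")
    plt.colorbar(label="pixel count")
    plt.savefig(f"heatmap_{nlcd_class}.png")
    plt.close()
```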

KMurph07 commented 6 years ago

I gained access to a more powerful computer through work, and I think it may be better to go with the original plan of processing all 426 bands, rather than just the 32, and then splicing out which ones appear to distinguish the NLCD classes best. Also, after learning more about the orthorectification side of things, I don't think the eastern side of the site should be used for training data anymore due to cloud interference on the spectrometer data. I've created a new training dataset for deciduous forest since that was the class affected and will get its spectral signatures after I finish running the cleaning function on the 426 bands. I'm not sure how that will affect things after we fit the machine learning model and then extrapolate it out for the site, but that seems like a problem for our future selves.

Also, I think this is what you were mentioning the other day about altering scaling, but it might be beneficial to look at changing the pixel size to 5x5 m or 10x10 m. My only concern with that is that the evergreen trees, developed areas, and water areas are scarce and sporadic enough that it would no longer be a pure pixel or would significantly decrease an already small sample size. I don't know if this is kosher or not, but could we make the pixel size larger for deciduous forest, shrub/scrub, and pasture/hay while keeping it at 1x1 m for evergreen forest, open water, and high intensity developed space? That might get messy later on as well, but I'm curious if the small pixel size is picking up on micro differences and we are missing out on the bigger picture defining characteristics.

jrforrest commented 6 years ago

I gained access to a more powerful computer through work, and I think it may be better to go with the original plan of processing all 426 bands, rather than just the 32, and then splicing out which ones appear to distinguish the NLCD classes best.

I'm not too sure, but I'm thinking 426 is probably going to be too many features for anything but our SVM and MLP models; RF, for example, may require some dimensionality reduction. However, starting with 426 features in the CSV data set means I can toy with a couple of methods of dimensionality reduction, and perhaps we can do something that produces more variance among the reduced features than the sampling that was used to generate the current batch of CSVs.
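
To make that concrete, I'm imagining something as simple as PCA over the full set of bands and checking how much variance a handful of components keeps (just a sketch; the filename and component count are placeholders):

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("samples_426_bands.csv")  # hypothetical full-band CSV
feature_cols = [c for c in df.columns if c != "class"]

# Project the 426 bands down to a handful of components; n_components is just
# a starting point we'd tune based on how the models behave.
pca = PCA(n_components=20)
reduced = pca.fit_transform(df[feature_cols])

# How much of the total variance the first N components retain.
print(pca.explained_variance_ratio_.cumsum())
```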

Also, if the tooling used to generate these data points is falling over with that many features, it's possible I could help optimize whatever we're using there (Python scripts, I'm hoping).

Also, after learning more about the orthorectification side of things, I don't think the eastern side of the site should be used for training data anymore due to cloud interference on the spectrometer data.

I know basically nothing about any of this, but my layman's hunch is that cloud interference is a very likely explanation for the lack of variability we're seeing in the spectral data. I've noted from some manual examination that some pixels in the current CSV set do seem to have fairly distinct spectral signatures, but I'm thinking we sampled enough interference-affected pixels for each class to skew the models away from learning on those more distinct signatures.

If this is correct (which I'm not sure about at all), then we could potentially work around the problem by filtering out the pixels that show the homogeneous spectral signature we're seeing in those plots. But given the already small size of the dataset we're working with here, I'm guessing you're on the right track in just pulling our sampling from areas of the site that hopefully don't have interference.
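
If we did want to try the filtering route, I'm picturing something dead simple like dropping the pixels whose spectra are nearly flat across the bands (the threshold and CSV layout here are guesses on my part):

```python
import pandas as pd

df = pd.read_csv("samples.csv")  # hypothetical per-pixel CSV as before
feature_cols = [c for c in df.columns if c != "class"]

# Spread of each pixel's values across the bands; pixels with a nearly flat
# spectrum are the ones I suspect are dominated by interference.
spectral_std = df[feature_cols].std(axis=1)

# Drop the flattest 10% -- a completely arbitrary cutoff that would need tuning.
filtered = df[spectral_std > spectral_std.quantile(0.10)]
```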

I've created a new training dataset for deciduous forest since that was the class affected

Oh, interesting. Because all of the classes seem to have fairly similar spectral signatures (by mean and median values, at least), I figured maybe more than a single class was affected by cloud cover or other interference.

I'm not sure how that will affect things after we fit the machine learning model and then extrapolate it out for the site, but that seems like a problem for our future selves.

Sorry, I'm not totally sure what you're getting at here. What's the worry with fitting the model to a cleaner dataset for deciduous forest? Are you thinking that the reflectance data in your new sampling won't be representative of the rest of the site, so our trained models will be unable to accurately classify pixels from other locations?

Also, I think this is what you were mentioning the other day about altering scaling

Oh, I think I was talking about feature scaling [1] for training models like the SVM. That just means taking the numbers in our feature data set and scaling the individual values into a smaller range, which keeps the model accurate without needing to alter the kernel function.
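
In practice it's a one-liner with scikit-learn; something along these lines (sketch only, not our actual training script):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Squash each feature into [0, 1] before fitting the SVM. Using a pipeline
# means the same scaling gets applied to anything we later ask it to classify.
model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
# model.fit(X_train, y_train)  # X_train = feature columns, y_train = class labels
```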

but it might be beneficial to look at changing the pixel size to 5x5 m or 10x10 m. ... I'm curious if the small pixel size is picking up on micro differences and we are missing out on the bigger picture defining characteristics.

Ah! This seems like something really worth looking into. Maybe there's enough noise in a 1 m pixel that we're getting a lot of spectral similarity between classes, and a reduction in spatial resolution would normalize that out a bit. Again, I'm way out of my area of expertise here, but this strikes me as something really worth fiddling with.

Do we have an easy (automated) way of fiddling with spatial resolution when we take our samples? Maybe I can help with some tooling there?
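
If it isn't already automated on your end, block-averaging the reflectance arrays before sampling might be enough; a minimal numpy sketch, assuming each band is a 2D array in memory:

```python
import numpy as np

def downsample(band, factor=5):
    """Average non-overlapping factor x factor blocks of a 2D band array,
    e.g. turning 1x1 m pixels into 5x5 m pixels."""
    h, w = band.shape
    h, w = h - h % factor, w - w % factor  # trim edges that don't divide evenly
    blocks = band[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))
```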

My only concern with that is that the evergreen trees, developed areas, and water areas are scarce and sporadic enough that it would no longer be a pure pixel or would significantly decrease an already small sample size.

Yeah, I definitely share this concern. I don't think we should increase the pixel size beyond the area these features tend to occupy. Unless there's some very clear and necessary benefit to reducing the resolution for those classes, I think we should stay away from that.

I don't know if this is kosher or not, but could we make the pixel size larger for deciduous forest, shrub/scrub, and pasture/hay while keeping it at 1x1 m for evergreen forest, open water, and high intensity developed space?

Oh, interesting idea. My concern is that we won't know which pixel size to use for the data we want to classify going forward; since we won't know a pixel's class prior to classification, we won't know which pixel size to feed the models. There may be a way to work around this, but maybe the additional complexity isn't worth it just yet? Starting with the simple approach of a single pixel size might be better for now.


My takeaway from this so far is that our next steps should be: first, try harder to avoid areas of interference and see if that gives more distinct spectral signatures for our various classes of pixels; second, up the feature count in our training/test data so we can try other methods of dimensionality reduction for the models that require it and run with the full 426 for those that don't; and then, if that doesn't provide satisfactory results, try fiddling with spatial resolution. I can try to lend a hand with that stuff, especially if there are scripts used to generate this data that could use some optimizing. I could also probably help with spatial resolution reduction techniques, since I'm used to that problem from image processing. If there's nothing I can do to help there, though, I'll happily sit back and wait for more CSVs :)

[1] https://en.wikipedia.org/wiki/Feature_scaling