Closed lindsayplatt closed 9 months ago
First up, all the Flow-related static attributes I have. In #54, I added some more "watershed metrics" (aka different flow statistics). Plotting the distributions across the different site categorizations on a log scale (see below) reveals that they are all fairly similar. Below that, I tried plotting each variable against the others and we see mostly straight lines, meaning their relationships are close to 1:1. For these reasons, I am going to keep only the `medianFlow` attribute and remove all others.
Next up: Snow! These three snow attributes are pretty similar (I used the same process as above to evaluate them). As such, I am only going to keep `avgSnow`.
Now for info related to roads. I have both `roadDensity` and `roadStreamXings`. I don't think I need both, so I am only going to keep `roadDensity`. The relationship is not really 1:1, but that's OK. Since the distributions are generally similar, I don't think we will get much more value out of `roadStreamXings` compared to `roadDensity`.
Now to evaluate the land cover attributes. To start, I only included `pctLowDev` and `pctHighDev`, which are the land covers where the impervious surface is between `20-49%` and `80-100%`, respectively. This is missing some fairly critical land use types, so I went ahead and added the rest of the land uses to the download and will hopefully add them together into categories that make a bit more sense.
Checking whether the land use percentages sum to 100 for each catchment: looks like yes (all add to 100)! Note that even though the category is shown, none of the catchments included here have values greater than 0 for permanent ice and snow land use.
To start, I will not be including the following because the percentages are so low (all but one site are below 10% for each of these land uses): barren land (`CAT_NLCD19_31`), permanent snow/ice (`CAT_NLCD19_12`), grassland (`CAT_NLCD19_71`), and shrub (`CAT_NLCD19_52`). Then, I will combine using the following groups:
- open water (`CAT_NLCD19_11`) -- leaving as-is
- deciduous forest (`CAT_NLCD19_41`) + evergreen forest (`CAT_NLCD19_42`) + mixed forest (`CAT_NLCD19_43`)
- emergent herbaceous wetland (`CAT_NLCD19_95`) + woody wetland (`CAT_NLCD19_90`)
- pasture/hay (`CAT_NLCD19_81`) + cropland (`CAT_NLCD19_82`)
- developed open space (`CAT_NLCD19_21`) + low development (`CAT_NLCD19_22`)
- medium development (`CAT_NLCD19_23`) + high development (`CAT_NLCD19_24`)

Here are the boxplot distributions and bar charts from before but with these new categories.
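The grouping listed above could be computed along these lines. This is a minimal sketch on invented toy data, not the code from the repo: the `nlcd` table and its values are made up, and the names `pctWetland` and `pctAgriculture` are hypothetical (only `pctOpenWater`, `pctForested`, `pctLowDev`, and `pctHighDev` appear in this issue).

```r
library(dplyr)

# Toy data: two made-up catchments with NLCD 2019 percent-cover columns (values are invented)
nlcd <- tibble(
  COMID = c(1, 2),
  CAT_NLCD19_11 = c(2, 10),
  CAT_NLCD19_21 = c(5, 3),   CAT_NLCD19_22 = c(10, 2),
  CAT_NLCD19_23 = c(8, 1),   CAT_NLCD19_24 = c(5, 0),
  CAT_NLCD19_41 = c(20, 30), CAT_NLCD19_42 = c(10, 20), CAT_NLCD19_43 = c(5, 10),
  CAT_NLCD19_81 = c(10, 5),  CAT_NLCD19_82 = c(15, 4),
  CAT_NLCD19_90 = c(6, 10),  CAT_NLCD19_95 = c(4, 5)
)

# Sum the individual NLCD classes into the six combined land cover groups
land_cover <- nlcd %>%
  transmute(
    COMID,
    pctOpenWater   = CAT_NLCD19_11,
    pctForested    = CAT_NLCD19_41 + CAT_NLCD19_42 + CAT_NLCD19_43,
    pctWetland     = CAT_NLCD19_95 + CAT_NLCD19_90,   # hypothetical name
    pctAgriculture = CAT_NLCD19_81 + CAT_NLCD19_82,   # hypothetical name
    pctLowDev      = CAT_NLCD19_21 + CAT_NLCD19_22,
    pctHighDev     = CAT_NLCD19_23 + CAT_NLCD19_24
  )

# Total percent cover per catchment across the six groups
group_totals <- rowSums(land_cover[, -1])
```

In the toy data the groups still sum to 100 because the dropped classes (barren, snow/ice, grassland, shrub) are absent; with real catchments the totals would fall slightly below 100.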
Following up on land use categories, I dropped them from 16 to just 6 BUT I still think I should combine the impervious surface land uses. The definitions of all 4 are below and I think the important thing is that they all represent land uses where there are non-negligible impervious surfaces and are considered "developed".
Show by category:
Lastly, there were a few that I determined can be summarized by a different attribute -or- just don't give us much information because the distributions are quite similar across categories (or they don't make too much scientific sense).
- `pctSoilClay`, `pctSoilSilt`, and `pctSoilSand`. We will use `soilPerm` (soil permeability) in place of these 3 because that is really what I was trying to reveal by including these soil type metrics.
- `avgStreamSlope`, `meanSoilSalinity`, `numDams2013`, and `pctSoilOM`. None of these show much difference in attributes among sites.

As of 2/21, I am here with the attributes but still have some to check, such as
I also need to revisit stream density, despite one value missing.
For the vegetation indices, the definitions are available here and correspond to "enhanced vegetation index" from MODIS EVI, where EVI is a remote sensing metric used to quantify vegetation greenness. This is not actually very useful to us in this context and the land cover type is really what we want, so I am removing them.
Deleting:
- `vegIndSpring` (`CAT_EVI_AMJ_2012`)
- `vegIndSummer` (`CAT_EVI_JAS_2012`)
- `vegIndAutumn` (`CAT_EVI_OND_2011`)
- `vegIndWinter` (`CAT_EVI_JFM_2012`)

For precip vs runoff, they look very similar. I also tried calculating the ratio of precip to runoff but ended up with very similar distributions across all three. So, I think I will keep the ratio so that both precip & runoff are included.
Deleting: `avgPrecip` and `avgRunoff` but keeping `precipRunoffRatio`.
More on runoff, I did try using the monthly runoff NHD attributes (and averaging by season). I think I will keep these seasonal runoff attributes for now until after running some of the models because they seem interesting and may matter.
Adding:
avgRunoffFall
avgRunoffSpring
avgRunoffSummer
avgRunoffWinter
Here is how I calculated seasonal runoff using new NHD+ attributes:
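```r
# Assumes dplyr is loaded and the data are processed row by row (e.g., after rowwise()),
# so that each mean() is taken across months within a single catchment.
mutate(attr_avgRunoffWinter = mean(c(CAT_WB5100_DEC, CAT_WB5100_JAN, CAT_WB5100_FEB, CAT_WB5100_MAR)),
       attr_avgRunoffSpring = mean(c(CAT_WB5100_APR, CAT_WB5100_MAY)),
       attr_avgRunoffSummer = mean(c(CAT_WB5100_JUN, CAT_WB5100_JUL, CAT_WB5100_AUG)),
       attr_avgRunoffFall = mean(c(CAT_WB5100_SEP, CAT_WB5100_OCT, CAT_WB5100_NOV)))
```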
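One thing worth double-checking with a `mutate()` like the one above is whether `mean(c(...))` is evaluated per catchment or pooled over every row. The sketch below illustrates the difference on invented toy data (the `runoff` table and its values are made up; whether the actual pipeline already uses `rowwise()` is not shown in this issue).

```r
library(dplyr)

# Toy monthly runoff for two made-up catchments (values are invented)
runoff <- tibble(
  COMID = c(1, 2),
  CAT_WB5100_JUN = c(10, 100),
  CAT_WB5100_JUL = c(20, 200),
  CAT_WB5100_AUG = c(30, 300)
)

# Without rowwise(): mean() pools all rows, so every catchment gets the same value
pooled <- runoff %>%
  mutate(attr_avgRunoffSummer = mean(c(CAT_WB5100_JUN, CAT_WB5100_JUL, CAT_WB5100_AUG)))

# With rowwise(): mean() is evaluated separately for each catchment
per_catchment <- runoff %>%
  rowwise() %>%
  mutate(attr_avgRunoffSummer = mean(c(CAT_WB5100_JUN, CAT_WB5100_JUL, CAT_WB5100_AUG))) %>%
  ungroup()
```

Here `pooled` assigns the same summer average to both catchments, while `per_catchment` gives each catchment its own three-month mean.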
Rather than using the snow from the Water Balance Model (WBM), I am going to use % snow * annual precip from the climate section.
They are fairly similar (left = `CAT_WBM_SNW`, right = `CAT_PPT7100_ANN` * `CAT_PRSNOW`)
Topographic wetness index seems rather complex (see here), so I am going to use other attributes to get at the idea of infiltration.
Here are all the remaining "groundwater" related attributes. I don't think we need this many. I think it would be nice to stick to as few sources as possible, so I may eliminate the attributes from Zell and Sanford 2020 since there are other options within the NHD+ static attributes.
Deleting:
- `transmissivity`: Transmissivity is "the depth-integrated hydraulic conductivity" and describes "ease of flow" of water through the subsurface. Soil permeability is also doing this. I should probably choose one of these, but they aren't exactly similar. I think I will choose soil permeability because it is in units that I understand and that map to infiltration more easily (inches per hour). Plus, its patterns align more with the other variables, where baseflow category sites look more distinct from the others, which is not true of transmissivity.
- `avgSoilStorage`: I am going to eliminate soil moisture storage (`CAT_WBM_STO`) since that one is similar in idea to available water capacity, where it has to do with water available in the root zone for plant uptake. I don't think this is as relevant to our question.

I don't know yet how to choose which `depth2WT` attribute to use.
Removing stream density, which just isn't very interesting or meaningful for this application.
Needing to decide which area and salt related attributes to use. We do not need all 6. I am leaning away from the ratios and am also not planning to use `areaSqKm`, which is just the individual COMID areas. COMID catchments are specifically made to be similar in size, so it is doubtful those would tell us much. Note that I changed some of the names in order to get the labels to appear on the resulting plots. See code for those details.
Results from this:
- `roadSaltCOMID` and `roadSaltCumulative` were highly correlated (Spearman's rho = 0.83)
- `roadSaltCOMID` and `areaCumulative` were not really correlated (Spearman's rho = -0.19)
- `roadSaltCumulative` and `areaCumulative` were somewhat correlated (Spearman's rho = -0.45)

Given these results, I think we can move forward with using both `roadSaltCOMID` and `areaCumulative` because they were not really correlated with each other but both were at least somewhat correlated with `roadSaltCumulative` (so we don't need that one because it would be less likely to add information to our model).
Used a Spearman Rank Correlation to compare how correlated these different attributes are. Followed along with the blogs here and here.
Correlation matrix:
| | areaCumulative | areaCOMID | areaRatio | roadSaltCumulative | roadSaltCOMID | roadSaltRatio |
|---|---|---|---|---|---|---|
| areaCumulative | 1.0000000 | 0.25540600 | -0.8267446 | -0.4513342 | -0.19378503 | -0.7529419 |
| areaCOMID | 0.2554060 | 1.00000000 | 0.2468535 | -0.1420524 | -0.01076636 | 0.3324710 |
| areaRatio | -0.8267446 | 0.24685355 | 1.0000000 | 0.3736243 | 0.17799200 | 0.9484655 |
| roadSaltCumulative | -0.4513342 | -0.14205237 | 0.3736243 | 1.0000000 | 0.82695146 | 0.3747887 |
| roadSaltCOMID | -0.1937850 | -0.01076636 | 0.1779920 | 0.8269515 | 1.00000000 | 0.3213864 |
| roadSaltRatio | -0.7529419 | 0.33247099 | 0.9484655 | 0.3747887 | 0.32138642 | 1.0000000 |
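For reference, a Spearman rank correlation matrix like the one above can be produced with base R's `cor()`. This is a minimal sketch on invented toy values, not the actual site data; the `attrs` data frame is a stand-in for the real attribute table.

```r
# Toy data standing in for the site attributes (values are invented)
attrs <- data.frame(
  areaCumulative     = c(12, 150, 40, 900, 75, 310),
  roadSaltCumulative = c(30, 8, 22, 2, 15, 5),
  roadSaltCOMID      = c(28, 10, 25, 1, 12, 6)
)

# Spearman's rho is computed on ranks, so it captures monotonic (not just linear) relationships
rho <- cor(attrs, method = "spearman")

# Round for easier reading when printing
rho_rounded <- round(rho, 2)
```

If any attribute has missing values, `cor()` returns `NA` for the affected pairs unless a `use` option such as `"pairwise.complete.obs"` is supplied.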
Trying to find an attribute that gets at "winter severity". Currently comparing "winter duration" based on days between first/last freeze day and average winter air temperature (mean of monthly air temperatures between December and March). Below are the distributions of these attributes along with snow:
Results from this:
- `avgWinterDuration` and `avgWinterAirTemp` were highly correlated (Spearman's rho = -0.75)
- `avgWinterDuration` and `avgSnow` were pretty well correlated (Spearman's rho = 0.63)
- `avgWinterAirTemp` and `avgSnow` were pretty well correlated (Spearman's rho = -0.68)

Given these results, I will move forward with using both `avgWinterAirTemp` and `avgSnow`. I needed to choose between `avgWinterDuration` and `avgWinterAirTemp` because they were correlated with each other. Temperature makes more ecological sense: it would control whether salt was used during a snowfall event (salt is not effective below a certain temperature) and whether precip fell as rain, freezing rain, or snow, which could be a more informative distinction between categories in the random forest model than simply the number of days between first and last freezes.
Ran Spearman Rank Correlation for these three winter-related attributes:
Correlation matrix:
| | avgSnow | avgWinterAirTemp | avgWinterDuration |
|---|---|---|---|
| avgSnow | 1.0000000 | -0.6780457 | 0.6339131 |
| avgWinterAirTemp | -0.6780457 | 1.0000000 | -0.7535945 |
| avgWinterDuration | 0.6339131 | -0.7535945 | 1.0000000 |
Added transmissivity back into the mix and now checking correlations between "final" list of 17 attributes.
Results from this when considering just those that are highly correlated (I am using `abs(rho) >= 0.70`):
- `medianFlow` and `areaCumulativeSqKm` = 0.89 (more area = more runoff = larger rivers; probably don't need cumulative area since these are so highly correlated)
- `baseFlowInd` and `avgGWRecharge` = 0.79 (it makes sense that these are correlated but I think I want to keep both because ecologically water may be infiltrating as GW recharge but not making its way to the stream)
- `avgGWRecharge` and `avgSnow` = 0.78 (this may be a cool connection but I think these are ecologically different enough to keep both)
- `pctForested` and `pctDeveloped` = -0.73 (this makes sense but I want to keep all land use percents separately)
- `areaCumulativeSqKm` and `pctOpenWater` = 0.71 (similar to `medianFlow` and watershed area, larger upstream watershed = bigger river, but I want to keep all percentages and am considering removing watershed area based on the first bullet)
- `avgBasinSlope` and `pctForested` = 0.71 (makes sense because more forest would be in mountainous areas, but flat regions could also be forested, so going to keep both)

Final decision from this activity is to remove `areaCumulativeSqKm` since it is highly correlated with `medianFlow` and `pctOpenWater` (more upstream watershed area = bigger rivers). This leaves 16 attributes for the final collection.
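Screening a correlation matrix for pairs at or above the 0.70 cutoff can be automated. The helper below is a hypothetical sketch (`high_cor_pairs` is not a function from the repo), demonstrated on the rounded winter correlations reported earlier in this issue rather than on the full 17-attribute matrix.

```r
# Hypothetical helper: list attribute pairs whose absolute correlation meets a threshold
high_cor_pairs <- function(rho, threshold = 0.70) {
  # Only look above the diagonal so each pair is reported once
  idx <- which(upper.tri(rho) & abs(rho) >= threshold, arr.ind = TRUE)
  data.frame(
    attr1 = rownames(rho)[idx[, "row"]],
    attr2 = colnames(rho)[idx[, "col"]],
    rho = rho[idx],
    stringsAsFactors = FALSE
  )
}

# Rounded Spearman correlations for the three winter-related attributes
winter_names <- c("avgSnow", "avgWinterAirTemp", "avgWinterDuration")
winter <- matrix(
  c( 1.00, -0.68,  0.63,
    -0.68,  1.00, -0.75,
     0.63, -0.75,  1.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(winter_names, winter_names)
)

flagged <- high_cor_pairs(winter)
```

With these values only the `avgWinterAirTemp` / `avgWinterDuration` pair is flagged, which matches the decision above to keep just one of those two.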
Up until now, I've largely just left the initially chosen assortment of NHD+ catchment attributes as-is (see the file: https://github.com/lindsayplatt/salt-modeling-data/blob/main/1_Download/in/nhdplus_attributes.yml). This issue will capture my thought process for eliminating, adding, or combining attributes into the final set for the random forest models.
Here are all the attributes we are using now: