Evaluating and selecting appropriate static attributes

lindsayplatt commented 9 months ago

Up til now, I've largely just left the initially chosen assortment of NHD+ catchment attributes as-is (see the file: https://github.com/lindsayplatt/salt-modeling-data/blob/main/1_Download/in/nhdplus_attributes.yml). This issue will capture my thought process for eliminating, adding, or combining attributes into the final set for the random forest models.

Here are all the attributes we are using now:

lindsayplatt commented 9 months ago

First up, all the Flow related static attributes I have. In #54, I added some more "watershed metrics" (aka different flow statistics). Plotting the distributions between the different site categorizations using a log scale (see below), reveals that they are all fairly similar. Below that I tried plotting each variable against the other and we see mostly straight lines, meaning their relationship is really 1:1. For these reasons, I am going to keep only the medianFlow attribute and remove all others.

Code for these two plots

```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) data_to_plot <- site_attributes %>% select(site_category_fact, ends_with('Flow')) %>% mutate(across(ends_with("Flow"), log10)) %>% rename_with(~ paste0("log_", .x), ends_with("Flow")) %>% pivot_longer(cols = -site_category_fact, names_to = 'attribute') ggplot(data_to_plot, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(attribute), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Attribute value (various scales and units)') + ggtitle('Attributes by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) # All the flow values are pretty much showing the same thing. site_attributes %>% mutate(across(ends_with("Flow"), log10)) %>% rename_with(~ paste0("log_", .x), ends_with("Flow")) %>% select(starts_with('log')) %>% plot() ```

lindsayplatt commented 9 months ago

Next up Snow! These three snow attributes are pretty similar (used the same process above to evaluate these). As such, I am only going to keep avgSnow.

Code for these two snow plots

```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) data_to_plot <- site_attributes %>% select(site_category_fact, contains("Snow")) %>% pivot_longer(cols = -site_category_fact, names_to = 'attribute') ggplot(data_to_plot, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(attribute), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Attribute value (various scales and units)') + ggtitle('Attributes by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) # Average snow will be used in place of pctSnow and avgSnowDetrended (See these 1:1 plots) site_attributes %>% select(avgSnow, avgSnowDetrended, pctSnow) %>% plot() ```

lindsayplatt commented 9 months ago

Now for info related to roads. I have both roadDensity and roadStreamXings. I don't think I need both, so I am going to only keep roadDensity. Not really a 1:1 relationship but that's OK. Since the distributions are generally similar, I don't think we will get much more value out of roadStreamXings compared to roadDensity.

The code again

```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) data_to_plot <- site_attributes %>% select(site_category_fact, roadDensity, roadStreamXings) %>% pivot_longer(cols = -site_category_fact, names_to = 'attribute') ggplot(data_to_plot, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(attribute), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Attribute value (various scales and units)') + ggtitle('Attributes by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) site_attributes %>% select(roadDensity, roadStreamXings) %>% plot() ```

lindsayplatt commented 9 months ago

Now to evaluate the land cover attributes. To start, I only included pctLowDev and pctHighDev which are the land covers where the impervious surface is between 20-49% and 80-100%, respectively. This is missing some fairly critical land use types, so I went ahead and added the rest of the land uses to the download and will hopefully add together into categories that make a bit more sense.

First, I wanted to check that these were actually %s and could be added.

Looks like yes (all add to 100)! Note that even though the category is shown, none of the catchments included here have values greater than 0 for permanent ice and snow land use.

Then, I just wanted to display land use distribution by the site category

Finally, I am combining some of these into a smaller number of categories

To start, I will not be including the following because the percentages are so low (all but one site is below 10% for each of these land uses): barren land (CAT_NLCD19_31), permanent snow/ice (CAT_NLCD19_12), grassland (CAT_NLCD19_71), and shrub (CAT_NLCD19_52). Then, I will combine using the following groups:

pctOpenWater = open water (CAT_NLCD19_11) -- leaving as-is
pctForested = deciduous forest (CAT_NLCD19_41) + evergreen forest (CAT_NLCD19_42) + mixed forest (CAT_NLCD19_43)
pctWetland = herbaceous wetland (CAT_NLCD19_95) + woody wetland (CAT_NLCD19_90)
pctAgriculture = pasture/hay (CAT_NLCD19_81) + cropland (CAT_NLCD19_82)
pctImperviousLow = open development (CAT_NLCD19_21) + low development (CAT_NLCD19_22)
pctImperviousHigh = medium development (CAT_NLCD19_23) + high development (CAT_NLCD19_24)

Here are the boxplot distributions and bar charts from before but with these new categories.

The all-important code

```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) # Download locally first local_file <- 'metadata_table.tsv' download.file('https://prod-is-usgs-sb-prod-publish.s3.amazonaws.com/5669a79ee4b08895842a1d47/metadata_table.tsv', destfile = local_file) nhdplus_descriptions <- read_tsv(local_file) %>% split(.$themeLabel) %>% map(~{ # CAT = catchment # ACC = accumulated # TOT = total # Using `CAT` dplyr::select(.x, ID, description) %>% filter(grepl('^CAT_', ID)) }) # Load NHD+ definitions for NLCD nhdplus_descriptions[['Land Cover']] %>% filter(grepl('CAT_NLCD19_', ID)) # Verify that these are, in fact, percentages that could be summed. land_cover_attributes_sites <- site_attributes %>% select(-pctSnow, -starts_with('pctSoil')) %>% select(site_no, starts_with('pct')) %>% pivot_longer(-site_no, names_to = 'land use', values_to = 'percent catchment, %') %>% mutate(`land use` = gsub('pct', '', `land use`)) ggplot(land_cover_attributes_sites, aes(x = site_no, y = `percent catchment, %`, fill=`land use`)) + geom_bar(stat = 'identity',position='stack') + theme_bw() + scico::scale_fill_scico_d() + theme(axis.text.x = element_text(angle=90, size=8)) land_cover_attributes_sitecat <- site_attributes %>% select(-pctSnow, -starts_with('pctSoil')) %>% select(site_category_fact, starts_with('pct')) %>% pivot_longer(-site_category_fact, names_to = 'landuse') %>% mutate(`landuse` = gsub('pct', '', `landuse`)) ggplot(land_cover_attributes_sitecat, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(landuse), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Percent catchment area') + ggtitle('Land uses by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) land_cover_attributes_sitecat_combined <- site_attributes %>% mutate(pctForested = pctForestDecid + pctForestEverg + pctForestMixed, pctWetland = pctWetlandWoody + pctWetlandHerbaceous, pctAgriculture = pctAgPasture + pctAgCrop, pctImperviousHigh = pctOpenDev + pctLowDev, pctImperviousLow = pctMediumDev + pctHighDev) %>% select(-c(pctPermIceSnow, pctBarrenLand, pctShrub, pctGrassland, pctForestDecid, pctForestEverg, pctForestMixed, pctWetlandWoody, pctWetlandHerbaceous, pctAgPasture, pctAgCrop, pctOpenDev, pctLowDev, pctMediumDev, pctHighDev)) %>% select(-pctSnow, -starts_with('pctSoil')) %>% select(site_no, site_category_fact, starts_with('pct')) %>% pivot_longer(-c(site_no, site_category_fact), names_to = 'landuse') %>% mutate(`landuse` = gsub('pct', '', `landuse`)) ggplot(land_cover_attributes_sitecat_combined, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(landuse), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Percent catchment area') + ggtitle('Land uses by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) ggplot(land_cover_attributes_sitecat_combined, aes(x = site_no, y = value, fill=landuse)) + geom_bar(stat = 'identity',position='stack') + theme_bw() + scico::scale_fill_scico_d() + theme(axis.text.x = element_text(angle=90, size=8)) ```

lindsayplatt commented 9 months ago

Following up on land use categories, I dropped them from 16 to just 6 BUT I still think I should combine the impervious surface land uses. The definitions of all 4 are below and I think the important thing is that they all represent land uses where there are non-negligible impervious surfaces and are considered "developed".

Show by category:

Code, of course!

```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) land_cover_attributes_sitecat_combined_again <- site_attributes %>% mutate(pctForested = pctForestDecid + pctForestEverg + pctForestMixed, pctWetland = pctWetlandWoody + pctWetlandHerbaceous, pctAgriculture = pctAgPasture + pctAgCrop, pctDeveloped = pctOpenDev + pctLowDev + pctMediumDev + pctHighDev) %>% select(-c(pctPermIceSnow, pctBarrenLand, pctShrub, pctGrassland, pctForestDecid, pctForestEverg, pctForestMixed, pctWetlandWoody, pctWetlandHerbaceous, pctAgPasture, pctAgCrop, pctOpenDev, pctLowDev, pctMediumDev, pctHighDev)) %>% select(-pctSnow, -starts_with('pctSoil')) %>% select(site_no, site_category_fact, starts_with('pct')) %>% pivot_longer(-c(site_no, site_category_fact), names_to = 'landuse') %>% mutate(`landuse` = gsub('pct', '', `landuse`)) ggplot(land_cover_attributes_sitecat_combined_again, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(landuse), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Percent catchment area') + ggtitle('Land uses by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) ggplot(land_cover_attributes_sitecat_combined_again, aes(x = site_no, y = value, fill=landuse)) + geom_bar(stat = 'identity',position='stack') + theme_bw() + scico::scale_fill_scico_d() + theme(axis.text.x = element_text(angle=90, size=8)) ggplot(land_cover_attributes_sitecat_combined_again, aes(x = site_no, y = value, fill=landuse)) + geom_bar(width = 0.75, stat = 'identity',position='stack') + theme_bw() + scico::scale_fill_scico_d() + theme(axis.text.x = element_text(angle=90, size=8)) + facet_wrap(vars(site_category_fact), scales = 'free_x') ```

lindsayplatt commented 9 months ago

Lastly, there were a few that I determined can be summarized by a different attribute -or- just don't give us much information because the distributions are quite similar across categories (or they don't make too much scientific sense).

Delete pctSoilClay, pctSoilSilt, and pctSoilSand. We will use soilPerm (soil permeability) in place of these 3 because that is really what I was trying to reveal by including these soil type metrics
Delete avgStreamSlope, meanSoilSalinity, numDams2013, and pctSoilOM. None of these show much difference in attributes among sites

lindsayplatt commented 9 months ago

As of 2/21, I am here with the attributes but still have some to check, such as

vegetative indices,
precip vs runoff,
available water capacity vs soil storage,
which depth to water table to use, and the
topographic wetness index

I also need to revisit stream density, despite one value missing.

lindsayplatt commented 9 months ago

For the vegetation indices, the definitions area available here and correspond to "enhanced vegetation index" from MODIS EVI where EVI is a remote sensing metric used to quantify vegetation greenness. This is not actually very useful to us in this context and the land cover type is really want we want, so removing them.

Deleting:

vegIndSpring (CAT_EVI_AMJ_2012)
vegIndSummer (CAT_EVI_JAS_2012)
vegIndAutumn (CAT_EVI_OND_2011)
vegIndWinter (CAT_EVI_JFM_2012)

lindsayplatt commented 9 months ago

For precip vs runoff, they look very similary and I also tried calculating the ratio of precip to runoff but end up with very similar distributions across all three. So, I think I will keep the ratio so that both precip & runoff are included.

Deleting: avgPrecip and avgRunoff but keeping precipRunoffRatio.

Code for the plots

```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) data_to_plot <- site_attributes %>% mutate(precipRunoffRatio = avgRunoff/avgPrecip) %>% select(site_category_fact, avgPrecip, avgRunoff, precipRunoffRatio) %>% pivot_longer(cols = -site_category_fact, names_to = 'attribute') ggplot(data_to_plot, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(attribute), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Attribute value (various scales and units)') + ggtitle('Attributes by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) ```

lindsayplatt commented 9 months ago

More on runoff, I did try using the monthly runoff NHD attributes (and averaging by season). I think I will keep these seasonal runoff attributes for now until after running some of the models because they seem interesting and may matter.

Adding:

avgRunoffFall
avgRunoffSpring
avgRunoffSummer
avgRunoffWinter

Here is how I calculated seasonal runoff using new NHD+ attributes:

mutate(attr_avgRunoffWinter = mean(c(CAT_WB5100_DEC,CAT_WB5100_JAN, CAT_WB5100_FEB, CAT_WB5100_MAR)),
           attr_avgRunoffSpring = mean(c(CAT_WB5100_APR, CAT_WB5100_MAY)),
           attr_avgRunoffSummer = mean(c(CAT_WB5100_JUN, CAT_WB5100_JUL, CAT_WB5100_AUG)),
           attr_avgRunoffFall = mean(c(CAT_WB5100_SEP, CAT_WB5100_OCT, CAT_WB5100_NOV)))

lindsayplatt commented 9 months ago

Rather than using the snow from the Water Balance Model (WBM), I am going to use % snow * annual precip from the climate section.

They are fairly similar (left = CAT_WBM_SNW, right = CAT_PPT7100_ANN * CAT_PRSNOW)

lindsayplatt commented 9 months ago

Topographic wetness index seems rathe complex (see here), so I am going to use other attributes to get at the idea of infiltration.

lindsayplatt commented 9 months ago

Here are all the remaining "groundwater" related attributes. I don't think we need this many. I think it would be nice to stick to as few sources as possible, so I may eliminate the attributes from Zell and Sanford 2020 since there are other options within the NHD+ static attributes.

Deleting:

transmissivity: Transmissivity is the "the depth-integrated hydraulic conductivity" and describes "ease of flow" of water through the subsurface. Soil permeability is also doing this. Probably should choose one of these but they aren't exactly similar. I think I will choose soil permeability because it is in units that I understand and map to infiltration more easily (inches per hour), plus the patterns align more with the other variables where baseflow category sites look more distinct from the others, which is not true of transmissivity.
avgSoilStorage: I am going to eliminate soil moisture storage (CAT_WBM_STO) since that one is similar in idea to available water capacity, where it has to do with water available in the root zone for plant uptake. I don't think this is as relevant to our question.

lindsayplatt commented 9 months ago

I don't know what to do with choosing which depth2WT?

lindsayplatt commented 9 months ago

Removing stream density which just isn't very interesting or meaningful for this application

lindsayplatt commented 8 months ago

Needing to decide which area and salt related attributes to use. Do not need all 6. Leaning away from the ratios and also not planning to use areaSqKm, which is just the individual COMID areas because COMID catchments are specifically made to be similar in size, so doubtful those would tell us much. Note that I changed some of the names in order to get the labels to appear on the resulting plots. See code for those details.

Results from this:

roadSaltCOMID and roadSaltCumulative were highly correlated (spearman's rho = 0.83)
roadSaltCOMID and areaCumulative were not really correlated (spearman's rho = -0.19)
roadSaltCumulative and areaCumulative were somewhat correlated (spearman's rho = -0.45)

Given these results, I think we can move forward with using both roadSaltCOMID and areaCumulative because they were not really correlated with each other but both were at least somewhat correlated with roadSaltCumulative (so we don't need that one because it would be less likely to add information to our model).

Used a Spearman Rank Correlation to compare how correlated these different attributes are. Followed along with the blogs here and here.

Correlation matrix:

                   areaCumulative   areaCOMID  areaRatio roadSaltCumulative roadSaltCOMID roadSaltRatio
areaCumulative          1.0000000  0.25540600 -0.8267446         -0.4513342   -0.19378503    -0.7529419
areaCOMID               0.2554060  1.00000000  0.2468535         -0.1420524   -0.01076636     0.3324710
areaRatio              -0.8267446  0.24685355  1.0000000          0.3736243    0.17799200     0.9484655
roadSaltCumulative     -0.4513342 -0.14205237  0.3736243          1.0000000    0.82695146     0.3747887
roadSaltCOMID          -0.1937850 -0.01076636  0.1779920          0.8269515    1.00000000     0.3213864
roadSaltRatio          -0.7529419  0.33247099  0.9484655          0.3747887    0.32138642     1.0000000

Code to generate these correlation plots

```r # Run Spearman Rank Correlation to see if the area & salt attributes are related # so that we only use attributes that are showing different things. library(targets) library(tidyverse) library(GGally) attrs_area_salt <- tar_read(p6_site_attr_rf) %>% select(areaCumulative = areaCumulativeSqKm, areaCOMID = areaSqKm, areaRatio, roadSaltCumulative = roadSaltCumulativePerSqKm, roadSaltCOMID = roadSaltPerSqKm, roadSaltRatio) corr1$p.value < 0.05 plot(attrs_area_salt) cor(attrs_area_salt, method = "spearman") ggcorr(data = attrs_area_salt, method = c("everything", "spearman"), nbreaks = 5, low = "darkred", mid = "white", high = "seagreen4", label = TRUE, label_round = 2) + ggtitle('Spearman Rank Correlation for area and salt attributes') ```

lindsayplatt commented 8 months ago

Trying to find an attribute that gets at "winter severity". Currently comparing "winter duration" based on days between first/last freeze day and average winter air temperature (mean of monthly air temperatures between December and March). Below are the distributions of these attributes along with snow:

Results from this:

avgWinterDuration and avgWinterAirTemp were highly correlated (spearman's rho = -0.75)
avgWinterDuration and avgSnow were pretty well correlated (spearman's rho = 0.63)
avgWinterAirTemp and avgSnow were pretty well correlated (spearman's rho = -0.68)

Given these results, I will move forward with using both avgWinterAirTemp and avgSnow. Needed to choose between either avgWinterDuration OR avgWinterAirTemp because they were correlated with each other and temperature makes more ecological sense because it would control whether salt was used during a snowfall event because salt is not effective below a certain temperature, and would also control whether precip fell as rain, freezing rain, or snow (which could be a more informative distinction between categories in the random forest model than simply the number of days between first and last freezes).

Ran Spearman Rank Correlation for these three winter-related attributes:

Correlation matrix:

                     avgSnow avgWinterAirTemp avgWinterDuration
avgSnow            1.0000000       -0.6780457         0.6339131
avgWinterAirTemp  -0.6780457        1.0000000        -0.7535945
avgWinterDuration  0.6339131       -0.7535945         1.0000000

lindsayplatt commented 8 months ago

Added transmissivity back into the mix and now checking correlations between "final" list of 17 attributes.

Results from this when considering just those that are highly correlated (I am using >= abs(0.70)):

medianFlow and areaCumulativeSqKm = 0.89 (more area = more runoff = larger rivers; probably don't need cumulative area since these are so highly correlated)
baseFlowInd and avgGWRecharge = 0.79 (it makes sense that these are correlated but I think I want to keep both because ecologically water may be infiltrating as GW Recharge but not making its way to the stream
avgGWRecharge and avgSnow = 0.78 (this may be a cool connection but I think these are ecologically different enough to keep both)
pctForested and pctDeveloped = -0.73 (this makes sense but I want to keep all land use percents separately)
areaCumulativeSqKm and pctOpenWater = 0.71 (similar to medianFlow and watershed area, larger upstream watershed = bigger river, but I want to keep all percentages and am considering removing watershed area based on the first bullet)
avgBasinSlope and pctForested = 0.71 (makes sense because more forest would be in mountainous areas but could also be forested in flat regions, so going to keep both)

Final decision from this activity is to remove areaCumulativeSqKm since it is highly correlated to medianFlow and pctOpenWater (more upstream watershed area = bigger rivers). This leaves 16 attributes for the final collection.

lindsayplatt / salt-modeling-data