lindsayplatt / salt-modeling-data

Reproducible code for downloading, processing, and modeling data related to river salinization dynamics
0 stars 0 forks source link

Evaluating and selecting appropriate static attributes #56

Closed lindsayplatt closed 9 months ago

lindsayplatt commented 9 months ago

Up til now, I've largely just left the initially chosen assortment of NHD+ catchment attributes as-is (see the file: https://github.com/lindsayplatt/salt-modeling-data/blob/main/1_Download/in/nhdplus_attributes.yml). This issue will capture my thought process for eliminating, adding, or combining attributes into the final set for the random forest models.

Here are all the attributes we are using now:

image

lindsayplatt commented 9 months ago

First up, all the Flow related static attributes I have. In #54, I added some more "watershed metrics" (aka different flow statistics). Plotting the distributions between the different site categorizations using a log scale (see below), reveals that they are all fairly similar. Below that I tried plotting each variable against the other and we see mostly straight lines, meaning their relationship is really 1:1. For these reasons, I am going to keep only the medianFlow attribute and remove all others.

image

image

Code for these two plots ```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) data_to_plot <- site_attributes %>% select(site_category_fact, ends_with('Flow')) %>% mutate(across(ends_with("Flow"), log10)) %>% rename_with(~ paste0("log_", .x), ends_with("Flow")) %>% pivot_longer(cols = -site_category_fact, names_to = 'attribute') ggplot(data_to_plot, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(attribute), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Attribute value (various scales and units)') + ggtitle('Attributes by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) # All the flow values are pretty much showing the same thing. site_attributes %>% mutate(across(ends_with("Flow"), log10)) %>% rename_with(~ paste0("log_", .x), ends_with("Flow")) %>% select(starts_with('log')) %>% plot() ```
lindsayplatt commented 9 months ago

Next up Snow! These three snow attributes are pretty similar (used the same process above to evaluate these). As such, I am only going to keep avgSnow.

image

image

Code for these two snow plots ```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) data_to_plot <- site_attributes %>% select(site_category_fact, contains("Snow")) %>% pivot_longer(cols = -site_category_fact, names_to = 'attribute') ggplot(data_to_plot, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(attribute), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Attribute value (various scales and units)') + ggtitle('Attributes by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) # Average snow will be used in place of pctSnow and avgSnowDetrended (See these 1:1 plots) site_attributes %>% select(avgSnow, avgSnowDetrended, pctSnow) %>% plot() ```
lindsayplatt commented 9 months ago

Now for info related to roads. I have both roadDensity and roadStreamXings. I don't think I need both, so I am going to only keep roadDensity. Not really a 1:1 relationship but that's OK. Since the distributions are generally similar, I don't think we will get much more value out of roadStreamXings compared to roadDensity.

image

image

The code again ```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) data_to_plot <- site_attributes %>% select(site_category_fact, roadDensity, roadStreamXings) %>% pivot_longer(cols = -site_category_fact, names_to = 'attribute') ggplot(data_to_plot, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(attribute), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Attribute value (various scales and units)') + ggtitle('Attributes by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) site_attributes %>% select(roadDensity, roadStreamXings) %>% plot() ```
lindsayplatt commented 9 months ago

Now to evaluate the land cover attributes. To start, I only included pctLowDev and pctHighDev which are the land covers where the impervious surface is between 20-49% and 80-100%, respectively. This is missing some fairly critical land use types, so I went ahead and added the rest of the land uses to the download and will hopefully add together into categories that make a bit more sense.

First, I wanted to check that these were actually %s and could be added.

Looks like yes (all add to 100)! Note that even though the category is shown, none of the catchments included here have values greater than 0 for permanent ice and snow land use.

image

Then, I just wanted to display land use distribution by the site category

image

Finally, I am combining some of these into a smaller number of categories

To start, I will not be including the following because the percentages are so low (all but one site is below 10% for each of these land uses): barren land (CAT_NLCD19_31), permanent snow/ice (CAT_NLCD19_12), grassland (CAT_NLCD19_71), and shrub (CAT_NLCD19_52). Then, I will combine using the following groups:

Here are the boxplot distributions and bar charts from before but with these new categories.

image

image

The all-important code ```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) # Download locally first local_file <- 'metadata_table.tsv' download.file('https://prod-is-usgs-sb-prod-publish.s3.amazonaws.com/5669a79ee4b08895842a1d47/metadata_table.tsv', destfile = local_file) nhdplus_descriptions <- read_tsv(local_file) %>% split(.$themeLabel) %>% map(~{ # CAT = catchment # ACC = accumulated # TOT = total # Using `CAT` dplyr::select(.x, ID, description) %>% filter(grepl('^CAT_', ID)) }) # Load NHD+ definitions for NLCD nhdplus_descriptions[['Land Cover']] %>% filter(grepl('CAT_NLCD19_', ID)) # Verify that these are, in fact, percentages that could be summed. land_cover_attributes_sites <- site_attributes %>% select(-pctSnow, -starts_with('pctSoil')) %>% select(site_no, starts_with('pct')) %>% pivot_longer(-site_no, names_to = 'land use', values_to = 'percent catchment, %') %>% mutate(`land use` = gsub('pct', '', `land use`)) ggplot(land_cover_attributes_sites, aes(x = site_no, y = `percent catchment, %`, fill=`land use`)) + geom_bar(stat = 'identity',position='stack') + theme_bw() + scico::scale_fill_scico_d() + theme(axis.text.x = element_text(angle=90, size=8)) land_cover_attributes_sitecat <- site_attributes %>% select(-pctSnow, -starts_with('pctSoil')) %>% select(site_category_fact, starts_with('pct')) %>% pivot_longer(-site_category_fact, names_to = 'landuse') %>% mutate(`landuse` = gsub('pct', '', `landuse`)) ggplot(land_cover_attributes_sitecat, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(landuse), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Percent catchment area') + ggtitle('Land uses by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) land_cover_attributes_sitecat_combined <- site_attributes %>% mutate(pctForested = pctForestDecid + pctForestEverg + pctForestMixed, pctWetland = pctWetlandWoody + pctWetlandHerbaceous, pctAgriculture = pctAgPasture + pctAgCrop, pctImperviousHigh = pctOpenDev + pctLowDev, pctImperviousLow = pctMediumDev + pctHighDev) %>% select(-c(pctPermIceSnow, pctBarrenLand, pctShrub, pctGrassland, pctForestDecid, pctForestEverg, pctForestMixed, pctWetlandWoody, pctWetlandHerbaceous, pctAgPasture, pctAgCrop, pctOpenDev, pctLowDev, pctMediumDev, pctHighDev)) %>% select(-pctSnow, -starts_with('pctSoil')) %>% select(site_no, site_category_fact, starts_with('pct')) %>% pivot_longer(-c(site_no, site_category_fact), names_to = 'landuse') %>% mutate(`landuse` = gsub('pct', '', `landuse`)) ggplot(land_cover_attributes_sitecat_combined, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(landuse), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Percent catchment area') + ggtitle('Land uses by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) ggplot(land_cover_attributes_sitecat_combined, aes(x = site_no, y = value, fill=landuse)) + geom_bar(stat = 'identity',position='stack') + theme_bw() + scico::scale_fill_scico_d() + theme(axis.text.x = element_text(angle=90, size=8)) ```
lindsayplatt commented 9 months ago

Following up on land use categories, I dropped them from 16 to just 6 BUT I still think I should combine the impervious surface land uses. The definitions of all 4 are below and I think the important thing is that they all represent land uses where there are non-negligible impervious surfaces and are considered "developed".

image

image

Show by category:

image

Code, of course! ```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) land_cover_attributes_sitecat_combined_again <- site_attributes %>% mutate(pctForested = pctForestDecid + pctForestEverg + pctForestMixed, pctWetland = pctWetlandWoody + pctWetlandHerbaceous, pctAgriculture = pctAgPasture + pctAgCrop, pctDeveloped = pctOpenDev + pctLowDev + pctMediumDev + pctHighDev) %>% select(-c(pctPermIceSnow, pctBarrenLand, pctShrub, pctGrassland, pctForestDecid, pctForestEverg, pctForestMixed, pctWetlandWoody, pctWetlandHerbaceous, pctAgPasture, pctAgCrop, pctOpenDev, pctLowDev, pctMediumDev, pctHighDev)) %>% select(-pctSnow, -starts_with('pctSoil')) %>% select(site_no, site_category_fact, starts_with('pct')) %>% pivot_longer(-c(site_no, site_category_fact), names_to = 'landuse') %>% mutate(`landuse` = gsub('pct', '', `landuse`)) ggplot(land_cover_attributes_sitecat_combined_again, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(landuse), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Percent catchment area') + ggtitle('Land uses by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) ggplot(land_cover_attributes_sitecat_combined_again, aes(x = site_no, y = value, fill=landuse)) + geom_bar(stat = 'identity',position='stack') + theme_bw() + scico::scale_fill_scico_d() + theme(axis.text.x = element_text(angle=90, size=8)) ggplot(land_cover_attributes_sitecat_combined_again, aes(x = site_no, y = value, fill=landuse)) + geom_bar(width = 0.75, stat = 'identity',position='stack') + theme_bw() + scico::scale_fill_scico_d() + theme(axis.text.x = element_text(angle=90, size=8)) + facet_wrap(vars(site_category_fact), scales = 'free_x') ```
lindsayplatt commented 9 months ago

Lastly, there were a few that I determined can be summarized by a different attribute -or- just don't give us much information because the distributions are quite similar across categories (or they don't make too much scientific sense).

  1. Delete pctSoilClay, pctSoilSilt, and pctSoilSand. We will use soilPerm (soil permeability) in place of these 3 because that is really what I was trying to reveal by including these soil type metrics
  2. Delete avgStreamSlope, meanSoilSalinity, numDams2013, and pctSoilOM. None of these show much difference in attributes among sites

image

image

lindsayplatt commented 9 months ago

As of 2/21, I am here with the attributes but still have some to check, such as

I also need to revisit stream density, despite one value missing.

image

lindsayplatt commented 9 months ago

For the vegetation indices, the definitions area available here and correspond to "enhanced vegetation index" from MODIS EVI where EVI is a remote sensing metric used to quantify vegetation greenness. This is not actually very useful to us in this context and the land cover type is really want we want, so removing them.

Deleting:

lindsayplatt commented 9 months ago

For precip vs runoff, they look very similary and I also tried calculating the ratio of precip to runoff but end up with very similar distributions across all three. So, I think I will keep the ratio so that both precip & runoff are included.

Deleting: avgPrecip and avgRunoff but keeping precipRunoffRatio.

image

Code for the plots ```r library(targets) library(tidyverse) site_attributes <- tar_read(p6_site_attr) data_to_plot <- site_attributes %>% mutate(precipRunoffRatio = avgRunoff/avgPrecip) %>% select(site_category_fact, avgPrecip, avgRunoff, precipRunoffRatio) %>% pivot_longer(cols = -site_category_fact, names_to = 'attribute') ggplot(data_to_plot, aes(x = site_category_fact, y = value)) + geom_boxplot(aes(fill = site_category_fact)) + facet_wrap(vars(attribute), scales = 'free_y') + theme_bw() + scico::scale_fill_scico_d(begin = 0, end = 0.75) + xlab('Site Category') + ylab('Attribute value (various scales and units)') + ggtitle('Attributes by site categorization') + theme(strip.background = element_rect(fill = 'white', color = 'transparent')) ```
lindsayplatt commented 9 months ago

More on runoff, I did try using the monthly runoff NHD attributes (and averaging by season). I think I will keep these seasonal runoff attributes for now until after running some of the models because they seem interesting and may matter.

Adding:

image

Here is how I calculated seasonal runoff using new NHD+ attributes:

mutate(attr_avgRunoffWinter = mean(c(CAT_WB5100_DEC,CAT_WB5100_JAN, CAT_WB5100_FEB, CAT_WB5100_MAR)),
           attr_avgRunoffSpring = mean(c(CAT_WB5100_APR, CAT_WB5100_MAY)),
           attr_avgRunoffSummer = mean(c(CAT_WB5100_JUN, CAT_WB5100_JUL, CAT_WB5100_AUG)),
           attr_avgRunoffFall = mean(c(CAT_WB5100_SEP, CAT_WB5100_OCT, CAT_WB5100_NOV)))
lindsayplatt commented 9 months ago

Rather than using the snow from the Water Balance Model (WBM), I am going to use % snow * annual precip from the climate section.

They are fairly similar (left = CAT_WBM_SNW, right = CAT_PPT7100_ANN * CAT_PRSNOW)

image
lindsayplatt commented 9 months ago

Topographic wetness index seems rathe complex (see here), so I am going to use other attributes to get at the idea of infiltration.

lindsayplatt commented 9 months ago

Here are all the remaining "groundwater" related attributes. I don't think we need this many. I think it would be nice to stick to as few sources as possible, so I may eliminate the attributes from Zell and Sanford 2020 since there are other options within the NHD+ static attributes.

Deleting:

image

lindsayplatt commented 9 months ago

I don't know what to do with choosing which depth2WT?

image
lindsayplatt commented 9 months ago

Removing stream density which just isn't very interesting or meaningful for this application

image
lindsayplatt commented 8 months ago

Needing to decide which area and salt related attributes to use. Do not need all 6. Leaning away from the ratios and also not planning to use areaSqKm, which is just the individual COMID areas because COMID catchments are specifically made to be similar in size, so doubtful those would tell us much. Note that I changed some of the names in order to get the labels to appear on the resulting plots. See code for those details.

Results from this:

Given these results, I think we can move forward with using both roadSaltCOMID and areaCumulative because they were not really correlated with each other but both were at least somewhat correlated with roadSaltCumulative (so we don't need that one because it would be less likely to add information to our model).


Used a Spearman Rank Correlation to compare how correlated these different attributes are. Followed along with the blogs here and here.

image

image

Correlation matrix:

                   areaCumulative   areaCOMID  areaRatio roadSaltCumulative roadSaltCOMID roadSaltRatio
areaCumulative          1.0000000  0.25540600 -0.8267446         -0.4513342   -0.19378503    -0.7529419
areaCOMID               0.2554060  1.00000000  0.2468535         -0.1420524   -0.01076636     0.3324710
areaRatio              -0.8267446  0.24685355  1.0000000          0.3736243    0.17799200     0.9484655
roadSaltCumulative     -0.4513342 -0.14205237  0.3736243          1.0000000    0.82695146     0.3747887
roadSaltCOMID          -0.1937850 -0.01076636  0.1779920          0.8269515    1.00000000     0.3213864
roadSaltRatio          -0.7529419  0.33247099  0.9484655          0.3747887    0.32138642     1.0000000
Code to generate these correlation plots ```r # Run Spearman Rank Correlation to see if the area & salt attributes are related # so that we only use attributes that are showing different things. library(targets) library(tidyverse) library(GGally) attrs_area_salt <- tar_read(p6_site_attr_rf) %>% select(areaCumulative = areaCumulativeSqKm, areaCOMID = areaSqKm, areaRatio, roadSaltCumulative = roadSaltCumulativePerSqKm, roadSaltCOMID = roadSaltPerSqKm, roadSaltRatio) corr1$p.value < 0.05 plot(attrs_area_salt) cor(attrs_area_salt, method = "spearman") ggcorr(data = attrs_area_salt, method = c("everything", "spearman"), nbreaks = 5, low = "darkred", mid = "white", high = "seagreen4", label = TRUE, label_round = 2) + ggtitle('Spearman Rank Correlation for area and salt attributes') ```
lindsayplatt commented 8 months ago

Trying to find an attribute that gets at "winter severity". Currently comparing "winter duration" based on days between first/last freeze day and average winter air temperature (mean of monthly air temperatures between December and March). Below are the distributions of these attributes along with snow:

image

Results from this:

Given these results, I will move forward with using both avgWinterAirTemp and avgSnow. Needed to choose between either avgWinterDuration OR avgWinterAirTemp because they were correlated with each other and temperature makes more ecological sense because it would control whether salt was used during a snowfall event because salt is not effective below a certain temperature, and would also control whether precip fell as rain, freezing rain, or snow (which could be a more informative distinction between categories in the random forest model than simply the number of days between first and last freezes).


Ran Spearman Rank Correlation for these three winter-related attributes:

image

image

Correlation matrix:

                     avgSnow avgWinterAirTemp avgWinterDuration
avgSnow            1.0000000       -0.6780457         0.6339131
avgWinterAirTemp  -0.6780457        1.0000000        -0.7535945
avgWinterDuration  0.6339131       -0.7535945         1.0000000
lindsayplatt commented 8 months ago

Added transmissivity back into the mix and now checking correlations between "final" list of 17 attributes.

Results from this when considering just those that are highly correlated (I am using >= abs(0.70)):

Final decision from this activity is to remove areaCumulativeSqKm since it is highly correlated to medianFlow and pctOpenWater (more upstream watershed area = bigger rivers). This leaves 16 attributes for the final collection.

image