scworland / restore-2018

scripts for predicting streamflow characteristics in ungaged basins for RESTORE

Aggregate basin chars time series #2

Closed scworland closed 6 years ago

scworland commented 6 years ago

Several of the basin characteristics, e.g., land use and land cover (LULC), are annual time series. We need to decide on the best way to handle these covariates. Here are some options:

rrknight-usgs commented 6 years ago

IMO...

  1. I think just for data preservation, adding the _yyyy to the LULC data would be good housekeeping, no matter how we use the data.

  2. How to describe the LULC... there could be several thoughts on this, and maybe we would be well served by bringing someone else into the discussion. It seems that several descriptors of the LULC over the POR would be reasonable to consider (mean, SD...), but I also am curious about whether there's ever been a covariate that represents LULC over time (change per unit time). There would be some tough ?s to answer here, like assumption of linear change per yr....

  3. I will wait to hear your wisdom on the 3rd question. It seems that keeping a common window of record would be helpful for developing the models.... but it may winnow down the n=1320 considerably. I've never done this when the record required for inclusion was more than 30 yrs. Thoughts?

Extra thought: As I mentioned to Scott this morning, I think exploring the ability of the predictive models to predict on a different time period will be useful. We need to explore limitations to that thought. One thing that crossed my mind is to have a common window of time on which the models are built, and then only testing / allowing predictions from the eqns for time-sensitive covariates within that window. For example, we say that only sites with time-series Q data from 1970 - 2010 will be considered in model development. Then, even though a given site had record that fed the model from 1975 - 1990, conditions for this site could be predicted for other periods within the overall model time domain as long as input data are available, which they would be... if that makes any sense at all. I know the whole temporal component can be messy...
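
Just to make the common-window idea concrete, here is a rough dplyr sketch of that kind of screen; the q_years table of site-years with daily Q, its column names, and the 15-yr cutoff are all hypothetical:

library(dplyr)

# hypothetical table: one row per site-year with usable daily-Q record
# q_years: columns site_no, year
window_start <- 1970
window_end   <- 2010

sites_in_window <- q_years %>%
  filter(year >= window_start, year <= window_end) %>%
  group_by(site_no) %>%
  summarize(n_yrs = n_distinct(year)) %>%
  filter(n_yrs >= 15) %>%   # placeholder minimum record length inside the window
  pull(site_no)             # sites that would be eligible for model development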

If we can form the question really well... I could send it to my friend Jan Seibert.... he seems to have good thoughts on statistical things like this.

scworland commented 6 years ago
  1. If we aggregate the LULC data, then what "_year" would follow? The years? Like mean_forest_1970_2010?

  2. Changes in LULC sounds cool to me, but what does that mean in terms of the LHS of the equation (i.e., flowstats)? For example, how does a 7Q10 that was calculated from 1970-2010 relate to a change in LULC over the same time period? It seems like the LHS would need to be a change as well. Similar to an indexed financial measurement (or climate) where everything is chained to some year and measured as a deviation from that year.

  3. I will pass this one to Will.

Extra question: You should be able to send Jan a link to this issue and he can comment directly if he has a GitHub account. (Email will work fine too!)

👍 Emojis on GitHub 👍 I almost forgot the most important tool on GitHub. Type a colon, a word, and a closing colon... e.g., ":fire:" without the quotes... endless opportunities

ghost commented 6 years ago

I have read the threads. No deep wisdom from me yet, as I don't have it wrapped into my understanding.

Scott, one thing (not a problem for me to adapt to): as things get big I tend to organize them into R environments.
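
For anyone not used to that pattern, a tiny illustration of stashing related objects in a dedicated environment (the names here are just placeholders):

# stash related objects in a dedicated environment (names are placeholders)
GFACTOR <- new.env()
assign("bfi", bfi, envir = GFACTOR)      # store an object under a name
ls(GFACTOR)                              # list what has been stored
bfi_again <- get("bfi", envir = GFACTOR) # retrieve it later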

You will see this in the monster outputs of the "gfactor" computation base. Rodney, I expect to get akqdecay to GitHub in the next couple of weeks and break from CVS. Then Scott can pull this into your data management plan to keep others happy.

-wha (iPhone)

rrknight-usgs commented 6 years ago

@wasquith-usgs CVS?

@scworland-usgs Let's collectively frame the question / problem / issue that you were talking about yesterday in regard to proper / improper use of the models for different temporal periods. Your thoughts on the topic are much deeper than mine... please take a stab at it (maybe we need a new issue so that if/when I share it with Jan he won't have to weed through all the others). I'll add on questions and what-not.

Am I correctly using the @ handles? ❓ I like this form of communication. 💬

scworland commented 6 years ago

@rrknight-usgs you are using the @ handles correctly! 🥇 I am also glad you are jumping on board with this. We will have a good record of what we did and why we did it.

My main concern was treating time as an exchangeable group, e.g., 1975 is exchangeable with 1962, etc. I was also not sure how we would handle the LULC classes to make everything sensible. For example, we couldn't just drop a percent-developed number into the model without adjusting the other classes (because the classes have to sum to 100%). I think we decided that using counterfactual plots (or partial dependence functions) would accomplish the same goal but avoid these conceptual issues.
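
For illustration only (not the approach we settled on), here is a rough sketch of what perturbing one class and renormalizing the others might look like; the lulc_row data frame and the class names are hypothetical:

library(dplyr)

# hypothetical one-basin row of LULC percentages that sum to 100
lulc_row <- data.frame(developed = 10, forest = 55, ag = 25, wetland = 10)

# counterfactual: bump percent developed and rescale the remaining classes
# so the composition still sums to 100%
bump_developed <- function(x, new_dev) {
  others <- select(x, -developed)
  others <- others * (100 - new_dev) / sum(unlist(others))
  cbind(data.frame(developed = new_dev), others)
}

bump_developed(lulc_row, new_dev = 30) # row that could be fed to predict() on a fitted model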

I extracted all of the basin characteristics from the gdbs and saved them as feather files in the data folder. Feather files are much faster to use and load directly as a dataframe in R or python (I will normally use R, but I did use python to extract the tables from the dbfs). Next time you both do a git pull it should download all those files to your machine. To open one of them up, just do this:

install.packages("feather") # only needed once

library(feather)
bfi <- read_feather("data/basinchars/BFI.feather")

scworland commented 6 years ago

I averaged each LULC class across basins for each year.

library(feather) # read_feather()
library(dplyr)   # pipes, mutate_all(), summarize_all()
library(readr)   # parse_number()

# extract LULC file names
list_file <- list.files(path="data/basinchars/", pattern = glob2rx("*LULC*.feather"))

lulc_comb <- NULL
for (i in 1:length(list_file)){
  fname <- list_file[i] # grab the i-th file name

  # turn off warnings
  options(warn=-1)

  lulc_hold <- read_feather(paste0("data/basinchars/",fname)) %>%
    mutate_all(funs(replace(., . == -9999, NA))) %>% # replace -9999 with NA
    select(-SITE_NO) %>% # remove site number before averaging
    mutate(year=parse_number(fname)) %>% # extract year from filename
    summarize_all(funs(mean), na.rm=T) # take mean across basins
  lulc_comb <- rbind(lulc_comb, lulc_hold) # append dataframes

  # turn warnings back on
  options(warn=0)
}
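
For reference, a minimal sketch of how a plot like the one below could be drawn from lulc_comb (assuming it has one row per year and one column per class):

library(tidyr)   # gather()
library(ggplot2)

lulc_comb %>%
  gather(key, value, -year) %>%   # long format: one row per class-year
  ggplot() +
  geom_line(aes(year, value)) +
  facet_wrap(~key, scales = "free")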

and the resulting plot: [image: basin-mean value of each LULC class by year]

If we are going to combine classes we need to think pretty hard about it. At least looking at the basin averages, you wouldn't want to combine mixed forest and evergreen, for example (they trend in opposite directions). Here is a random subset of 10 sites (I dropped the "Mechanic" classes and "Perennial"):

# plot the LULC time series for a random sample of 10 sites
# (requires a lulc_comb that still carries SITE_NO, i.e., per-site values
#  rather than the basin averages computed above)
lulc_comb %>%
  mutate(siteno=as.factor(SITE_NO)) %>%
  select(-SITE_NO,-Perennial_) %>%
  select(-contains("Mechanic")) %>%
  filter(siteno %in% sample(levels(siteno),10)) %>%
  arrange(siteno) %>%
  gather(key,value,-year,-siteno) %>%
  ggplot() + 
  geom_line(aes(year,value,color=siteno)) + 
  facet_wrap(~key, scale="free")

[image: LULC class time series for a random subset of 10 sites]

There are some interesting step changes in some of the classes.

Here is another random set of 10 sites to illustrate the variation (i.e., same plot as above, but different sites):

[image: same plot for a different random subset of 10 sites]

Here are the questions:

  1. Are we ok with taking the mean or median of each class for the POR over which the streamflow stat was calculated?
  2. If we do average over time, how do we actually make predictions for the ungaged sites? i.e., how do we summarize the LULC for those basins? Over which years? One option is to just pick an arbitrary range of years.
  3. Do we want to combine or drop any of the LULC classes?
scworland commented 6 years ago

The following plots might help in making a decision about how to handle the period of record.

[images: period-of-record summary plots]

rrknight-usgs commented 6 years ago

My responses in regard to the LULC comment are in bold

  1. Are we ok with taking the mean or median of each class for the POR over which the streamflow stat was calculated? **To answer the question, I think median would be the way to go, given the way some classes jerk around through time. Given the DV hydrograph simulation approach, it seems that we should maybe use the yearly values? I can see why we wouldn't use the annual series BECAUSE of how the class values for a site suddenly change. (A rough sketch of the POR-median idea follows this list.)**
  2. If we do average over time, how do we actually make predictions for the ungaged sites? i.e., how do we summarize the LULC for those basins? Over which years? One option is to just pick an arbitrary range of years. **Does consolidating the POR to something common over the study area eliminate this problem? If that is the case, then the LULC values would be handled in the same way as item 1 - median.**
  3. Do we want to combine or drop any of the LULC classes? **I would think combining some would be good, but we have to be mindful to keep categories that may be useless in some basins but very informative in others... for example, maybe some basins have a lot of ag but not much if any wetland, whereas other basins close to the coast have the opposite.**
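
As referenced in item 1, a rough sketch of computing a per-site median of each LULC class over the same POR used for the flow statistic; the lulc_by_site_year and por tables and their columns are hypothetical:

library(dplyr)

# hypothetical inputs:
#   lulc_by_site_year: SITE_NO, year, plus one column per LULC class
#   por: SITE_NO, por_start, por_end (period used for the flow statistic)
lulc_por_median <- lulc_by_site_year %>%
  inner_join(por, by = "SITE_NO") %>%
  filter(year >= por_start, year <= por_end) %>%  # keep only each site's POR
  select(-year, -por_start, -por_end) %>%
  group_by(SITE_NO) %>%
  summarize_all(median, na.rm = TRUE)             # POR median of each class
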
ghost commented 6 years ago

I am finally back. Reading through the thread, I don't have much to add at this point. But I will comment on the following: SCOTT SAYS: If we aggregate the LULC data, then what "_year" would follow? The years? Like mean_forest_1970_2010? ASQUITH: My thinking is that column appending like this is usually a good idea. It acts to self-document without embedding such content in attributes().
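
For instance, a tiny sketch of that kind of self-documenting suffix (the object and period here are just placeholders):

# append the summarized period to the column names (object and years are placeholders)
lulc_mean <- data.frame(mean_forest = 42.1, mean_developed = 8.3)
names(lulc_mean) <- paste0(names(lulc_mean), "_1970_2010")
names(lulc_mean) # "mean_forest_1970_2010" "mean_developed_1970_2010"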

SCOTT SAYS: Changes in LULC sounds cool to me, but what does that mean in terms of the LHS of the equation (i.e., flowstats)? For example, how does a 7Q10 that was calculated from 1970-2010 relate to a change in LULC over the same time period? It seems like the LHS would need to be a change as well. Similar to an indexed financial measurement (or climate) where everything is chained to some year and measured as a deviation from that year. ASQUITH: A possibility is to estimate the 7Q10 in a different way via a GAMLSS. Given an LHS of annual 7-day minimums, estimate a model:

GAM <- gamlss(LHS ~ pb(LULC, df=2), sigma.fo = ~pb(LULC, df=2), nu.fo = ~1,
              family = GG(mu.link = "log", sigma.link = "log", nu.link = "identity"))

Actually the code is more complicated, but we simultaneously estimate the location, scale, and shape of a 3-parameter generalized gamma (Pearson III) and estimate the 0.90 nonexceedance probability (10-year event). This is a continuous estimation of the first three moments of the data. So if you have changing 7Q10 and changing LULC, then we can test that in a setting that feels "regression like."
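
A rough, self-contained sketch of that idea with the gamlss package; the ann7min data frame, its q7 and LULC columns, and the LULC grid are hypothetical, and the probability follows the comment above:

library(gamlss)  # gamlss(), pb(), GG family, predictAll(); qGG() from gamlss.dist

# hypothetical data: one row per site-year, q7 = annual 7-day minimum flow,
# LULC = percent of some land-cover class for that year
fit <- gamlss(q7 ~ pb(LULC, df = 2),
              sigma.fo = ~ pb(LULC, df = 2),
              nu.fo    = ~ 1,
              family   = GG(mu.link = "log", sigma.link = "log", nu.link = "identity"),
              data     = ann7min)

# predict the fitted GG parameters along a grid of LULC values, then evaluate
# the quantile at the nonexceedance probability quoted above for the 10-year event
newd <- data.frame(LULC = seq(min(ann7min$LULC), max(ann7min$LULC), length.out = 50))
par3 <- predictAll(fit, newdata = newd, type = "response")
q10  <- qGG(0.90, mu = par3$mu, sigma = par3$sigma, nu = par3$nu) # 7Q10-like curve vs LULC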

scworland commented 6 years ago

We took a completely different approach using NLDI. See nldi_funs. The lead for this effort is Dave Blodgett (USGS).