Extracting daily (summertime) data from 1000 points over 10 GCMs

Hi Meaghan,

Thanks for the question and sharing your location data. From what you shared you are looking to extract data for 1012 stations, for 5 parameters, 10 GCMs, 107 days over 79 years. A total of 385,330,005 records! This is definitely a TON of data.

What is fortunate about your locations are that they are pretty dense, therefore I suggest using the areal extraction and NOT point extraction to get what you want. This will allow for more efficient climateR calls and the point values can be extracted as a post-processing step from the returned raster stacks.

To get a bounding area, the AOI package can be called on you points:

library(climateR)
library(sf)
library(raster)

load('./points.RData') # loads your point data called 'samp'

samp

class       : SpatialPointsDataFrame 
features    : 1012 
extent      : -83.84168, -80.81692, 34.56804, 37.29032  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
variables   : 11

p = AOI::getBoundingBox(samp) %>% st_as_sf() # convert to bounding box

plot(samp)
plot(p, add = T)

With this extent the below example will get summer-time data for 3 years, 10 GCMs, 5 variables for all of your locations:

# Define GCMs
gcms = c("bcc-csm1-1-m","CanESM2","CNRM-CM5", "CSIRO-Mk3-6-0", "GFDL-ESM2M","HadGEM2-ES365", "HadGEM2-CC365","IPSL-CM5A-MR", "MIROC5","MIROC-ESM-CHEM", "CCSM4")

#Define Parameters
params = c('tmax', 'tmin', 'rhmax', 'rhmin', 'prcp')

# Create a vector of GCM/parameter pairs that match climateR outputs
g.fin = levels(interaction(gcms,params,sep="_"))

# Create an empty list equal in size to the GCM/parameter pairs. We will fill this in a loop:
mylist <- setNames(vector("list", length = NROW(g.fin)), g.fin)

system.time({

for(y in 2020:2022){  
  yy = getMACA(AOI = p,
          model = gcms,
          param = params,
          startDate = paste0(y, "-05-01"),
          endDate   = paste0(y, "-08-15")
  )

    for(i in 1:length(mylist)){
      maca.index =  which(names(yy) == g.fin[i])
      list.index =  which(names(mylist) == g.fin[i])

      # extract point level values and tranpose matrix so that dates are rows and points are columns
      xt = t(raster::extract(yy[[maca.index]], points)) 

      # Convert to data.frame and add date column from rasterStack layer names
      r = data.frame(gsub("X", "", gsub("\\.", "-", names(yy[[maca.index]]))), xt, stringsAsFactors = F)

      # set column names to the point id in your origional dataset
      colnames(r) = c('dates', paste0("X.", samp$pt_ids))
      rownames(r) = NULL

      # Add (bind) data to the appropiate element of 'mylist'
      mylist[[list.index]]  <- rbind(mylist[[list.index]], r)
    }

  message("Finished processing year ", y)

}
})

   user  system elapsed 
 75.564  22.670 152.149

Each year takes about ~50 seconds to process and should scale pretty linearly allowing you to process ~ 95,640 values per second, and your whole data request in ~1 - 1.5 hours

The result of this loop is a list of data.frames as shown below:

str(mylist, max.level = 1)
List of 50
 $ bcc-csm1-1-m_prcp   :'data.frame':   321 obs. of  1013 variables:
 $ CanESM2_prcp        :'data.frame':   321 obs. of  1013 variables:
 $ CNRM-CM5_prcp       :'data.frame':   321 obs. of  1013 variables:
 $ CSIRO-Mk3-6-0_prcp  :'data.frame':   321 obs. of  1013 variables:
 $ GFDL-ESM2M_prcp     :'data.frame':   321 obs. of  1013 variables:
 $ HadGEM2-CC365_prcp  :'data.frame':   321 obs. of  1013 variables:
 $ HadGEM2-ES365_prcp  :'data.frame':   321 obs. of  1013 variables:
 $ IPSL-CM5A-MR_prcp   :'data.frame':   321 obs. of  1013 variables:
 $ MIROC-ESM-CHEM_prcp :'data.frame':   321 obs. of  1013 variables:
 $ MIROC5_prcp         :'data.frame':   321 obs. of  1013 variables:
 $ bcc-csm1-1-m_rhmax  :'data.frame':   321 obs. of  1013 variables:
 $ CanESM2_rhmax       :'data.frame':   321 obs. of  1013 variables:
 $ CNRM-CM5_rhmax      :'data.frame':   321 obs. of  1013 variables:
 $ CSIRO-Mk3-6-0_rhmax :'data.frame':   321 obs. of  1013 variables:
 $ GFDL-ESM2M_rhmax    :'data.frame':   321 obs. of  1013 variables:
 $ HadGEM2-CC365_rhmax :'data.frame':   321 obs. of  1013 variables:
 $ HadGEM2-ES365_rhmax :'data.frame':   321 obs. of  1013 variables:

...

Each internal data.frame will have a column for the date and values for each point:

str(mylist$CanESM2_prcp, max.level = 2)
'data.frame':   321 obs. of  1013 variables:
 $ dates: chr  "2020-05-01" "2020-05-02" "2020-05-03" "2020-05-04" ...
 $ X.1  : num  5.72 0 0 13.28 1.47 ...
 $ X.2  : num  5.76 0 0 13.85 1.82 ...
 $ X.3  : num  6.04 0 0 13.98 1.99 ...
 $ X.4  : num  6.12 0 0 13.88 1.9 ...
 $ X.5  : num  7.16 0 0 14.28 1.75 ...
 $ X.6  : num  6.21 0 0 13.57 1.82 ...
 $ X.7  : num  6.51 0 0 13.94 2.03 ...

...

From there you can plot the data....


par(mfrow = c(3,1))
plot(x = as.Date(mylist$`bcc-csm1-1-m_prcp`$dates), 
     y = mylist$`bcc-csm1-1-m_prcp`$X.5, pch = 16, col = 'blue',
     main = 'Location 5 PRCP', xlab = "Dates", ylab = "PRCP (mm)")
plot(x = as.Date(mylist$`bcc-csm1-1-m_prcp`$dates), 
     y = mylist$`bcc-csm1-1-m_tmax`$X.5, pch = 16, col = 'red',
     main = 'Location 5 TMAX', xlab = "Dates", ylab = "TMAX (K)" )
plot(x = as.Date(mylist$`bcc-csm1-1-m_prcp`$dates), 
     y = mylist$`bcc-csm1-1-m_rhmax`$X.5, pch = 16, col = 'purple',
     main = 'Location 5 RHMAX', xlab = "Dates", ylab = "RHMAX")

... or write each list element to a CSV...

out_path = "./maca_climate_files/"

for(i in 1:length(mylist)){
  write.csv(mylist[[i]], file = paste0(out_path, names(mylist)[i], ".csv"))
}

Hope that helps, and thanks for pushing these functions to the limit!

Mike

mikejohnson51 / climateR

Extracting daily (summertime) data from 1000 points over 10 GCMs #6