ua-snap / shiny-apps

R Shiny apps

CMIP3 / CMIP5 app - Data extraction and addition of new statistics #89

Closed leonawicz closed 9 years ago

leonawicz commented 10 years ago

In addition to the spatial means extracted for each region from the downscaled data for this app, consider extracting various quantiles of the spatial distribution and perhaps measures of dispersion as well.

  1. With the transfer of data files out of the app and onto a server where they are loaded on demand, the app is fully scalable. The addition of data is welcome.
  2. Making the app more than yet another exercise in looking too closely at means, and means alone, may be the single biggest gain in value the app could achieve.

For "point" data (cities, represented by the pixel in which their coordinates fall), none of this would apply. However, data can be extracted from circular buffers of various radii around cities, and the statistics above could then be computed for those areas as well. The results would be more volatile than for regions, but still interesting. When a city is selected, the user could choose the buffer radius in km. If no buffer is used, things can remain as they are.
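A minimal sketch of the buffer idea, in Python for illustration (the app and its extraction scripts are R). It assumes the grid is available as cell-center latitude/longitude pairs with a value each, which may not match how the downscaled rasters are actually stored; `cells_in_buffer` and the toy coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cells_in_buffer(cells, city_lat, city_lon, radius_km):
    """Values of grid cells whose centers lie within radius_km of the city.

    cells: iterable of (lat, lon, value) tuples (assumed layout).
    """
    return [value for lat, lon, value in cells
            if haversine_km(city_lat, city_lon, lat, lon) <= radius_km]


# Made-up toy grid around one city location.
cells = [
    (64.84, -147.72, 1.0),  # the city's own cell
    (64.90, -147.60, 2.0),  # a nearby cell, under 10 km away
    (65.50, -147.72, 3.0),  # a distant cell, over 70 km away
]
near = cells_in_buffer(cells, 64.84, -147.72, 10)
wide = cells_in_buffer(cells, 64.84, -147.72, 100)
```

The values returned for a given radius could then be summarized the same way as the cells of a region.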

The use of buffers would introduce some complications when working with city data. Selecting a city would yield a data frame subset with an additional factor, buffer radius, that has multiple levels. This new categorical variable would have to be handled in various places in the app code, while no such factor would be present when working with regions, because they have a fixed area.
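To make the asymmetry concrete, here is a small Python sketch (not the app's R code) of a subset function that has to cope with a factor that exists only for cities. The column names `Location` and `Buffer_km`, and the convention that a radius of 0 means "no buffer", are assumptions for illustration.

```python
def subset_location(rows, location, buffer_km=0):
    """Select rows for one location.

    City rows carry a 'Buffer_km' factor (0 = containing pixel only, an
    assumed convention); region rows have no such column, so the
    buffer_km argument is ignored for them.
    """
    selected = []
    for row in rows:
        if row["Location"] != location:
            continue
        if "Buffer_km" in row and row["Buffer_km"] != buffer_km:
            continue
        selected.append(row)
    return selected


# Made-up rows: one region (no buffer column) and one city at two radii.
rows = [
    {"Location": "Region A", "Year": 2050, "Value": 1.5},
    {"Location": "City B", "Year": 2050, "Buffer_km": 0, "Value": 2.0},
    {"Location": "City B", "Year": 2050, "Buffer_km": 25, "Value": 2.4},
]
region_rows = subset_location(rows, "Region A")
city_rows = subset_location(rows, "City B", buffer_km=25)
```

Every place in the app that groups, filters, or labels data would need a similar branch, which is the integration cost described above.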

Additional quantiles and metrics extracted for regions would not be difficult to integrate on their own, however, because this option could easily be restricted to regions and not made available to cities, and for the most part the app code could remain as is. The only change would be the addition of a data selection menu offering multiple metrics, as opposed to implicitly providing only the spatial mean for a region. If the user can only choose one metric at a time, they are simply generating a data frame of the same format, but with different values in the Value column.
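A rough Python illustration of why one metric at a time keeps things simple (the app itself is R). It assumes the extracted statistics are stored in long format with a hypothetical `Stat` column next to `Value`; selecting a metric then drops that column and leaves the shape the app already expects.

```python
def select_metric(rows, stat):
    """Keep rows for one statistic and drop the 'Stat' column.

    The result has the same columns whichever statistic is chosen;
    only the numbers in 'Value' differ.
    """
    return [{key: val for key, val in row.items() if key != "Stat"}
            for row in rows if row["Stat"] == stat]


# Made-up long-format rows for a single region and year.
rows = [
    {"Region": "Region A", "Year": 2050, "Stat": "Mean", "Value": 1.5},
    {"Region": "Region A", "Year": 2050, "Stat": "Pct_95", "Value": 4.2},
    {"Region": "Region A", "Year": 2051, "Stat": "Mean", "Value": 1.7},
    {"Region": "Region A", "Year": 2051, "Stat": "Pct_95", "Value": 4.6},
]
means = select_metric(rows, "Mean")
p95 = select_metric(rows, "Pct_95")
```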

The only edge cases to be dealt with would be things like labeling plots properly, for instance with "95th percentile Temperature" instead of a hardcoded "Mean Temperature". This does open the door to some hard-to-interpret plots. For example, "Mean Precipitation" in a bar plot is actually a mean (over whatever variables the user has collapsed their data) of an extracted spatial mean, i.e. a mean of means. It could instead be, say, a mean of standard deviations, a mean of medians, or a mean of 95th percentile precipitation levels in the region. When it is just a mean of means, users can get away with being lazy about what the data actually are: "It's a mean." When it is not just layers of means, users will have to pay more attention and really think about what they are graphing. That should be a good thing, but it does mean it will be possible to plot more data that can be hard to interpret.
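One way to keep labels honest is to build them from the selected statistic rather than hardcoding "Mean". The Python sketch below is only an illustration of that idea (the app is R); the statistic codes `Mean`, `SD`, and `Pct_NN`, and the function names, are hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 95 -> '95th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def stat_phrase(stat):
    """Human-readable phrase for a statistic code (codes are assumed)."""
    if stat == "Mean":
        return "Mean"
    if stat == "SD":
        return "Standard deviation of"
    if stat.startswith("Pct_"):
        return f"{ordinal(int(stat[4:]))} percentile"
    raise ValueError(f"unknown statistic: {stat}")


def plot_label(stat, variable, collapsed=False):
    """Axis label; 'collapsed' marks a mean taken over other variables."""
    label = f"{stat_phrase(stat)} {variable}"
    if collapsed:
        # Spell out the extra layer, e.g. a mean of extracted spatial means.
        label = "Mean of " + label[0].lower() + label[1:]
    return label
```

With this, `plot_label("Pct_95", "Temperature")` gives "95th percentile Temperature", and `plot_label("Mean", "Precipitation", collapsed=True)` gives "Mean of mean Precipitation", which makes the layering explicit.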

I would experiment with this for the relatively small number of regions and restrict the new option to regions, ignoring the large number of cities for now. This will also require some enhancements to the data extraction code, which are definitely overdue.

After this is done, I can consider the buffer option for cities and how that may affect the app with the introduction of another factor.

A final consideration, and a potential ultimate achievement, would be that if enough quantiles can be extracted, it becomes possible to estimate the spatial distribution at a single time point. Better still would be to extract a small representative sample, large enough to make a histogram as well as a density curve directly rather than through less robust reconstructions from a handful of extracted quantiles. This would definitely be more data heavy. Storage would not be an issue, and data would still be loaded on demand. But it would require some thought into how to sensibly restrict the simultaneous combined look at spatial and temporal distributions. The app looks essentially at temporal distributions of data which are collapsed across space (for now, by way only of the mean). Adding the spatial component would be a boon, but it is reasonable to expect to look at histograms and density curves of spatial distributions of temperature and precipitation across time.
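To show what a reconstruction from a handful of quantiles looks like, and why it is the less robust route, here is a Python sketch (not part of the app) that turns extracted quantiles into a piecewise-constant density: each pair of adjacent quantiles becomes a bin whose height is the probability mass between them divided by the bin width. The function name and the example values are made up.

```python
def density_from_quantiles(probs, values):
    """Piecewise-constant density implied by a set of quantiles.

    probs:  increasing probabilities, e.g. 0.05 ... 0.95
    values: the corresponding quantile values (non-decreasing)
    Returns a list of (left, right, density) bins.
    """
    bins = []
    for i in range(len(probs) - 1):
        width = values[i + 1] - values[i]
        if width <= 0:
            continue  # tied quantiles give a zero-width bin; skip it
        mass = probs[i + 1] - probs[i]
        bins.append((values[i], values[i + 1], mass / width))
    return bins


# Made-up quantile values for one region at one time point.
probs = [0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95]
values = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
bins = density_from_quantiles(probs, values)
covered = sum((right - left) * dens for left, right, dens in bins)
```

Note that `covered` is only about 0.90 here: nothing is known below the 5th or above the 95th percentile, and the shape inside each bin is a guess. A direct sample of cell values avoids both limitations.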

leonawicz commented 10 years ago

Extraction script updates are in progress: #95. These will now provide additional statistics (currently the spatial mean, standard deviation, and the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles). They will also provide spatial distributions rather than just single-value aggregate statistics: #96.
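For reference, the statistics listed above can be computed from the cell values of one region at one time point roughly as follows. This is a Python standard-library sketch, not the extraction scripts from #95; the function name, the `Pct_NN` keys, and the choice of the "inclusive" quantile method are assumptions.

```python
import statistics

PERCENTILES = (5, 10, 25, 50, 75, 90, 95)


def spatial_stats(cell_values):
    """Mean, standard deviation, and selected percentiles of cell values."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles,
    # so the p-th percentile sits at index p - 1.
    cuts = statistics.quantiles(cell_values, n=100, method="inclusive")
    stats = {
        "Mean": statistics.mean(cell_values),
        "SD": statistics.stdev(cell_values),  # sample standard deviation
    }
    for p in PERCENTILES:
        stats[f"Pct_{p}"] = cuts[p - 1]
    return stats


# Toy input: 101 evenly spaced cell values, 0 through 100.
stats = spatial_stats(list(range(101)))
```

Each key of the returned dictionary would correspond to one choice in a metric selection menu.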