lizzieinvancouver / grephon

0 stars 1 forks source link

PEP x ITRB data overlap plot #26

Closed lizzieinvancouver closed 7 months ago

lizzieinvancouver commented 1 year ago

@rdmanzanedo I think the best place to find the PEP data (which we can also re-request if we want) may be in Cat's repo here. Hopefully @cchambe12 can answer queries. ...

I think we may have always worked off data I received in 2015 ... but we asked in 2019 and they wrote back:

There is no need for any additional information as long your project comply with our data use Data Use Policy (data must not used for commercial projects).

Nevertheless it's good that you sent this mail as the download on the website is not up-to-date. You can access the current export over the link below, it's an 130MB zip - deflated ~1GB.

Not clear to me if we ever did download the new copy.

PEP data policy old link may still work ...

I'll also email you both a copy of a link the old data in case.

lizzieinvancouver commented 1 year ago

See also issue #25 ... Figure: PEP725 x ITRB (and maybe USA NPN) -- species/space/time

rdmanzanedo commented 1 year ago

Ok, first look at this and preliminary analyses witht he info that Lizzie sent me (Thanks!!):

Spatial and elevational aspect:

simplemap

First conclusions: PEP is very poorly distributed but has a lot more depth of data, very Germany, UK focussed and warm temperate (C-fb climate classification) heavy. ITRDB is more balanced in that sense but has much less spatial replication.

Ecoregions For equivalence: ecoregions_key

The areas with maximum overlap are in central Europe and Uk, as expected by their raw distributions Main_normalized_overlap And interestingly the assymetric areas (areas where one of them is well distributed but the other one isn't (so easy places to start building a higher coherence between them:

Differences

THIS DOESN'T TAKE INTO CONSIDERATION THE TAXA THAT IS IN EACH GRIDPOING

rdmanzanedo commented 1 year ago

@lizzieinvancouver @cchambe12 I am a bit confused on the file you passed me on the data part for the species, the 'pep725_data.csv' file has 4 headers but 5 columns. Which one is the species and how do you recommend aggregating that data into something simple like target species per coordinates and years covered? Do we have maybe that data available already by any chance? I have it for the ITRDB.

lizzieinvancouver commented 1 year ago

@rdmanzanedo Wow! This is cool. Thanks for all this.

As for your query ... I don't have the data you mentioned. We tend to extract a few well studied wild species and work from there, so we never analyse the FULL data much.

And, yikes -- that is annoying about the data file. My apologies. Here's how I read it in:

phendata <- read.csv2("PEP725_records/PEP725_data.csv", skip=1, header=FALSE, col.names=c("PEP_ID", "PLANT_ID", "CULT_ID", "BBCH", "YEAR", "DAY"))

In case it helps, all the related code:

### Started 15 January 2016 ###
### By Lizzie ###

## Looking at PEP725 species ###

# safety feature(s)
rm(list=ls()) 
options(stringsAsFactors=FALSE)

# set up
setwd("~/Documents/git/projects/misc/pep725/")
library(plyr)
library(ggplot2)
library(rgdal)

# which species and phasecodes show up?
phendata <- read.csv2("PEP725_records/PEP725_data.csv", skip=1, header=FALSE,col.names=c("PEP_ID", "PLANT_ID", "CULT_ID", "BBCH", "YEAR", "DAY"))

goo <- ddply(phendata, c("PLANT_ID", "CULT_ID", "BBCH"), summarise,
             N=length(PLANT_ID),
             meanyr=mean(YEAR)
             )

goo <- goo[order(-goo$N),] # order by most to least N
hist(goo$N, breaks=200)
hist(goo$N, xlim=c(0,200), breaks=20000)

# station data 
statz <- read.csv2("PEP725_records/PEP725_stations.csv", header=TRUE)

# merge in cultivar and BBCH
plantz <- read.csv2("PEP725_records/PEP725_plant.csv", header=TRUE)
bbch <- read.csv2("PEP725_records/PEP725_BBCH.csv", header=TRUE)
cult <- read.csv2("PEP725_records/PEP725_cultivar.csv", header=TRUE)

dater <- merge(goo, bbch, by.x="BBCH", by.y="bbch", all.x=TRUE)
dater <- merge(dater, plantz, by.x="PLANT_ID", by.y="plant_id", all.x=TRUE)
dater <- merge(dater, cult, by.x="CULT_ID", by.y="cult_id", all.x=TRUE)

write.csv(dater, "pep725commonspp.csv", row.names=FALSE)

dater$plantstage <- paste(dater$sci_name, dater$description)
dater <- dater[order(-dater$N),] # order again by most to least N

# okay so let's look at those with N>50

dater50 <- subset(dater, N>50) # that halves things, from 690 to 367
dater100 <- subset(dater, N>100)
dim(dater100) # hmm, only takes us down a further couple dozen: to 322

length(unique(dater$sci_name)) # 122
length(unique(dater50$sci_name)) # 79

# which BBCH stages show up across multiple species and commonly?
commonbbchsp <- aggregate(dater["N"], dater["description"], FUN=length)
commonbbch <- aggregate(dater["N"], dater["description"], FUN=sum)

commonbbchsp[order(-commonbbchsp$N),]
hist(commonbbchsp$N, breaks=40)

keepbbch <- subset(commonbbchsp, N>5)

datercommon <- dater50[which(dater50$description %in% keepbbch$description),] 
length(unique(datercommon$sci_name)) # 77, ha! Saved one species

datercommonspp <- aggregate(datercommon["N"], datercommon[c("sci_name","cult_name")],
    FUN=length)

write.csv(datercommonspp, "pep725topspp.csv", row.names=FALSE)

# make a few maps of the Fu spp.

# get the map and set the theme
wmap <- readOGR("../maps/ne_110m_land", layer="ne_110m_land")
wmap.df <- fortify(wmap)

theme.tanmap <- list(theme(panel.grid.minor = element_blank(),
                        # panel.grid.major = element_blank(),
                        panel.background = element_rect(fill = "grey90",colour = NA),
                        # plot.background = element_rect(fill=NA),
                        axis.title.x = element_blank(),
                        axis.title.y = element_blank(),
                        plot.title = element_text(size=22),
                        legend.position = "left"))

# Fu et al. 2015 species:
# Alnus glutinosa, Betula pendula, Aesculus hippocastanum, Fagus sylvatica, Tilia cordata, Quercus robur, Fraxinus excelsior

subb <- subset(phendata, PLANT_ID==108) # Fagus sylvatica
# subb <- subset(phendata, PLANT_ID==102 & CULT_ID==40) # Alnus glutinosa
# subb <- subset(phendata, PLANT_ID==101) # Aesculus hippocastanum
# subb <- subset(phendata, PLANT_ID==106 & CULT_ID==20) # Betula pendula
# subb <- subset(phendata, PLANT_ID==129 & CULT_ID==70) # Tilia cordata
# subb <- subset(phendata, PLANT_ID==111) # Quercus robur
# subb <- subset(phendata, PLANT_ID==120) # Fraxinus excelsior

subbstat <- merge(subb, statz, by="PEP_ID")
subbstatagg <- aggregate(subbstat[c("PEP_ID")], subbstat[c("BBCH", "LON", "LAT", "ALT")], FUN=length)
names(subbstatagg)[names(subbstatagg)=="PEP_ID"] <- "obs.n"
# make the lat, longs numeric (note: should look into why they are not already)
subbstatagg$lat <- as.numeric(subbstatagg$LAT)
subbstatagg$lon <- as.numeric(subbstatagg$LON)

ggplot() + 
  geom_polygon(dat=wmap.df, aes(long, lat, group=group), fill="grey80") +
  coord_cartesian(ylim=c(30, 75), xlim=c(-15, 40)) +
  geom_point(data=subbstatagg, 
             aes(x=lon, y=lat, size=obs.n, fill=BBCH), 
             colour="dodgerblue4", pch=21) +
  theme.tanmap
lizzieinvancouver commented 1 year ago

@rdmanzanedo I think if you select down to the following BBCH codes you will have early or late season events for most tree species:

bbchleaforsim <- c(7:30,101)
bbchendevents <- c(92:97)
bbchleafandend <- c(bbchleaforsim, bbchendevents)

bbch[which(bbch$bbch %in% bbchleafandend),]

We could quibble over trees that flower first will be missed, but this is a good approximation I think.

lizzieinvancouver commented 1 year ago

And the output I meant to post so you can see the stages captured:

> bbch[which(bbch$bbch %in% bbchleafandend),]
   bbch                                                   description
2     7                                        Beginning of sprouting
3    10                            First leaves seperated (mouse ear)
4    11                     Leaf unfolding (first visible leaf stalk)
5    13                                          Leaf unfolding (50%)
6    14 Leaf unfolding (4.true leaves, leaf pairs or whoris unfolded)
7    15 Leaf unfolding (5.true leaves, leaf pairs or whoris unfolded)
8    19 Leaf unfolding (9.true leaves, leaf pairs or whoris unfolded)
9    30                                           beginning of growth
31   92                                 Leaves beginning to discolour
32   94                             Autumnal coloring of leaves (50%)
33   95                                       50 % autumnal leaf fall
34   96                                     Coloring of leaves (100%)
35   97                        End of autumnal leaf fall (95% fallen)
37  101                                          25 % green in spring
lizzieinvancouver commented 1 year ago

@rdmanzanedo Any updates on this? Thanks.

rdmanzanedo commented 1 year ago

Hey, sorry, I am now on this. Updated with the new metadata of tree rings, available here: https://www.ncei.noaa.gov/pub/data/paleo/treering/ (metadata as of jan 2022)

Thanks for those Lizzie, super useful.

rdmanzanedo commented 1 year ago

Figurecomposed

rdmanzanedo commented 1 year ago

@lizzieinvancouver super useful your code, thanks! I went here more descriptive, leaving the interpolation part for later after we discussed (maybe too much?).

I have doubts about how to process the pep725 the best (why are there so many records in your list with the same species name? is it different phenophases measured? Also thought it may be worth eliminating the non-woody plants and crops from pep725. do you know if that is possible?

Let me know, once it is designed and looking pretty is easy too make small changes :)

lizzieinvancouver commented 1 year ago

@rdmanzanedo These are so cool! I find b-c especially interesting.

I have doubts about how to process the pep725 the best (why are there so many records in your list with the same species name? is it different phenophases measured?

Probably -- even once you subset to the phenophases I recommended some of the places probably try to record MULTIPLE stages of very similar phases (budburst and leafout can be 4-5 different `phases').

Also thought it may be worth eliminating the non-woody plants and crops from pep725. do you know if that is possible?

I went through code that I have and don't see anything quickly that removes these, though I found a csv where I seem to have tried to figure out which top species were domesticated or herbs or such ... I will send it along.

rdmanzanedo commented 1 year ago

Added the climate space graph: Figure_composed2

(the taxonomic one is still incorrect)

rdmanzanedo commented 1 year ago

about the taxonomic. Our assessment was correct, the different phenological states is what makes the counts: image

Now question to @cchambe12 and @lizzieinvancouver : should I: 1) assume they are redundant (they measured the same individuals in all phenophases and therefore keep 1 that I consider most relevant for us?) 2) add the total number for all phenophases of each species (add all of the aesculus)? 3) take the largest value for each species in whichever phenophase it is as the 'most complete'?

What do you think is the most useful approach given what you know of the database? (I don't think this affects the other graphs)

lizzieinvancouver commented 1 year ago

@rdmanzanedo Cool climate space graph!

I would personally go with (3) ... and would rank next (1) and then (2), though I think we could defend any of the three approaches, so if you have a preference, you could go with it.

rdmanzanedo commented 1 year ago

@lizzieinvancouver great! I think I agree. So this is the correct final figure then, using the maximum number of each species, whichever the phenophase was measured: Figure_composed2

reference for figure caption: Fig. X: Data overlap between the two major databases of growth (International Tree Ring Data Bank, ITRDB, orange) and plant phenology (Pan European Phenology Project, PEP725, blue). Both databases are compared in terms of their spatial distributions (a), temporal overlaps (b), coverage of environmental conditions in climate space (c) and taxonomical representation (d). Note that the number of tree ring chronologies in (b) are composed by multiple trees per site, typically 10-20. Climatic data from Worldclim database ver. 2.1 at 2.5° grid resolution. PEP725 records in d) show the largest records for any given phenophase per species.

Let me know of anything to add or modify! :D

lizzieinvancouver commented 7 months ago

This figure seems great, thank you @rdmanzanedo !