phenology / springtime

Spatiotemporal phenology research with interpretable models
https://springtime.readthedocs.io
Apache License 2.0

Explore fetching data from plant phenology database #2

Closed by Peter9192 1 year ago

Peter9192 commented 1 year ago

A really useful resource for phenology data is http://plantphenology.org/.

The data are also available through a REST API (https://github.com/biocodellc/biscicol-server) and an R package is available to build a request: https://github.com/ropensci/rppo/blob/master/R/ppo_data.R. See documentation here: https://docs.ropensci.org/rppo/reference/rppo-package.html

It would be nice to see if we can build a similar, simple Python script that uses this REST API to download data (ideally PEP725) for a small region in Europe, for selected parameters, etc.

Peter9192 commented 1 year ago

It would be nice if we could reproduce our local file with the following parameters:

phase_id
>> array([ 11,  60,  95,  65,  61,  93,  97,  69, 205,  15,  19, 203,  87,
           213,  89, 209,  10, 286, 200, 210,  51,   1,   7,  59,  63,  67,
           71,  79,  85, 201])
# though perhaps only 60 and 61 are interesting for now

genus, species
'Syringa', 'Syringa vulgaris'

gss_id
>> array([2350100])

lon.max(), lon.min(), lat.max(), lat.min()
>> 30.5573, -10.2167, 66.122, 40.65

year
>> 2001 - 2021

Peter9192 commented 1 year ago

I noticed that the R package has a hardcoded source filter: params$[source] <- "USA-NPN". This might explain why we didn't find any results over Europe.

Note that there's also a get_traits function that fetches additional info from the NPN data sources based on the EventID returned by rppo_data. I suppose this only works for NPN data. https://github.com/ropensci/rppo/blob/f6d0c203bc4ef46c2c6525ad086956a770d98bde/R/ppo_traits.R#L29

Peter9192 commented 1 year ago

With #5 I can download some data, but I'm still a bit unsure how to interpret it. E.g.

In [119]: df = download(
     ...:     genus="Syringa",
     ...:     source="PEP725",
     ...:     year="[2000 TO 2021]",
     ...:     latitude="[40 TO 70]",
     ...:     longitude="[-10 TO 40]",
     ...:     termID="\"obo:PPO_0002330\"",  # flowers present
     ...:     )
Retrieving data from https://biscicol.org/api/v3/download/_search?limit=5&q=genus:Syringa+AND+source:PEP725+AND+year:[2000 TO 2021]+AND+latitude:[40 TO 70]+AND+longitude:[-10 TO 40]+AND+termID:"obo:PPO_0002330"

In [120]: df
Out[120]:
   dayOfYear  year    genus specificEpithet  latitude  longitude                                             termID  source                                            eventId
0        126  2006  Syringa        vulgaris   49.6833     9.9000  obo:PPO_0002000,obo:PPO_0002025,obo:PPO_000202...  PEP725  urn:phenologicalObservingProcess/52095389-ad71...
1        163  2006  Syringa        vulgaris   47.4167    13.6333  obo:PPO_0002000,obo:PPO_0002025,obo:PPO_000202...  PEP725  urn:phenologicalObservingProcess/a2848317-071a...
2        124  2006  Syringa        vulgaris   52.5500    13.7833  obo:PPO_0002000,obo:PPO_0002025,obo:PPO_000202...  PEP725  urn:phenologicalObservingProcess/dc1e196d-0e4d...
3        128  2006  Syringa        vulgaris   51.1833    10.7000  obo:PPO_0002000,obo:PPO_0002025,obo:PPO_000202...  PEP725  urn:phenologicalObservingProcess/36cecd3e-d139...
4        129  2006  Syringa        vulgaris   52.5833    13.5000  obo:PPO_0002000,obo:PPO_0002025,obo:PPO_000202...  PEP725  urn:phenologicalObservingProcess/8f213ffb-339f...
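The "Retrieving data from" line above suggests that the keyword arguments are joined into `field:value` terms separated by `AND`. A minimal sketch of that step follows. The endpoint is copied from the log line; `build_query_url` is a hypothetical helper, not the function added in #5, and the percent-encoding of spaces and quotes is an assumption that has not been checked against the server.

```python
from urllib.parse import quote

# Endpoint copied from the log line above; everything else here is an assumption.
BASE_URL = "https://biscicol.org/api/v3/download/_search"


def build_query_url(limit=5, **filters):
    """Join keyword filters into one query URL (hypothetical helper)."""
    # Each filter becomes "field:value"; the terms are combined with AND.
    terms = [f"{field}:{value}" for field, value in filters.items()]
    query = "+AND+".join(terms)
    # Percent-encode spaces and quotes, but keep the characters the query syntax uses.
    return f"{BASE_URL}?limit={limit}&q={quote(query, safe=':+[]')}"


url = build_query_url(
    genus="Syringa",
    source="PEP725",
    year="[2000 TO 2021]",
    termID='"obo:PPO_0002330"',  # flowers present
)
print(url)
```

Keeping the URL construction separate from the download makes it easy to inspect a query before sending it.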

My best guess is that each row describes a moment in time when someone looked at a plant. At that point, they wrote down all the characteristics that were present (the termIDs): if there were leaves, they included leaves; if there were flowers, they included flowers as well, and so on. In that case, the dayOfYear entry does not necessarily relate to the day on which these characteristics were first observed, but rather to the date of the observation. Consequently, to get the "day of first bloom", we'd have to find the earliest dayOfYear on which "flowers present" was observed, and hope that observations were frequent enough to give an accurate value. Could that be correct?
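If that reading is right, the "day of first bloom" could be derived roughly as follows. This is a sketch under that assumption only: it treats one latitude/longitude pair as one site, the rows are made up for illustration, and `obo:EXAMPLE_LEAVES` is a placeholder rather than a real PPO identifier.

```python
FLOWERS_PRESENT = "obo:PPO_0002330"  # ID taken from the query above


def first_day_with_term(rows, term):
    """Return {(latitude, longitude, year): earliest dayOfYear on which `term` was recorded}."""
    earliest = {}
    for row in rows:
        # termID is a comma-separated list of every trait recorded in that observation.
        if term not in row["termID"].split(","):
            continue
        key = (row["latitude"], row["longitude"], row["year"])
        earliest[key] = min(row["dayOfYear"], earliest.get(key, row["dayOfYear"]))
    return earliest


# Made-up rows for illustration; they are not taken from the download above.
rows = [
    {"dayOfYear": 110, "year": 2006, "latitude": 49.6833, "longitude": 9.9,
     "termID": "obo:EXAMPLE_LEAVES"},
    {"dayOfYear": 131, "year": 2006, "latitude": 49.6833, "longitude": 9.9,
     "termID": "obo:EXAMPLE_LEAVES,obo:PPO_0002330"},
    {"dayOfYear": 126, "year": 2006, "latitude": 49.6833, "longitude": 9.9,
     "termID": "obo:EXAMPLE_LEAVES,obo:PPO_0002330"},
]

first_bloom = first_day_with_term(rows, FLOWERS_PRESENT)
print(first_bloom)
```

The observation on day 110 is skipped because it lists no flowers, so the result depends entirely on how often each site was visited.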

Peter9192 commented 1 year ago

Okay, I think I finally understand. This paper shed some light:

Specifically, the PPO needed to be [...] compatible with the data and data collection methods of USA-NPN and NEON, which use status-based monitoring, and PEP725, which uses event-based monitoring.

Apparently, the makers of the PPO have chosen to stick with a status-based description. This means they had to translate the PEP725 phenotypes to the PPO traits. It's a bit unclear how they did this, but I suspect something like this: the event "first leaves present at DOY 118" was converted to a "state observation at DOY 118 where the following trait was observed: more than 0 leaves present".

Consequently, if you want to convert back from the PPO to events, you need to do the inverse, which they describe in their example case:

To estimate leafing out dates, we used all observations of plants with the PPO trait ‘true leaves present’ that did not also have the trait ‘senescing true leaves present’, and to estimate flowering dates, we used all observations of plants with the PPO trait ‘flowers present’ that did not also have the trait ‘senesced flowers present’. [...] the data were filtered to only keep the earliest relevant observation for each unique combination of grid cell and year.
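The filtering described in that quote could be sketched as below. This is not the paper's code: the function name, the record layout, and the 1-degree `cell_size` are assumptions made for illustration, traits are written as plain labels rather than PPO identifiers, and the observations are made up.

```python
import math


def earliest_per_cell_and_year(observations, required, excluded, cell_size=1.0):
    """Earliest dayOfYear per (grid cell, year), keeping only observations
    that have the `required` trait and lack the `excluded` trait."""
    earliest = {}
    for obs in observations:
        traits = obs["traits"]
        if required not in traits or excluded in traits:
            continue
        # Assumed gridding: bin coordinates into square cells of `cell_size` degrees.
        cell = (math.floor(obs["latitude"] / cell_size),
                math.floor(obs["longitude"] / cell_size))
        key = (cell, obs["year"])
        if key not in earliest or obs["dayOfYear"] < earliest[key]:
            earliest[key] = obs["dayOfYear"]
    return earliest


# Made-up observations for illustration only.
observations = [
    {"dayOfYear": 120, "year": 2006, "latitude": 52.55, "longitude": 13.78,
     "traits": {"flowers present", "senesced flowers present"}},  # dropped: excluded trait
    {"dayOfYear": 124, "year": 2006, "latitude": 52.55, "longitude": 13.78,
     "traits": {"flowers present"}},
    {"dayOfYear": 129, "year": 2006, "latitude": 52.58, "longitude": 13.50,
     "traits": {"flowers present"}},  # same cell and year, but later
    {"dayOfYear": 126, "year": 2006, "latitude": 49.68, "longitude": 9.90,
     "traits": {"flowers present"}},
]

flowering = earliest_per_cell_and_year(
    observations, "flowers present", "senesced flowers present"
)
print(flowering)
```

The same function would cover leaf-out dates by passing 'true leaves present' and 'senescing true leaves present' instead.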