rbchan / unmarked

R package for hierarchical models in ecological research
https://rbchan.github.io/unmarked/

`unmarkedFrame` class for continuous-time models (`unmarkedFrameContinuous`) #269

Open leapautrel opened 6 months ago

leapautrel commented 6 months ago

Hi! For continuous-time models, we need a new unmarkedFrame class. I thought that we could create a subclass of unmarkedFrame (e.g. unmarkedFrameContinuous) as a base, and then add subclasses of this class for specific models. Here are my proposed specifications for this class.

What data do we need?

For each detection event, we absolutely need:

Depending on the model, we could also need other information: the species, the season, etc.

Data provided by the user

In my proposal, here are the data the user should provide to create an unmarkedFrameContinuous object. I split them into several data frames, as this seems the most logical and safe approach (relational database-like), and it is how sensor data are organised in the tools I know of (such as the camtrapR R package and the Wildlife Insights exports).

obsData

obsData contains the observation data. I do not call it y because it does not match the format of y in other unmarkedFrame objects. It can also contain covariates that are recorded at the time of the observation, such as the temperature, often measured by camera traps. For example:

  site deployment             obstime  species season temperature
1    A          1 2023-12-12 08:23:42 Roe deer      1           9
2    A          1 2023-12-12 11:41:53 Roe deer      1          12
4    A          2 2023-12-12 14:42:18 Roe deer      1          21
3    A          2 2023-12-13 05:35:13 Roe deer      1          13
5    B          1 2023-12-12 15:17:34 Roe deer      1          17
7    B          1 2023-12-12 18:36:32 Roe deer      1          16
6    B          1 2023-12-13 07:09:02 Roe deer      1          17

There is one row per observation (i.e. per detection event), described in my example by three mandatory columns: site, deployment and obstime.

Other optional columns could be mandatory for certain types of models:

And columns for detection covariates recorded at the time of the observation. (:question: Although, can we even integrate this information in CT models??)

siteData

For example:

  site   habitat      elev
1    A    Forest 0.8497861
2    B Grassland 1.5632035
3    C      City 0.4787604

deploymentData

For example:

  site deployment           begintime             endtime    lat   lon camtrap_model camtrap_height
1    A          1 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076        Brand1           2.33
2    A          2 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076        Brand2           1.44
3    B          1 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092        Brand1           1.00
4    B          2 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092        Brand2           1.00
5    C          1 2023-12-12 08:00:00 2023-12-13 18:00:00 43.426 2.034        Brand1           1.28
6    C          2 2023-12-12 08:00:00                <NA> 43.426 2.034        Brand2           1.17
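
Because the three tables are linked by keys (site, and site + deployment), the function that builds the object could verify that they are consistent with each other. The following is only a minimal sketch in base R, assuming the column names used in the examples above; `check_integrity()` is a hypothetical helper, not an existing unmarked function, and treating a missing endtime as "still running" is an assumption.

```r
# Hypothetical helper: checks that the three proposed tables are consistent.
check_integrity <- function(obsData, siteData, deploymentData) {
  # Every deployment must belong to a known site
  stopifnot(all(deploymentData$site %in% siteData$site))

  # Each (site, deployment) pair must appear only once in deploymentData
  dep_key <- paste(deploymentData$site, deploymentData$deployment)
  stopifnot(!anyDuplicated(dep_key))

  # Every observation must refer to a known deployment
  obs_key <- paste(obsData$site, obsData$deployment)
  idx <- match(obs_key, dep_key)
  stopifnot(!anyNA(idx))

  # Every observation must fall inside its deployment period
  # (assumption: a missing endtime means the deployment is still running)
  begin <- deploymentData$begintime[idx]
  end   <- deploymentData$endtime[idx]
  stopifnot(all(obsData$obstime >= begin))
  stopifnot(all(is.na(end) | obsData$obstime <= end))

  invisible(TRUE)
}

tz <- "UTC"
siteData <- data.frame(site = c("A", "B", "C"))
deploymentData <- data.frame(
  site = c("A", "A", "B"),
  deployment = c(1, 2, 1),
  begintime = rep(as.POSIXct("2023-12-12 08:00:00", tz = tz), 3),
  endtime = as.POSIXct(c("2023-12-13 18:00:00", NA, "2023-12-13 18:00:00"),
                       tz = tz)
)
obsData <- data.frame(
  site = c("A", "A", "B"),
  deployment = c(1, 2, 1),
  obstime = as.POSIXct(c("2023-12-12 08:23:42", "2023-12-12 14:42:18",
                         "2023-12-12 15:17:34"), tz = tz)
)

# Stops with an error if any of the checks fails
check_integrity(obsData, siteData, deploymentData)
```

In a real constructor, each check would probably deserve its own informative error message rather than a bare `stopifnot()`.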

Detection covariates

This is the part I'm least convinced by. It has lots of flaws, but I do not have a better idea for now. I'm also not fully comfortable with how to integrate detection covariates in CT models, so I've probably missed important things.

obsCovsContinuous (optional)

For continuous-time covariates (e.g. temperature, hygrometry) that can be measured at time t.

For example:

   site deployment                time temperature hygrometry
1     A          1 2023-12-12 08:00:00       19.80      77.09
2     A          1 2023-12-12 08:10:00       11.24         NA
3     A          1 2023-12-12 08:20:00       14.66         NA
4     A          1 2023-12-12 08:30:00       13.01         NA
5     A          1 2023-12-12 08:40:00       11.57         NA
6     A          1 2023-12-12 08:50:00       21.60         NA
7     A          1 2023-12-12 09:00:00        5.73      49.08
8     A          1 2023-12-12 09:10:00       14.90         NA
9     A          1 2023-12-12 09:20:00       14.14         NA
10    A          1 2023-12-12 09:30:00       22.17         NA
 [ reached 'max' / getOption("max.print") -- omitted 1184 rows ]

obsCovsBinned (optional)

For observation covariates that are not measured in continuous time but binned (e.g. rainfall, which is necessarily measured over an interval of time). Other environmental covariates can also affect detection; if the sampling plan did not include sensors capable of measuring them, they can usually be retrieved from other data suppliers, often by day or by hour.

For example:

   site deployment           begintime             endtime hourly_rainfall
1     A          1 2023-12-12 08:00:00 2023-12-12 09:00:00            0.10
2     A          1 2023-12-12 09:00:00 2023-12-12 10:00:00            0.10
3     A          1 2023-12-12 10:00:00 2023-12-12 11:00:00            0.00
4     A          1 2023-12-12 11:00:00 2023-12-12 12:00:00            0.04
5     A          1 2023-12-12 12:00:00 2023-12-12 13:00:00            0.14
 [ reached 'max' / getOption("max.print") -- omitted 199 rows ]
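
To illustrate how a binned covariate could be attached to a detection, here is a minimal sketch in base R that looks up the bin containing each detection time. `bin_value()` is a hypothetical helper; it assumes the bins of a single deployment are sorted, do not overlap, and are half-open intervals [begintime, endtime).

```r
# Hypothetical helper: value of a binned covariate at each detection time,
# for a single deployment. Bins are assumed sorted and non-overlapping,
# and treated as half-open intervals [begintime, endtime).
bin_value <- function(obstime, bins, covariate) {
  t <- as.numeric(obstime)
  # Index of the last bin that starts at or before each detection
  i <- findInterval(t, as.numeric(bins$begintime))
  i[i == 0] <- NA  # detection before the first bin
  value <- bins[[covariate]][i]
  # Detections at or after the end of their candidate bin are not covered
  outside <- !is.na(i) & t >= as.numeric(bins$endtime[i])
  value[outside] <- NA
  value
}

begintime <- seq(as.POSIXct("2023-12-12 08:00:00", tz = "UTC"),
                 by = "hour", length.out = 5)
bins <- data.frame(
  begintime = begintime,
  endtime = begintime + 3600,
  hourly_rainfall = c(0.10, 0.10, 0.00, 0.04, 0.14)
)
obstime <- as.POSIXct(c("2023-12-12 08:23:42", "2023-12-12 11:41:53",
                        "2023-12-12 14:00:00"), tz = "UTC")

bin_value(obstime, bins, "hourly_rainfall")  # 0.10 0.04 NA
```

The last detection falls after the final bin, so it gets NA rather than a silently recycled value.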

:question: Things I don't like about this format

So if you have other ideas for a format for detection covariates that is both user-friendly and possible to integrate into models, that would be great!

Compatibility

With the unmarkedFrame parent class

We only need to create a y matrix. This can be the number of detections per deployment (column) for each site (row). It is not provided by the user but is created automatically by the function that builds the unmarkedFrameContinuous object.

  y.1 y.2
A   2   2
B   3   0
C   0   0
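
A minimal sketch of how this y matrix could be derived, in base R. It assumes the site and deployment columns from the examples above and that every site has the same set of deployments; it is an illustration, not the package's implementation.

```r
# Site and deployment of each detection event (as in the obsData example)
obsData <- data.frame(
  site = c("A", "A", "A", "A", "B", "B", "B"),
  deployment = c(1, 1, 2, 2, 1, 1, 1)
)
# All deployments, including those without any detection
deploymentData <- data.frame(
  site = rep(c("A", "B", "C"), each = 2),
  deployment = rep(1:2, times = 3)
)

# Use the levels found in deploymentData so that sites and deployments
# without any detection still get a row / column filled with zeros
sites <- sort(unique(deploymentData$site))
deps  <- sort(unique(deploymentData$deployment))
counts <- table(
  factor(obsData$site, levels = sites),
  factor(obsData$deployment, levels = deps)
)

# Convert the contingency table to a plain integer matrix
y <- matrix(as.integer(counts), nrow = length(sites),
            dimnames = list(sites, paste0("y.", deps)))
y
```

If some sites had fewer deployments than others, the corresponding cells would presumably need to be NA instead of 0, which this sketch does not handle.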

With other packages and tools

I think the dataframes obsData, siteData and deploymentData are easily compatible with other packages (e.g. camtrapR) and tools (e.g. Wildlife Insights exports). I don't know of formats that use detection covariates in continuous time.

kenkellner commented 6 months ago

Thanks, this is great. I'm still thinking it through but here are a few thoughts.

Just trying to poke around the model ease-of-use/flexibility trade-off here.

leapautrel commented 5 months ago

I based my proposal on relational database principles, focusing on data integrity with no duplicates, so I'm not surprised it's not the most user-friendly, and I absolutely agree that we should find a better trade-off between ease of use and flexibility!

Regarding siteCovs and deploymentData

I really focused on separating covariates impacting the ecological process (e.g. occupancy) from covariates impacting the detection process, like it's done with other unmarkedFrame objects. That's why I separated siteData (occupancy covariates) from deploymentData (detection covariates).

Combining them could be done if we include some data integrity checks that depend on the model, so if you think that splitting occupancy from detection covariates is not necessary, and that this format is more user-friendly for unmarked users, I think this is a good idea.

The check I have in mind is for static occupancy models, where the occupancy state is constant within each site, as those are the ones I'm looking to implement. $\psi$, the occupancy probability, depends on occupancy covariates (given by the user in the psiformula argument), e.g. habitat and elev here. Occupancy is associated with a site. Therefore, in those models we should check that the occupancy covariates for a given site are always the same. This is not necessarily true for all models: in dynamic occupancy models, occupancy covariates could change over time.
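
Such a check could look like the sketch below, in base R. `varying_within_site()` is a hypothetical helper, and the combined table is assumed to have one row per deployment with a site column; the covariate names are taken from the examples in this thread.

```r
# Hypothetical helper: names of the covariates that take more than one
# value within at least one site (these would be invalid occupancy
# covariates for a static occupancy model)
varying_within_site <- function(surveyData, covariates) {
  varies <- vapply(covariates, function(v) {
    n_values <- tapply(surveyData[[v]], surveyData$site,
                       function(x) length(unique(x)))
    any(n_values > 1)
  }, logical(1))
  covariates[varies]
}

surveyData <- data.frame(
  site = c("A", "A", "B", "B", "C", "C"),
  deployment = c(1, 2, 1, 2, 1, 2),
  habitat = c("Forest", "Forest", "Grassland", "Grassland", "City", "City"),
  elev = c(0.8497861, 0.8497861, 1.5632035, 1.5632035, 0.4787604, 0.4787604),
  camtrap_height = c(2.33, 1.44, 1.00, 1.00, 1.28, 1.17)
)

# Covariate names can be extracted from a formula with all.vars()
occ_covs <- all.vars(~ habitat + elev)
varying_within_site(surveyData, occ_covs)                     # character(0)
varying_within_site(surveyData, c("elev", "camtrap_height"))  # "camtrap_height"
```

The constructor, or the fitting function once it knows the psiformula, could raise an error whenever the returned vector is not empty.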

Regarding obsCovsBinned and obsCovsContinuous

Definitely, I think this would make the data input much simpler. Your suggestion made me realize how to address an issue with my previous proposal: I didn't explicitly explain how to link detection events to observation covariates when they don't occur at the exact same time. We could do this by adding an argument to the unmarkedFrame constructor function that specifies, for each observation covariate, how to link a detection time to its values. I thought of 3 options from which the user could choose: "before", "after", "linear", as in the example below (with 2 detections at 08:40 and 09:15, and temperature measured each hour):

[image: the "before", "after" and "linear" linking options, illustrated for 2 detections at 08:40 and 09:15 with temperature measured each hour]

"before" and "after" would be suited for binned covariates, and "linear" for continuous covariates. The user could specify this in the function that creates the unmarkedFrame, with a named vector for example: c("hourly_rainfall"="before", "temperature"="linear", "hygrometry"="linear").

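The three options could be sketched with `stats::approx()` in base R. This is only an illustration under my reading of the options: "before" takes the last value measured at or before the detection, "after" takes the next measured value, and "linear" interpolates between the two. `link_covariate()` is a hypothetical helper and the temperature values are made up.

```r
# Hypothetical helper: value of one covariate at the detection times,
# for a single deployment. 'link' is one of "before", "after", "linear".
link_covariate <- function(obstime, time, value,
                           link = c("before", "after", "linear")) {
  link <- match.arg(link)
  keep <- !is.na(value)  # ignore times at which nothing was measured
  x <- as.numeric(time[keep])
  y <- value[keep]
  xout <- as.numeric(obstime)
  switch(link,
    # f = 0: step function taking the value on the left (previous measure)
    before = approx(x, y, xout, method = "constant", f = 0)$y,
    # f = 1: step function taking the value on the right (next measure)
    after  = approx(x, y, xout, method = "constant", f = 1)$y,
    linear = approx(x, y, xout, method = "linear")$y
  )
}

tz <- "UTC"
time <- as.POSIXct(c("2023-12-12 08:00:00", "2023-12-12 09:00:00",
                     "2023-12-12 10:00:00"), tz = tz)
temperature <- c(10, 14, 12)  # made-up hourly values
obstime <- as.POSIXct(c("2023-12-12 08:40:00", "2023-12-12 09:15:00"),
                      tz = tz)

link_covariate(obstime, time, temperature, "before")  # 10 14
link_covariate(obstime, time, temperature, "after")   # 14 12
link_covariate(obstime, time, temperature, "linear")  # 12.67 13.50 (rounded)
```

With the default settings of `approx()`, detections outside the range of measurement times get NA, which seems a safer default than extrapolating.
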
Summary of the unmarkedFrameContinuous data with these updates

$obsData
  site deployment             obstime  species
1    A          1 2023-12-12 08:23:42 Roe deer
2    A          1 2023-12-12 11:41:53 Roe deer
4    A          2 2023-12-12 14:42:18 Roe deer
3    A          2 2023-12-13 05:35:13 Roe deer
5    B          1 2023-12-12 15:17:34 Roe deer
7    B          1 2023-12-12 18:36:32 Roe deer
6    B          1 2023-12-13 07:09:02 Roe deer

$surveyData
  site deployment   habitat      elev           begintime             endtime    lat   lon camtrap_model camtrap_height season
1    A          1    Forest 0.8497861 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076        Brand1           2.33      1
2    A          2    Forest 0.8497861 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076        Brand2           1.44      1
3    B          1 Grassland 1.5632035 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092        Brand1           1.00      1
4    B          2 Grassland 1.5632035 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092        Brand2           1.00      1
5    C          1      City 0.4787604 2023-12-12 08:00:00 2023-12-13 18:00:00 43.426 2.034        Brand1           1.28      1
6    C          2      City 0.4787604 2023-12-12 08:00:00                <NA> 43.426 2.034        Brand2           1.17      1

$timeData
   site deployment                time temperature hygrometry hourly_rainfall
1     A          1 2023-12-12 08:00:00       21.56      31.66             0.1
2     A          1 2023-12-12 08:10:00       13.67         NA              NA
3     A          1 2023-12-12 08:20:00       17.72         NA              NA
4     A          1 2023-12-12 08:30:00       12.93         NA              NA
5     A          1 2023-12-12 08:40:00       12.62         NA              NA
6     A          1 2023-12-12 08:50:00       11.06         NA              NA
7     A          1 2023-12-12 09:00:00       12.03      56.78             0.1
...

$timeDataLink
hourly_rainfall     temperature      hygrometry 
       "before"       "linear"        "linear" 

Do you think this format is better and more user-friendly?