`unmarkedFrame` class for continuous-time models (`unmarkedFrameContinuous`)

leapautrel commented 6 months ago

Hi! For continuous-time models, we need a new unmarkedFrame class. I thought that we could create a subclass of unmarkedFrame (e.g. unmarkedFrameContinuous) as a base, and then add subclasses of this class for specific models. Here are my proposed specifications for this class.

What data do we need?

For each detection event, we absolutely need:

The time of the detection
The site id
The deployment id. A deployment = a unique spatial and temporal placement of a sensor with uninterrupted data recording. This information is useful if there are several deployments per site, e.g. the ARUs were only switched on at night (1 night = 1 deployment); the camtrap stopped for a week because its battery died; two camtraps were set up in the same site...
The beginning and the end of the deployment. If they are not known (e.g. the battery died), we can approximate them by the times of the first and last triggers (e.g. first and last photo for a camtrap), all species combined. If the first or the last detection is of the species of interest, this changes the likelihood, so we should keep this information.

Depending on the model, we could also need other information: the species, the season...
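
A minimal sketch of the approximation described above for unknown deployment bounds, in base R. The `triggers` data frame, the `target` variable and the `first_is_target` / `last_is_target` column names are hypothetical and only serve the illustration; the bounds are taken from the first and last trigger of each deployment, all species combined.

```r
# Hypothetical raw triggers, all species combined (not from a real dataset).
triggers <- data.frame(
  site = c("A", "A", "A", "B", "B"),
  deployment = c(1, 1, 1, 1, 1),
  obstime = as.POSIXct(c("2023-12-12 08:23:42", "2023-12-12 11:41:53",
                         "2023-12-13 05:35:13", "2023-12-12 15:17:34",
                         "2023-12-13 07:09:02"), tz = "UTC"),
  species = c("Fox", "Roe deer", "Roe deer", "Roe deer", "Fox")
)
target <- "Roe deer"

# One group per (site, deployment); each group is processed in time order.
groups <- split(triggers, list(triggers$site, triggers$deployment), drop = TRUE)
windows <- do.call(rbind, lapply(groups, function(g) {
  g <- g[order(g$obstime), ]
  data.frame(
    site = g$site[1],
    deployment = g$deployment[1],
    begintime = g$obstime[1],          # first trigger, any species
    endtime = g$obstime[nrow(g)],      # last trigger, any species
    # Keep track of whether a bound is itself a detection of the target.
    first_is_target = g$species[1] == target,
    last_is_target = g$species[nrow(g)] == target
  )
}))
windows
```

The two logical columns keep the information that a bound coincides with a detection of the target species, which matters for the likelihood as noted above.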

Data provided by the user

In my proposal, here are the data the user should provide to create an unmarkedFrameContinuous object. I split them into several dataframes, as this seems to be the most logical and safe approach (relational database-like) and is actually how sensor data are organised in the tools I know of (such as the camtrapR R package and the Wildlife Insights exports).

`obsData`

obsData contains the observation data. I do not call it y because it does not match the format of y in other unmarkedFrame objects. It can also contain covariates that are recorded at the time of the observation, such as the temperature, often measured by camera traps. For example:

  site deployment             obstime  species season temperature
1    A          1 2023-12-12 08:23:42 Roe deer      1           9
2    A          1 2023-12-12 11:41:53 Roe deer      1          12
4    A          2 2023-12-12 14:42:18 Roe deer      1          21
3    A          2 2023-12-13 05:35:13 Roe deer      1          13
5    B          1 2023-12-12 15:17:34 Roe deer      1          17
7    B          1 2023-12-12 18:36:32 Roe deer      1          16
6    B          1 2023-12-13 07:09:02 Roe deer      1          17

There is one row per observation (i.e. per detection event), described in my example by three mandatory columns: site, deployment and obstime.

site is a required column
deployment is required in the object, but it could be optional for the user and created by default when absent, as I think that there are many cases with only one deployment per site.
obstime is a required column
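
A minimal sketch of how these column rules could be checked, in base R. `check_obs_data` is a hypothetical helper, not an existing unmarked function, and labelling the default deployment `1` is an assumption.

```r
# Hypothetical helper (not part of unmarked): validate a user-supplied obsData.
check_obs_data <- function(obsData) {
  required <- c("site", "obstime")
  missing_cols <- setdiff(required, names(obsData))
  if (length(missing_cols) > 0) {
    stop("obsData is missing required column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  # Assumption: without a deployment column, each site has a single
  # deployment, labelled 1.
  if (!"deployment" %in% names(obsData)) {
    obsData$deployment <- 1L
  }
  # Store detection times as date-times so they can be compared later.
  if (!inherits(obsData$obstime, "POSIXct")) {
    obsData$obstime <- as.POSIXct(obsData$obstime, tz = "UTC")
  }
  obsData
}

obs <- check_obs_data(data.frame(
  site = c("A", "A", "B"),
  obstime = c("2023-12-12 08:23:42", "2023-12-12 11:41:53",
              "2023-12-12 15:17:34")
))
str(obs)
```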

Other optional columns could be mandatory for certain types of models:

species. Because sensors collect data for many species at a time, I think this column can be useful to determine the beginning/end of the deployment if it is unknown. For multi-species models, the species column can be mandatory.
season for multi-season models

And columns for detection covariates recorded at the time of the observation. (:question: Although, can we even integrate this information in CT models??)

`siteData`

The primary key of this dataframe is the column site
siteData contains the site covariates for the ecological submodel.
siteData lists all the sites in the study (if there are no detections of the target species at a site, that site can be absent from the obsData dataframe)

For example:

  site   habitat      elev
1    A    Forest 0.8497861
2    B Grassland 1.5632035
3    C      City 0.4787604

`deploymentData`

The primary key of this dataframe is the deployment in a site, so both columns site and deployment
deploymentData contains the time of the beginning and of the end of a deployment. begintime and endtime are mandatory.
It can contain information about the location of a deployment (e.g. longitude and latitude)
It can contain detection covariates that are related to the deployment (e.g. the model of the sensor, how it was configured, how it was set up, what environment a camera trap is facing, ...)
It lists all the deployments in the study (even those without any detection event, which therefore do not appear in obsData)

For example:

  site deployment           begintime             endtime    lat   lon camtrap_model camtrap_height
1    A          1 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076        Brand1           2.33
2    A          2 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076        Brand2           1.44
3    B          1 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092        Brand1           1.00
4    B          2 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092        Brand2           1.00
5    C          1 2023-12-12 08:00:00 2023-12-13 18:00:00 43.426 2.034        Brand1           1.28
6    C          2 2023-12-12 08:00:00                <NA> 43.426 2.034        Brand2           1.17
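
A minimal sketch of two integrity checks this table makes possible, in base R: the (site, deployment) key is unique, and every detection falls inside its deployment window. The small data frames are illustrative, and treating an unknown `endtime` (`NA`) as "not flagged" is an assumption.

```r
deploymentData <- data.frame(
  site = c("A", "A", "B"),
  deployment = c(1, 2, 1),
  begintime = as.POSIXct(rep("2023-12-12 08:00:00", 3), tz = "UTC"),
  endtime = as.POSIXct(c("2023-12-13 18:00:00", "2023-12-13 18:00:00", NA),
                       tz = "UTC")
)
obsData <- data.frame(
  site = c("A", "A", "B"),
  deployment = c(1, 2, 1),
  obstime = as.POSIXct(c("2023-12-12 08:23:42", "2023-12-14 05:35:13",
                         "2023-12-12 15:17:34"), tz = "UTC")
)

# The (site, deployment) pair must identify a single row of deploymentData.
key <- paste(deploymentData$site, deploymentData$deployment)
stopifnot(!anyDuplicated(key))

# Attach the deployment window to every detection event.
merged <- merge(obsData, deploymentData,
                by = c("site", "deployment"), all.x = TRUE)

# A detection is inside the window if its deployment exists, it is not before
# begintime and, when the end is known, not after endtime.
inside <- !is.na(merged$begintime) &
  merged$obstime >= merged$begintime &
  (is.na(merged$endtime) | merged$obstime <= merged$endtime)
outside <- merged[!inside, c("site", "deployment", "obstime")]
outside
```

Here the second detection is dated after the end of deployment 2 at site A, so it is the only row returned in `outside`.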

Detection covariates

This is the part I'm the least convinced by: it has lots of flaws, but I do not have any better idea for now. I also don't think I'm fully comfortable with how to integrate detection covariates in CT models, so I've probably missed important things.

`obsCovsContinuous` (optional)

For continuous-time covariates (e.g. temperature, hygrometry) that can be measured at time t.

The primary key for this table is the combination of site, deployment, and the time t of the measurement
More columns are added for the covariates

For example:

   site deployment                time temperature hygrometry
1     A          1 2023-12-12 08:00:00       19.80      77.09
2     A          1 2023-12-12 08:10:00       11.24         NA
3     A          1 2023-12-12 08:20:00       14.66         NA
4     A          1 2023-12-12 08:30:00       13.01         NA
5     A          1 2023-12-12 08:40:00       11.57         NA
6     A          1 2023-12-12 08:50:00       21.60         NA
7     A          1 2023-12-12 09:00:00        5.73      49.08
8     A          1 2023-12-12 09:10:00       14.90         NA
9     A          1 2023-12-12 09:20:00       14.14         NA
10    A          1 2023-12-12 09:30:00       22.17         NA
 [ reached 'max' / getOption("max.print") -- omitted 1184 rows ]

`obsCovsBinned` (optional)

For observation covariates that are not in continuous time but binned (e.g. rainfall is necessarily measured over an interval of time). Other environmental covariates can have an impact on detection, and if the sampling plan did not include sensors capable of measuring them, they can usually be retrieved from other data suppliers, often by day or by hour.

The primary key for this table is the combination of site, deployment, and the time bin (here fully defined by begintime and endtime, but this is not ideal and I'm sure it could be simplified)
More columns are added for the covariates
If some covariates are defined per hour and others per day, obsCovsBinned must be a list of two dataframes

For example:

   site deployment           begintime             endtime hourly_rainfall
1     A          1 2023-12-12 08:00:00 2023-12-12 09:00:00            0.10
2     A          1 2023-12-12 09:00:00 2023-12-12 10:00:00            0.10
3     A          1 2023-12-12 10:00:00 2023-12-12 11:00:00            0.00
4     A          1 2023-12-12 11:00:00 2023-12-12 12:00:00            0.04
5     A          1 2023-12-12 12:00:00 2023-12-12 13:00:00            0.14
 [ reached 'max' / getOption("max.print") -- omitted 199 rows ]

:question: Things I don't like about this format

It can take some time for the user to format
This needs lots of verifications and tests to make sure we know how to match data and integrate them into models.
How these data are integrated into model covariates requires lots of decision making that should probably not be opaque for the user (especially for obsCovsContinuous)

So if you have other format ideas on how to integrate detection covariates that are both user friendly and possible to integrate into models, that'll be great!

Compatibility

With the `unmarkedFrame` mother class

We only need to create a y matrix. This can be the number of detections per deployment (column) for each site (row). This is not data given by the user but is created automatically in the function that creates the unmarkedFrameContinuous object.
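
A minimal sketch of how such a y matrix could be derived, in base R. It is not an existing unmarked function; the inputs reproduce the site and deployment ids of the examples above, and taking factor levels from `deploymentData` is an assumption made so that sites without detections still get a row.

```r
obsData <- data.frame(
  site = c("A", "A", "A", "A", "B", "B", "B"),
  deployment = c(1, 1, 2, 2, 1, 1, 1)
)
deploymentData <- data.frame(
  site = rep(c("A", "B", "C"), each = 2),
  deployment = rep(1:2, times = 3)
)

# Use the levels from deploymentData so that sites and deployments without
# any detection still get a cell (with count 0).
site_f <- factor(obsData$site, levels = unique(deploymentData$site))
dep_f <- factor(obsData$deployment,
                levels = sort(unique(deploymentData$deployment)))

# Rows are sites, columns are deployments, cells are detection counts.
y <- unclass(table(site = site_f, deployment = dep_f))
y
```

This sketch assumes every site has the same set of deployment ids. If a site has fewer deployments, the corresponding cells should probably be `NA` rather than 0, which is not handled here.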

With other packages and tools

I think the dataframes obsData, siteData and deploymentData are easily compatible with other packages (e.g. camtrapR) and tools (e.g. Wildlife Insights exports). I don't know of formats that use detection covariates in continuous time.

kenkellner commented 6 months ago

Thanks, this is great. I'm still thinking it through but here are a few thoughts.

I agree that it makes sense to make a base version of this frame and then make child versions for the specific functions
I am not sure that we need to explicitly distinguish siteCovs from deploymentData. I think we could get away with just siteCovs (or some other name) recognizing that we may have two or more rows of this data frame technically at the same geographic "site". This is already a pretty common situation for unmarked users ("stacked" datasets). And my understanding of these models is in the actual likelihood there aren't two separate "site" and "deployment" layers anyway - deployments are the unit of replication. But I very well might be missing something here. "Site" could be included as a covariate in this data frame to account for repeated deployments at the same site, which could be handled with a random effect if we implement that.
Instead of having obsCovsBinned, could users just record these covariates in the same way as obsCovsContinuous, i.e. at each 1 hour timestamp record the amount of rain in the previous hour?

Just trying to poke around the model ease-of-use/flexibility trade-off here.

leapautrel commented 5 months ago

I based my proposal on relational database principles, focusing on data integrity with no duplicates, so I'm not surprised it's not the most user-friendly, and I absolutely agree that we should find a better trade-off between ease-of-use and flexibility!

Regarding `siteCovs` and `deploymentData`

I really focused on separating covariates impacting the ecological process (e.g. occupancy) from covariates impacting the detection process, like it's done with other unmarkedFrame objects. That's why I separated siteData (occupancy covariates) from deploymentData (detection covariates).

Combining them could be done if we include some data integrity checks that depend on the model, so if you think that splitting occupancy from detection covariates is not necessary, and that this format is more user-friendly for unmarked users, I think this is a good idea.

The check I have in mind is for static occupancy models, with the occupancy state constant at each site, as those are the ones I'm looking to implement. $\psi$, the occupancy state, depends on occupancy covariates (given by the user in the psiformula argument), e.g. habitat and elev here. Occupancy is associated with a site. Therefore, in those models we should check that the occupancy covariates for a given site are always the same. This is not necessarily true for all models: in dynamic occupancy models, occupancy covariates could change over time.
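
A minimal sketch of this check in base R, assuming the site and deployment information is held in one combined data frame (called `combined` here for illustration). `inconsistent_sites` is a hypothetical helper, and the covariate names are extracted from a formula with `all.vars()`.

```r
# Illustrative combined table: habitat is inconsistent for site B.
combined <- data.frame(
  site = c("A", "A", "B", "B"),
  deployment = c(1, 2, 1, 2),
  habitat = c("Forest", "Forest", "Grassland", "City"),
  elev = c(0.85, 0.85, 1.56, 1.56)
)

# Hypothetical helper: for each covariate, the sites where it takes more than
# one distinct value across rows.
inconsistent_sites <- function(data, covariates, site_col = "site") {
  bad <- lapply(covariates, function(cv) {
    n_values <- tapply(data[[cv]], data[[site_col]],
                       function(x) length(unique(x)))
    names(n_values)[n_values > 1]
  })
  names(bad) <- covariates
  bad
}

# The covariates to check would come from the occupancy formula.
psiformula <- ~ habitat + elev
res <- inconsistent_sites(combined, all.vars(psiformula))
res
```

For a static occupancy model, a non-empty entry in `res` (here `habitat` for site B) would be reported to the user as an error or a warning.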

Regarding `obsCovsBinned` and `obsCovsContinuous`

Definitely, I think this would make the data input much simpler. Your suggestion made me realize how to address an issue I had with my previous suggestion. I didn't explicitly explain how to link detection events to observation covariates when they don't occur at the exact same time. We could do this by adding an argument in the unmarkedFrame constructor function, to specify how to link a detection time to an observation covariate, for each observation covariate. I thought of 3 options from which the user could choose: "before", "after", "linear", as in the example below (with 2 detections at 8h40 and 9h15, and temperature measured each hour):

"before" and "after" would be suited for binned covariates, and "linear" for continuous covariates. The user could specify this in the function that creates the unmarkedFrame, with a named vector for example: c("hourly_rainfall"="before", "temperature"="linear", "hygrometry"="linear").
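
A minimal sketch of the three linking options using `stats::approx()`. `link_covariate` is a hypothetical helper and the temperature values are made up. The interpretation is an assumption: "before" takes the last measurement at or before the detection, "after" takes the first measurement at or after it, and "linear" interpolates between the two.

```r
# Hypothetical helper: value of one covariate at each detection time.
link_covariate <- function(obstime, covtime, value,
                           method = c("before", "after", "linear")) {
  method <- match.arg(method)
  t_obs <- as.numeric(obstime)   # seconds since epoch
  t_cov <- as.numeric(covtime)
  switch(method,
    # Step function holding the last measurement at or before the detection.
    before = approx(t_cov, value, xout = t_obs, method = "constant", f = 0)$y,
    # Step function taking the first measurement at or after the detection.
    after = approx(t_cov, value, xout = t_obs, method = "constant", f = 1)$y,
    # Linear interpolation between the two surrounding measurements.
    linear = approx(t_cov, value, xout = t_obs, method = "linear")$y
  )
}

# Temperature measured each hour, detections at 08:40 and 09:15.
cov_time <- as.POSIXct(c("2023-12-12 08:00:00", "2023-12-12 09:00:00",
                         "2023-12-12 10:00:00"), tz = "UTC")
temperature <- c(10, 13, 17)
det_time <- as.POSIXct(c("2023-12-12 08:40:00", "2023-12-12 09:15:00"),
                       tz = "UTC")

linked <- sapply(c("before", "after", "linear"), function(m) {
  link_covariate(det_time, cov_time, temperature, method = m)
})
linked
```

By default, `approx()` drops measurements with missing values and returns `NA` for detections outside the measured range. Both behaviours would need an explicit decision in a real implementation.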

Summary of unmarkedFrameContinuous data with those updates

If we homogenize the time-varying covariates, we should remove the covariates that are recorded at the time of the observation (temperature was present in obsData in my example above). Other information also moves (e.g. season is associated with the deployment, not the observation).
The site and deployment data are combined. I propose surveyData for its name, as I feel it can encompass both site data and sensor deployment details. But I don't mind using deploymentData or sensorData or another name if you prefer. I have some reservations about siteData, as I feel like this would be too close to siteCovs in other unmarked models, which does not include information about how the survey was done (notably begintime and endtime, which are irrelevant for discrete data).
The covariates that vary over time are combined into one dataframe. Because surveyData contains observation covariates (e.g. the camera trap model), I suggest timeData (for consistency with other names) but I don't mind using obsCovs either.
We add timeDataLink, a named vector; all observation covariates that are not present in this vector are linked to detection time by the default option.

$obsData
  site deployment             obstime  species
1    A          1 2023-12-12 08:23:42 Roe deer
2    A          1 2023-12-12 11:41:53 Roe deer
4    A          2 2023-12-12 14:42:18 Roe deer
3    A          2 2023-12-13 05:35:13 Roe deer
5    B          1 2023-12-12 15:17:34 Roe deer
7    B          1 2023-12-12 18:36:32 Roe deer
6    B          1 2023-12-13 07:09:02 Roe deer

$surveyData
  site deployment   habitat      elev           begintime             endtime    lat   lon camtrap_model camtrap_height season
1    A          1    Forest 0.8497861 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076        Brand1           2.33      1
2    A          2    Forest 0.8497861 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076        Brand2           1.44      1
3    B          1 Grassland 1.5632035 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092        Brand1           1.00      1
4    B          2 Grassland 1.5632035 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092        Brand2           1.00      1
5    C          1      City 0.4787604 2023-12-12 08:00:00 2023-12-13 18:00:00 43.426 2.034        Brand1           1.28      1
6    C          2      City 0.4787604 2023-12-12 08:00:00                <NA> 43.426 2.034        Brand2           1.17      1

$timeData
   site deployment                time temperature hygrometry hourly_rainfall
1     A          1 2023-12-12 08:00:00       21.56      31.66             0.1
2     A          1 2023-12-12 08:10:00       13.67         NA              NA
3     A          1 2023-12-12 08:20:00       17.72         NA              NA
4     A          1 2023-12-12 08:30:00       12.93         NA              NA
5     A          1 2023-12-12 08:40:00       12.62         NA              NA
6     A          1 2023-12-12 08:50:00       11.06         NA              NA
7     A          1 2023-12-12 09:00:00       12.03      56.78             0.1
...

$timeDataLink
hourly_rainfall     temperature      hygrometry 
       "before"       "linear"        "linear"
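
A minimal sketch of how timeDataLink could be completed with the default option for covariates the user did not list. `complete_link` is a hypothetical helper, and using "linear" as the default is an assumption, since the default is not fixed in this proposal.

```r
# Hypothetical helper: complete the user's link vector with a default method.
complete_link <- function(timeData, timeDataLink = character(0),
                          default = "linear",  # assumed default
                          key_cols = c("site", "deployment", "time")) {
  allowed <- c("before", "after", "linear")
  # Every non-key column of timeData is treated as a covariate.
  covs <- setdiff(names(timeData), key_cols)
  unknown <- setdiff(names(timeDataLink), covs)
  if (length(unknown) > 0) {
    stop("timeDataLink names not found in timeData: ",
         paste(unknown, collapse = ", "))
  }
  if (!all(timeDataLink %in% allowed)) {
    stop("link methods must be one of: ", paste(allowed, collapse = ", "))
  }
  link <- setNames(rep(default, length(covs)), covs)
  if (length(timeDataLink) > 0) {
    link[names(timeDataLink)] <- timeDataLink
  }
  link
}

timeData <- data.frame(
  site = "A", deployment = 1,
  time = as.POSIXct("2023-12-12 08:00:00", tz = "UTC"),
  temperature = 21.56, hygrometry = 31.66, hourly_rainfall = 0.1
)
link <- complete_link(timeData, c(hourly_rainfall = "before"))
link
```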

Do you think this format is better, more user-friendly?
