Open leapautrel opened 6 months ago
Thanks, this is great. I'm still thinking it through but here are a few thoughts.
- `siteCovs` from `deploymentData`. I think we could get away with just `siteCovs` (or some other name) recognizing that we may have two or more rows of this data frame technically at the same geographic "site". This is already a pretty common situation for `unmarked` users ("stacked" datasets). And my understanding of these models is in the actual likelihood there aren't two separate "site" and "deployment" layers anyway - deployments are the unit of replication. But I very well might be missing something here. "Site" could be included as a covariate in this data frame to account for repeated deployments at the same site, which could be handled with a random effect if we implement that.
- `obsCovBinned`: could users just record these covariates in the same way as `obsCovsContinuous`, i.e. at each 1 hour timestamp record the amount of rain in the previous hour?

Just trying to poke around the model ease-of-use/flexibility trade-off here.
I based my proposal on relational database principles, focusing on data integrity with no duplicates, so I'm not surprised it's not the most user-friendly, and I absolutely agree that we should find a better trade-off between ease of use and flexibility!
`siteCovs` and `deploymentData`
I really focused on separating covariates impacting the ecological process (e.g. occupancy) from covariates impacting the detection process, as is done with other `unmarkedFrame` objects. That's why I separated `siteData` (occupancy covariates) from `deploymentData` (detection covariates).
Combining them could be done if we include some data integrity checks that depend on the model, so if you think that splitting occupancy from detection covariates is not necessary, and that this format is more user-friendly for `unmarked` users, I think this is a good idea.
The check I have in mind is for static occupancy models, with the occupancy state constant in each site, as those are the ones I'm looking to implement. $\psi$, the occupancy state, depends on occupancy covariates (given by the user in the `psiformula` argument), e.g. `habitat` and `elev` here. Occupancy is associated with a site. Therefore, in those models we should check that the occupancy covariates for a given site are always the same. This is not necessarily true for all models: in dynamic occupancy models, occupancy covariates could change over time.
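The check described above could look roughly like the sketch below. It is only an illustration in base R: `check_site_constant` is a hypothetical helper (not an existing `unmarked` function), and the toy data frame is modelled on the `surveyData` example from this thread, with one inconsistency added on purpose.

```r
# Hypothetical helper (not part of unmarked): for static occupancy models,
# list the sites where an occupancy covariate takes more than one value.
check_site_constant <- function(data, site_col, covs) {
  bad <- lapply(covs, function(cov) {
    # Number of distinct values of this covariate within each site
    n_distinct <- tapply(data[[cov]], data[[site_col]],
                         function(x) length(unique(x)))
    names(n_distinct)[as.vector(n_distinct > 1)]
  })
  names(bad) <- covs
  # Keep only the covariates that vary within at least one site
  bad[lengths(bad) > 0]
}

# Toy data: one row per deployment, site B has two different habitats
surveyData <- data.frame(
  site       = c("A", "A", "B", "B"),
  deployment = c(1, 2, 1, 2),
  habitat    = c("Forest", "Forest", "Grassland", "City"),
  elev       = c(0.85, 0.85, 1.56, 1.56)
)

problems <- check_site_constant(surveyData, "site", c("habitat", "elev"))
print(problems)  # reports habitat as inconsistent for site B
```

Such a check would only be run for the covariates named in the occupancy formula, and skipped for models where occupancy covariates are allowed to change over time.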
`obsCovBinned` and `obsCovContinuous`
Definitely, I think this would make the data input much simpler. Your suggestion made me realize how to address an issue I had with my previous suggestion. I didn't explicitly explain how to link detection events to observation covariates when they don't occur at the exact same time. We could do this by adding an argument to the unmarkedFrame constructor function that specifies, for each observation covariate, how to link a detection time to it. I thought of 3 options from which the user could choose: "before", "after", "linear", as in the example below (with 2 detections at 8h40 and 9h15, and temperature measured each hour):
"before" and "after" would be suited for binned covariates, and "linear" for continuous covariates. The user could specify this in the function that creates the unmarkedFrame, with a named vector for example: `c("hourly_rainfall"="before", "temperature"="linear", "hygrometry"="linear")`.
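To make the three options concrete, here is a minimal base R sketch. `link_covariate` is a hypothetical helper, not an existing function. It assumes that "before" means the most recent measurement at or before the detection, "after" the first measurement at or after it, and "linear" a linear interpolation between the two. The temperature values are made up for illustration.

```r
# Hypothetical helper: covariate value at each detection time under one of the
# three proposed linking rules. cov_time must be sorted in increasing order.
link_covariate <- function(det_time, cov_time, cov_value,
                           method = c("before", "after", "linear")) {
  method <- match.arg(method)
  t_det <- as.numeric(det_time)  # seconds since epoch
  t_cov <- as.numeric(cov_time)
  switch(method,
    # step function taking the value of the measurement to the left
    before = approx(t_cov, cov_value, xout = t_det,
                    method = "constant", f = 0)$y,
    # step function taking the value of the measurement to the right
    after  = approx(t_cov, cov_value, xout = t_det,
                    method = "constant", f = 1)$y,
    # linear interpolation between the two neighbouring measurements
    linear = approx(t_cov, cov_value, xout = t_det, method = "linear")$y
  )
}

# Two detections at 08:40 and 09:15, temperature measured each hour
det  <- as.POSIXct(c("2023-12-12 08:40:00", "2023-12-12 09:15:00"), tz = "UTC")
tm   <- as.POSIXct(c("2023-12-12 08:00:00", "2023-12-12 09:00:00",
                     "2023-12-12 10:00:00"), tz = "UTC")
temp <- c(10, 13, 16)  # made-up values

link_covariate(det, tm, temp, "before")  # 10 and 13
link_covariate(det, tm, temp, "after")   # 13 and 16
link_covariate(det, tm, temp, "linear")  # 12 and 13.75
```

Detections outside the range of measured times come back as `NA` here, which is the default behaviour of `approx()`; whether that should be an error instead is a design choice left open.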
- `obsData` in my example above). Other information also moves (e.g. `season` is associated with the deployment, not the observation)
- `surveyData` for its name, as I feel it can encompass both site data and sensor deployment details. But I don't mind using `deploymentData` or `sensorData` or another name if you prefer. I have some reservations about `siteData`, as I feel like this would be too close to `siteCovs` in other unmarked models, which does not include information about how the survey was done (notably begintime and endtime, which are irrelevant for discrete data).
- `surveyData` contains observation covariates (e.g. the camera trap model), I suggest `timeData` (for consistency with other names) but I don't mind using `obsCovs` either.
- `timeDataLink`, a named vector, and all observation covariates that are not present in this vector are linked to detection time by the default option.
$obsData
site deployment obstime species
1 A 1 2023-12-12 08:23:42 Roe deer
2 A 1 2023-12-12 11:41:53 Roe deer
4 A 2 2023-12-12 14:42:18 Roe deer
3 A 2 2023-12-13 05:35:13 Roe deer
5 B 1 2023-12-12 15:17:34 Roe deer
7 B 1 2023-12-12 18:36:32 Roe deer
6 B 1 2023-12-13 07:09:02 Roe deer
$surveyData
site deployment habitat elev begintime endtime lat lon camtrap_model camtrap_height season
1 A 1 Forest 0.8497861 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076 Brand1 2.33 1
2 A 2 Forest 0.8497861 2023-12-12 08:00:00 2023-12-13 18:00:00 43.442 2.076 Brand2 1.44 1
3 B 1 Grassland 1.5632035 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092 Brand1 1.00 1
4 B 2 Grassland 1.5632035 2023-12-12 08:00:00 2023-12-13 18:00:00 43.453 2.092 Brand2 1.00 1
5 C 1 City 0.4787604 2023-12-12 08:00:00 2023-12-13 18:00:00 43.426 2.034 Brand1 1.28 1
6 C 2 City 0.4787604 2023-12-12 08:00:00 <NA> 43.426 2.034 Brand2 1.17 1
$timeData
site deployment time temperature hygrometry hourly_rainfall
1 A 1 2023-12-12 08:00:00 21.56 31.66 0.1
2 A 1 2023-12-12 08:10:00 13.67 NA NA
3 A 1 2023-12-12 08:20:00 17.72 NA NA
4 A 1 2023-12-12 08:30:00 12.93 NA NA
5 A 1 2023-12-12 08:40:00 12.62 NA NA
6 A 1 2023-12-12 08:50:00 11.06 NA NA
7 A 1 2023-12-12 09:00:00 12.03 56.78 0.1
...
$timeDataLink
hourly_rainfall temperature hygrometry
"before" "linear" "linear"
Do you think this format is better, more user-friendly?
Hi! For continuous-time models, we need a new `unmarkedFrame` class. I thought that we could create a subclass of `unmarkedFrame` (e.g. `unmarkedFrameContinuous`) as a base, and then add subclasses of this class for specific models. Here are my proposed specifications for this class.
What data do we need?
For each detection event, we absolutely need:
- The time of the detection
- The site id
- The deployment id. A deployment is a unique spatial and temporal placement of a sensor with uninterrupted data recording. This information is useful if there are several deployments per site, e.g. the ARUs were only switched on at night (1 night = 1 deployment); the camtrap stopped for a week because its battery died; two camtraps were set up in the same site...
- The beginning and the end of the deployment. If they are not known (e.g. the battery died), we can approximate them by the times of the first and last triggers (e.g. first and last photo for a camtrap), all species combined. If the first or the last detection is of the species of interest, this changes the likelihood, so we should keep this information.

Depending on the model, we could also need other information: the species, the season...
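As a rough illustration of the fallback for unknown deployment bounds, the base R sketch below fills a missing `begintime` or `endtime` with the first or last trigger of that deployment, and keeps a flag recording that the bound was approximated. `fill_deployment_bounds`, the `triggers` table (one row per trigger, all species combined) and the `begin_approx` / `end_approx` columns are all hypothetical names.

```r
# Hypothetical helper: replace missing deployment bounds by the first/last
# trigger of that deployment, and flag the bounds that were missing.
fill_deployment_bounds <- function(deployments, triggers) {
  deployments$begin_approx <- is.na(deployments$begintime)
  deployments$end_approx   <- is.na(deployments$endtime)
  for (i in seq_len(nrow(deployments))) {
    # Triggers (all species combined) recorded by this deployment
    t_i <- triggers$time[triggers$site == deployments$site[i] &
                         triggers$deployment == deployments$deployment[i]]
    if (length(t_i) == 0) next  # no trigger: a missing bound stays NA
    if (deployments$begin_approx[i]) deployments$begintime[i] <- min(t_i)
    if (deployments$end_approx[i])   deployments$endtime[i]   <- max(t_i)
  }
  deployments
}

# Toy data: the second deployment has no known end (e.g. the battery died)
deployments <- data.frame(
  site       = c("C", "C"),
  deployment = c(1, 2),
  begintime  = as.POSIXct(c("2023-12-12 08:00:00", "2023-12-12 08:00:00"),
                          tz = "UTC"),
  endtime    = as.POSIXct(c("2023-12-13 18:00:00", NA), tz = "UTC")
)
triggers <- data.frame(
  site       = c("C", "C", "C"),
  deployment = c(2, 2, 2),
  time       = as.POSIXct(c("2023-12-12 09:12:00", "2023-12-12 22:40:00",
                            "2023-12-13 06:30:00"), tz = "UTC")
)

filled <- fill_deployment_bounds(deployments, triggers)
print(filled)
```

Keeping the flags matters because, as noted above, a bound that coincides with a detection of the species of interest changes the likelihood.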
Data provided by the user
In my proposal, here are the data the user should provide to create an `unmarkedFrameContinuous` object. I split them into several data frames, as this seems to be the most logical and safe approach (relational database-like) and is actually how sensor data are organised in the tools I know of (such as the camtrapR R package and the Wildlife Insights exports).
`obsData`
`obsData` contains the observation data. I do not call it `y` because it does not match the format of `y` in other unmarkedFrame objects. It can also contain covariates that are recorded at the time of the observation, such as the temperature, often measured by camera traps. For example:
There is one row per observation, per detection event, described in my example by three mandatory columns: site, deployment and obstime.
Other optional columns could be mandatory for certain types of models:
And columns for detection covariates recorded at the time of the observation. (:question: Although, can we even integrate this information in CT models??)
`siteData`
`site`
`siteData` contains the site covariates for the ecological submodel.
`siteData` lists all the sites in the study (if there is no detection of the target species in a site, it can be absent from the `obsData` dataframe)
For example:
`deploymentData`
`site` and `deployment`
`deploymentData` contains the time of the beginning and of the end of a deployment. `begintime` and `endtime` are mandatory.
`obsData` do not include this deployment)
For example:
Detection covariates
This is the part I'm the least convinced by. It has lots of flaws, but I do not have any better idea for now. I also don't think I'm fully comfortable with how to integrate detection covariates in CT models, so I've probably missed important things.
`obsCovsContinuous` (optional)
For continuous-time covariates (e.g. temperature, hygrometry) that can be measured at time t.
`site`, `deployment`, and the time t of the measurement
For example:
`obsCovsBinned` (optional)
For observation covariates that are not in continuous time but binned (e.g. rainfall is necessarily measured over an interval of time). Other environmental covariates can have an impact on detection, and if the sampling plan did not include sensors capable of measuring them, they can usually be retrieved from other data suppliers, often by day or by hour.
`site`, `deployment`, and the time bin (here fully defined by `begintime` and `endtime` but this is not ideal and I'm sure it could be simplified)
`obsCovsBinned` must be a list of two dataframes
For example:
:question: Things I don't like about this format
`obsDataContinuous`)
So if you have other format ideas on how to integrate detection covariates that are both user-friendly and possible to integrate into models, that would be great!
Compatibility
With the `unmarkedFrame` parent class
We only need to create a `y` matrix. This can be the number of detections per deployment (column) for each site (row). This is not data given by the user but created automatically in the function that creates the `unmarkedFrameContinuous` object.
With other packages and tools
I think the dataframes `obsData`, `siteData` and `deploymentData` are easily compatible with other packages (e.g. `camtrapR`) and tools (e.g. Wildlife Insights exports). I don't know of formats that use detection covariates in continuous time.
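For the `y` matrix needed for compatibility with the `unmarkedFrame` parent class, one possible construction is sketched below in base R. `build_y` is a hypothetical helper. It assumes a table listing every site and deployment (called `surveyData` here, as in the discussion above) so that deployments without detections get a 0, and it assumes that site/deployment combinations that were never surveyed should be `NA`.

```r
# Hypothetical helper: sites in rows, deployments in columns, each cell holding
# the number of detections.
build_y <- function(obs, survey) {
  sites <- sort(unique(survey$site))
  deps  <- sort(unique(survey$deployment))
  # Levels come from the survey table, so empty deployments are counted as 0
  counts <- table(factor(obs$site, levels = sites),
                  factor(obs$deployment, levels = deps))
  y <- matrix(as.integer(counts), nrow = length(sites),
              dimnames = list(as.character(sites), as.character(deps)))
  # Combinations absent from the survey table were never surveyed: NA, not 0
  surveyed <- table(factor(survey$site, levels = sites),
                    factor(survey$deployment, levels = deps)) > 0
  y[!surveyed] <- NA
  y
}

# Site and deployment columns taken from the obsData / surveyData example
obsData <- data.frame(
  site       = c("A", "A", "A", "A", "B", "B", "B"),
  deployment = c(1, 1, 2, 2, 1, 1, 1)
)
surveyData <- data.frame(
  site       = c("A", "A", "B", "B", "C", "C"),
  deployment = c(1, 2, 1, 2, 1, 2)
)

y <- build_y(obsData, surveyData)
print(y)
```

With these inputs, site A has 2 detections in each deployment, site B has 3 and 0, and site C has 0 and 0.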