jmsigner / amt


Mismatching units when calculating dt_ column in steps() function #1

Closed jballuffi closed 6 years ago

jballuffi commented 6 years ago

Hello, I have been working with your amt package to run a step selection function on animal movement data recorded from GPS collars. The package has been incredibly useful, and your guide, Fitting Step-Selection Functions with amt, was very easy to follow. However, I have come across a problem in the steps() function with regard to the dt_ calculations. I have been running the function across a data frame that includes multiple individuals, and the units of the dt_ values seem to vary between individuals. I have found a quick solution to this problem (listed below), but I thought I would bring it to your attention because some users might not notice it, especially when working with very large data sets.

Expected: when a function that wraps track() and steps() is run on a multi-individual data frame with by = ID, using a datetime column for the dt_ calculations, it returns a dt_ column with the same units for all individuals.

Actual: steps() function returns a dt_ column with hours as the unit for some individuals, and minutes as the unit for others.

Example on how to reproduce problem:

## wrap track() and steps() into one function called stepsfunction()
stepsfunction <- function(x.col, y.col, date.col) {
  track(x.col, y.col, date.col) %>%
    steps()
}
## run stepsfunction() on a data.table (dt) with multiple individuals labelled by "ID",
## using the data.table package to run by = ID
locs <- dt[, stepsfunction(x.col = EASTING, y.col = NORTHING, date.col = datetime),
           by = ID]
## return the range of the calculated dt_ values to look for noticeable differences between individuals
## in this case we expect the minimum dt_ to be ~2 hr because the data come from collars with a 2 hr fix rate
knitr::kable(locs[, range(dt_), by = ID])

Output: M003 and M009 are problematic. On closer inspection, their values were calculated in minutes even though they are labelled as hours.

ID V1
M002 1.965556 hours
M002 66.031944 hours
M003 29.350000 hours
M003 1590.483333 hours
M004 1.975556 hours
M004 35.999722 hours
M005 1.966667 hours
M005 22.000000 hours
M006 1.979167 hours
M006 111.999722 hours
M008 1.983333 hours
M008 37.983333 hours
M009 29.316667 hours
M009 2279.333333 hours
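
The pattern in this table is consistent with how base R's difftime() chooses units when units = "auto": the unit is picked from the smallest difference in the vector, so an individual with one short interval can end up in minutes while the others end up in hours. A minimal sketch with two made-up timestamp vectors (the animals and times are hypothetical, and this only illustrates the base R behaviour; it is an assumption that this is what happened inside steps() here):

```r
# Two hypothetical animals: "a" has regular 2 h fixes, "b" has one 30 min gap.
t_a <- as.POSIXct(c("2018-01-01 00:00:00", "2018-01-01 02:00:00",
                    "2018-01-01 04:00:00"), tz = "UTC")
t_b <- as.POSIXct(c("2018-01-01 00:00:00", "2018-01-01 00:30:00",
                    "2018-01-01 02:30:00"), tz = "UTC")

# With the default units = "auto", the unit is chosen from the smallest difference.
dt_a <- difftime(t_a[-1], t_a[-length(t_a)])
dt_b <- difftime(t_b[-1], t_b[-length(t_b)])
units(dt_a)  # "hours"
units(dt_b)  # "mins"

# Requesting an explicit unit makes the two vectors directly comparable.
dt_a_h <- difftime(t_a[-1], t_a[-length(t_a)], units = "hours")
dt_b_h <- difftime(t_b[-1], t_b[-length(t_b)], units = "hours")
as.numeric(dt_b_h)  # 0.5 2.0
```

If per-individual results with different units are then stacked into one table, the numeric values may be kept while only one unit label survives, which would explain minutes being displayed as hours.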

How to fix problem:

## add a line that replaces the dt_ column with a manually computed difftime() in an explicit unit of time
stepsfunction <- function(x.col, y.col, date.col) {
  trk <- track(x.col, y.col, date.col) %>%
    steps()
  trk$dt_ <- difftime(trk$t2_, trk$t1_, units = "hours")
  trk  # return the modified steps, not just the difftime vector
}
locs <- dt[, stepsfunction(x.col = EASTING, y.col = NORTHING, date.col = datetime),
           by = ID]
knitr::kable(locs[, range(dt_), by = ID])
ID V1
M002 1.9655556 hours
M002 66.0319444 hours
M003 0.4891667 hours
M003 26.5080556 hours
M004 1.9755556 hours
M004 35.9997222 hours
M005 1.9666667 hours
M005 22.0000000 hours
M006 1.9791667 hours
M006 111.9997222 hours
M008 1.9833333 hours
M008 37.9833333 hours
M009 0.4886111 hours
M009 37.9888889 hours

Session info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] bindrcpp_0.2 data.table_1.10.4-3 amt_0.0.2.0 survival_2.41-3
[5] forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4
[9] readr_1.1.1 tidyr_0.8.0 tibble_1.4.2 ggplot2_2.2.1
[13] tidyverse_1.2.1 raster_2.6-7 sp_1.2-7 lubridate_1.7.2

loaded via a namespace (and not attached):
[1] reshape2_1.4.3 splines_3.4.3 haven_1.1.1 lattice_0.20-35
[5] colorspace_1.3-2 yaml_2.1.17 utf8_1.1.3 rlang_0.2.0
[9] pillar_1.2.1 fitdistrplus_1.0-9 foreign_0.8-69 glue_1.2.0
[13] modelr_0.1.1 readxl_1.0.0 bindr_0.1 plyr_1.8.4
[17] munsell_0.4.3 gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2
[21] mvtnorm_1.0-7 psych_1.7.8 labeling_0.3 parallel_3.4.3
[25] broom_0.4.3 Rcpp_0.12.15 scales_0.5.0 jsonlite_1.5
[29] mnormt_1.5-5 hms_0.4.2 stringi_1.1.6 grid_3.4.3
[33] rgdal_1.2-16 cli_1.0.0 tools_3.4.3 magrittr_1.5
[37] lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.1 MASS_7.3-47
[41] Matrix_1.2-12 xml2_1.2.0 assertthat_0.2.0 httr_1.3.1
[45] rstudioapi_0.7 boot_1.3-20 R6_2.2.2 circular_0.4-93
[49] nlme_3.1-131 compiler_3.4.3

jmsigner commented 6 years ago

Thanks @jballuffi for reporting and proposing a solution. I will look into it and report back.

jmsigner commented 6 years ago

amt now uses difftime() in the steps() function. By default units = 'auto', but steps() gained a new argument, diff_time_units, with which the units can be specified. The updated version of the package is on GitHub and has been submitted to CRAN.

I saw you were using data.table with amt. Did this work smoothly? I haven't tried this at all.

jmsigner commented 6 years ago

@jballuffi A new version of amt is now on CRAN.

jballuffi commented 6 years ago

Great, I will start using the new package.

Yes, data.table has worked well with the package. I used it to run the functions by "ID". As far as I could tell, this wasn't possible within the amt package, so I turned to running the functions within data.table. However, I did have to manually create a new column combining the animal "ID" and "stepsid", because the generated step ID column had repeated values across individuals and would not work in the logistic regression.
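
The combined identifier described here can be sketched in base R as follows. This is only an illustration on a toy data frame: the column names ID, step_id_ and case_ and the new column stratum are assumptions, not taken from the actual data in this thread.

```r
# Toy stand-in for the stacked per-individual steps; all column names are assumed.
ssf_dat <- data.frame(
  ID       = rep(c("M002", "M003"), each = 4),
  step_id_ = rep(c(1, 1, 2, 2), times = 2),  # step IDs restart for each animal
  case_    = rep(c(TRUE, FALSE), times = 4)
)

# Step IDs alone are ambiguous: 2 distinct values for 4 real strata.
length(unique(ssf_dat$step_id_))

# Combine animal ID and step ID into a single identifier that is unique across animals.
ssf_dat$stratum <- paste(ssf_dat$ID, ssf_dat$step_id_, sep = "_")
length(unique(ssf_dat$stratum))
```

The new column can then be used wherever the model expects one identifier per observed step and its matched random steps.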

jmsigner commented 6 years ago

@jballuffi, you should be able to achieve the same with dplyr using list columns:

Something like this should work (untested):

dat %>% make_track(x, y, t, id = id) %>% nest(-id) %>%
  mutate(ssf = map(data, function(x) x %>% steps() %>% random_steps() %>% extract_covariates() %>% fit_ssf()))

This will create a new list column called ssf in the result, containing the results of an SSF for each individual.

jballuffi commented 6 years ago

@jmsigner thank you for this code!