ropensci / spatsoc

:package: spatsoc is an R package for detecting spatial and temporal groups in GPS relocations.
https://docs.ropensci.org/spatsoc
GNU General Public License v3.0
24 stars 2 forks source link

No NULL default for `timegroup` arg in `group_pts` #44

Closed kaijagahm closed 1 year ago

kaijagahm commented 1 year ago

Hi all--thanks for this great package! I'm making extensive use of it in my analysis of vulture GPS tracking data.

For most of my data, I use the recommended workflow, grouping the data first by time and then by space, using group_times and then group_pts.

But right now I have a special case where I would like to just use group_pts, since my data are already split by date and I would like to end up with single-day groups. There is no full timestamp data, just dates.

The documentation for group_pts() says this should be possible:

The timegroup argument is optional, but recommended to pair with group_times.

But when I try to use the function without specifying a value for the timegroup argument, there is no default value, so I get the error message

Error in spatsoc::group_pts(dataset, threshold = distThreshold, id = idCol,  : 
  timegroup required

I played around, and it looks like the function operates as expected if I manually pass timegroup = NULL. But I don't think I should have to do this: if timegroup is truly optional, then NULL should be the default. By extension, I think this part of the group_pts code should be removed:

if (missing(timegroup)) {
    stop('timegroup required')
  }

I know I haven't provided a full reproducible example here--I can come up with one if needed, but this seems like a pretty straightforward case of just needing to change the default value for an argument. I'm basically prepared to submit a PR, but wanted to run this by you in case there's a reason I'm missing as to why we can't allow timegroup = NULL. (And if there is some such reason, then the doc should be updated to no longer say that timegroup is optional.

Thanks for your consideration!

kaijagahm commented 1 year ago

Seems like a NULL default would also be appropriate for edge_dist(), since its docs also mention that the timegroup arg is optional.

robitalec commented 1 year ago

Hello @kaijagahm,

Good catch with the inconsistencies in the documentation - timegroup is described as being optional for group_pts in the documentation but you are right it is currently required (group_pts.R#L116-L118). The reason we would want timegroup to be required is a safeguard - if we accidentally gave group_pts a data.table of thousands of rows to calculate pairwise distances, we could quickly hit an upper limit to user's machines.

I'll get back to you shortly, I'm working on clarifying this in the docs and fixing a related issue. There is a simple workaround for already grouped data, I'll share too!

robitalec commented 1 year ago

Ok first up - the workaround:

# === Timegroups workaround -----------------------------------------------
# Alec L. Robitaille

# Packages ----------------------------------------------------------------
library(spatsoc)
library(data.table)

# Data --------------------------------------------------------------------
# Read example data
DT <- fread(system.file("extdata", "DT.csv", package = "spatsoc"))

# Prepare variables -------------------------------------------------------
# Cast the character column to POSIXct
DT[, datetime := as.POSIXct(datetime, tz = 'UTC')]

# Group times -------------------------------------------------------------
# Typically, we'd use group_times for temporal grouping
# group_times(DT, datetime = 'datetime', threshold = '20 minutes')

# Instead of using group_times, we'll make our own time group
# For example, say we wanted to group based on the day of year 
# (note this can be accomplished by group_times with a 1 day threshold too)

# Add the day of year as a column
DT[, doy := yday(datetime)]

# Select only the first row for each individual on each day, 
#  and save this to a new data.table "DT_sub"
# (Just for this example, where we need one observation per individual per timegroup)
DT_sub <- DT[, first(.SD), by = .(ID, doy)]

# Spatial groups ----------------------------------------------------------
# With custom timegroup, in this case "doy"
group_pts(
  DT_sub,
  threshold = 5,
  id = 'ID',
  coords = c('X', 'Y'),
  timegroup = 'doy'
)
#>       ID doy        X       Y            datetime population group
#>    1:  A 306 715851.4 5505340 2016-11-01 00:00:54          1     1
#>    2:  A 307 715558.2 5505518 2016-11-02 00:00:54          1     2
#>    3:  A 308 714955.6 5505223 2016-11-03 00:01:42          1     3
#>    4:  A 309 713800.1 5505434 2016-11-04 00:01:23          1     4
#>    5:  A 310 715091.4 5505312 2016-11-05 00:01:44          1     5
#>   ---                                                             
#> 1196:  J  55 697605.8 5511058 2017-02-24 00:00:48          1  1165
#> 1197:  J  56 697783.8 5511689 2017-02-25 00:00:55          1  1166
#> 1198:  J  57 697835.3 5510694 2017-02-26 00:00:48          1   594
#> 1199:  J  58 700692.0 5508661 2017-02-27 00:00:54          1  1167
#> 1200:  J  59 700684.6 5508745 2017-02-28 00:00:48          1  1168

Created on 2023-02-01 with reprex v2.0.2

So we are using a custom field, in this case "doy", and since the example data used has multiple observations per individual per day, we subset only the first of each day. This kind of workaround might be useful for focal observations where we have observation ids or something similar. Out of interest, what's your use case that already has time groups defined?

robitalec commented 1 year ago

Second, the documentation inconsistencies with the functions checks: #46

I changed the manual to clarify that timegroups is strongly recommended for group_pts, edge_nn and edge_dist but doesn't necessarily need to come directly from group_times. Let me know if you think that is clear, and reasonable (given the above workaround as well)

https://robitalec.r-universe.dev/spatsoc/doc/manual.html#group_pts

robitalec commented 1 year ago

Lastly, if you really wanted to ignore timegroups or set it to a ~NULL equivalent, you could force spatsoc to consider all rows in the same timegroup using a constant across all rows. I'm not really sure where this would be relevant, but in case it is, here I make up some fake data and group it, using a timegroup set to 1 for all rows.

# === One timegroup -------------------------------------------------------
# Alec L. Robitaille

# Load packages -----------------------------------------------------------
library(data.table)
library(spatsoc)

# Fake data ---------------------------------------------------------------
DT <- CJ(id = LETTERS[1:5], 
         x = 1, 
         y = 1)

DT[, x := x + sample(seq.int(5), .N, replace = TRUE)]
DT[, y := y + sample(seq.int(5), .N, replace = TRUE)]

print(DT)
#>    id x y
#> 1:  A 2 6
#> 2:  B 5 6
#> 3:  C 6 3
#> 4:  D 2 6
#> 5:  E 2 6

# Set timegroup -----------------------------------------------------------
# Set a timegroup for all rows to the same value
DT[, timegroup := 1]

# Group points ------------------------------------------------------------
group_pts(
  DT,
  threshold = 1,
  id = 'id',
  coords = c('x', 'y'),
  timegroup = 'timegroup'
)
#>    id x y timegroup group
#> 1:  A 2 6         1     1
#> 2:  B 5 6         1     2
#> 3:  C 6 3         1     3
#> 4:  D 2 6         1     1
#> 5:  E 2 6         1     1

Created on 2023-02-02 with reprex v2.0.2

Let me know how this sounds and if it works for your particular use case. I really appreciate you opening the issue and highlighting this inconsistency!

(Edit: missed library(spatsoc) call in reprex)

robitalec commented 1 year ago

Closing this but please let me know if you have other comments or concerns and we can reopen

kaijagahm commented 1 year ago

Thanks! Oops, I had missed the part at the end where you asked me to let you know if it worked. I think I've got it working now, and I appreciate very much your thorough response.