kaijagahm / vultureUtils

Utility functions for working with vulture data

Better documentation of how get*Edges handles multiple fixes per timegroup #115

Open kaijagahm opened 1 year ago

kaijagahm commented 1 year ago

I've mostly been able to avoid this issue so far: when timeThreshold = "10 minutes" and the GPS fix rate is also approximately 10 minutes, we can expect roughly one fix per individual per timegroup.

However, I have sometimes noticed fixes falling slightly more or less than 10 minutes apart, which leaves us with multiple fixes per individual in a single timegroup. spatsoc warns us about this:

Warning message:
In spatsoc::edge_dist(DT = test, threshold = 200, id = "id", coords = c("x",  :
  found duplicate id in a timegroup and/or splitBy - does your group_times threshold match the fix rate?

More generally, there will often be cases where the fix rate doesn't match timeThreshold.

I wanted to investigate what happens in that case. Does the code break? Does it return an edge only when both fixes for the duplicated individual fall within the distance threshold of the other individual? What if one does and one doesn't?

Desired behavior: when a timegroup has more than one fix for individual A and one fix for individual B, return an edge between A and B if any of A's fixes in that timegroup is within the distance threshold of B. The edge list may contain multiple A-B edges for that timegroup, but calcSRI should treat them as a single binary TRUE/FALSE: were A and B associated in that timegroup at all? I think this is what the code already does, but I want to demonstrate it for sure. After that, the behavior needs to be better documented.

Process to document behavior:

  1. I grabbed three points from Google Maps and manually converted their lat/long coordinates to UTM. I don't recommend this, but I needed a fast way to get three points within reasonable distances of each other.
  2. I used these to create sample data and ran spatsoc::edge_dist on it directly:
# Two fixes for individual "a" and one fix for individual "b", all in timegroup 1
test <- data.frame(
  x = c(365739.37, 365828.47, 365740.47),
  y = c(3768316.02, 3768375.20, 3768315.89),
  id = c("a", "a", "b"),
  timegroup = 1
)
data.table::setDT(test)

# Run edge_dist to observe behavior, and save edges
edges <- spatsoc::edge_dist(DT = test, threshold = 200, id = "id", coords = c("x", "y"), timegroup = "timegroup")

# As expected, we get a warning. That's ok!
# Warning message:
# In spatsoc::edge_dist(DT = test, threshold = 200, id = "id", coords = c("x",  :
#  found duplicate id in a timegroup and/or splitBy - does your group_times threshold match the fix rate?

edges # observe that both edges are included, not just one
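
As a sanity check on what edge_dist should find here, the pairwise distances can be computed in base R. This is only a sketch: it reuses the three points above under a different name (`pts`) so that `test` is not overwritten, `within_any` is a hypothetical helper that is not part of spatsoc or vultureUtils, and the 50 m and 1 m thresholds are extra assumptions added to illustrate the "one does and one doesn't" case.

```r
# Same three points as in the example above, kept as a plain data.frame
pts <- data.frame(
  x = c(365739.37, 365828.47, 365740.47),
  y = c(3768316.02, 3768375.20, 3768315.89),
  id = c("a", "a", "b")
)

# Euclidean distances between all fixes (meters, since coordinates are UTM)
d <- as.matrix(dist(pts[, c("x", "y")]))

# Distance from each of a's two fixes to b's single fix
a_to_b <- d[pts$id == "a", pts$id == "b"]
round(a_to_b, 1)

# Hypothetical helper: is ANY of a's fixes within the threshold of b?
within_any <- function(distances, threshold) any(distances <= threshold)

within_any(a_to_b, 200) # both of a's fixes are within 200 m of b
within_any(a_to_b, 50)  # only one of a's fixes is within 50 m, still TRUE
within_any(a_to_b, 1)   # neither fix is within 1 m
```

Note that with threshold = 200 both of a's fixes are close to b, so the example above does not by itself exercise the mixed case; a smaller threshold would.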

Now jump over to spaceTimeGroups and run the code that removes self and reverse edges:

# Remove self edges (a-a) and reverse duplicates (b-a when a-b is already present)
library(magrittr) # provides %>%
edges <- edges %>%
  dplyr::filter(as.character(.data$ID1) < as.character(.data$ID2))
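
To make the effect of that filter concrete, here is a base R sketch on a made-up edge list. `toy_edges` is hypothetical: it is my assumption about the kinds of rows that can appear when one individual has two fixes in a timegroup, not actual edge_dist output.

```r
# Hypothetical edge list for one timegroup where "a" has two fixes:
# a-a rows (a's two fixes paired with each other), plus a-b and b-a rows
toy_edges <- data.frame(
  timegroup = 1,
  ID1 = c("a", "a", "a", "a", "b", "b"),
  ID2 = c("a", "a", "b", "b", "a", "a")
)

# Base R equivalent of the dplyr filter above: keep rows where ID1 sorts before ID2
kept <- toy_edges[as.character(toy_edges$ID1) < as.character(toy_edges$ID2), ]
kept

# Self edges (a-a) and reverse edges (b-a) are dropped; both a-b rows remain
nrow(kept)
```

The filter removes self and reverse edges but does not collapse repeated a-b rows, so any deduplication has to happen downstream.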

Now try running calcSRI on this:

dfSRI <- calcSRI(dataset = test, edges = edges, idCol = "id")
dfSRI # as expected, the SRI for a and b is 1, unaffected by the multiple edges
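
One way to picture why the repeated edges don't inflate the result is to collapse the edge list to one row per dyad per timegroup before counting. This is a sketch of that idea only; I'm not claiming it is how calcSRI is implemented, and `toy_edges` is a hypothetical stand-in for the filtered edge list.

```r
# Hypothetical filtered edge list: a-b appears twice in timegroup 1
# because "a" had two fixes in that timegroup
toy_edges <- data.frame(
  timegroup = c(1, 1),
  ID1 = c("a", "a"),
  ID2 = c("b", "b")
)

# Collapse to one row per dyad per timegroup (binary: associated or not)
binary_edges <- unique(toy_edges[, c("timegroup", "ID1", "ID2")])

# Number of timegroups in which a and b were associated
n_together <- nrow(binary_edges)
n_together
```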

This is good! It seems the only remaining task is to document this behavior better.