matiasandina / ggethos

https://matiasandina.github.io/ggethos/
Other
1 stars 1 forks source link

`NA`s and `rle()` #11

Closed matiasandina closed 1 year ago

matiasandina commented 2 years ago

from ?rle

Missing values are regarded as unequal to the previous value, even if that is also **missing**

This is potentially a problem with aggregating NAs. We don't see this most of the time because the default is to filter NAs out.

Potential options are:

matiasandina commented 1 year ago

This seems to address #12 and #6. Maybe there's an issue with the handling of factors/datetimes in different versions of ggplot2 (and I might be running different version in different machines). I can confirm this is working on version 3.4.0 while errors where experienced using 3.3.5.

A remaining issue in this PR is the strategy of using split on the y axis. When using anything other than wombat in the y (e.g., behaviour, trial), this would group wombats together (unless using facets).

Still, when we separate using facet_wrap(), the behaviors are bound together from min to max.

ggplot(aes(y=behaviour, 
           color=factor(behaviour), 
           behaviour=factor(behaviour))) + 
  geom_ethogram() +
  facet_wrap(wombat~trial)

I think this is tied to the way the data is split and the function calls. I will show what I think it's problematic with a simpler example from gimli dataset.

gimli <- wombats |>
  filter(wombat == "gimli")

lapply(split(gimli, gimli$behaviour),
       vctrs::vec_identify_runs)

$digging
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48
attr(,"n")
[1] 48

$pondering
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79
attr(,"n")
[1] 79

$snuffling
[1] 1 2 3 4 5
attr(,"n")
[1] 5

We later call min max, which will give us something akin to calling

lapply(split(gimli, gimli$behaviour),
       function(tt) range(vctrs::vec_identify_runs(tt)))

$digging
[1]  1 48

$pondering
[1]  1 79

$snuffling
[1] 1 5

And this is exactly what we get with the geom_ethogram() call

ggplot(gimli, aes(y=behaviour, behaviour=behaviour, color=behaviour)) + geom_ethogram()

image

I think we should probably consider switching to using vec_identify_runs() or something like data.table::rleid or the new dplyr::consecutive_id() with this approach.

gimli %>% 
  mutate(run_id = vctrs::vec_identify_runs(behaviour)) %>% 
  group_by(run_id) %>% 
  summarise(behaviour=unique(behaviour), x=first(frame), xend=last(frame)) %>% 
  ggplot(aes(x=x, xend=xend, y=behaviour, behaviour=behaviour)) + geom_ethogram()

# Will produce same as above
gimli %>% 
mutate(run_id = data.table::rleid(behaviour)) %>% 
group_by(run_id) %>% 
summarise(behaviour=unique(behaviour), x=first(frame), xend=last(frame)) %>% 
ggplot(aes(x=x, xend=xend, y=behaviour, behaviour=behaviour)) + geom_ethogram()

image

There are other issues with this approach (i.e., we have to be cautious about grouping the data correctly before calling summarise), but I think that the math being explicitly done is less prone to unexpected behavior.

I will give it a try soon to see if we could have a working alternative solution.

matiasandina commented 1 year ago

@wilkox pushed a dplyr branch, could you please check it? I know it adds dependencies but I think it might be more robust help us move forward with a release. Tests are failing because I didn't change the files, I might run it later on merge if you think the changes are OK.

matiasandina commented 1 year ago

The dplyr branch is currently missing xend = last(x) + sampling_interval to account for the period between the last observation and the first observation of the next behavior, when computing ethograms.

 data <- data %>% dplyr::mutate(x = seq_along(group)) %>% 
      dplyr::mutate(run_id = vctrs::vec_identify_runs(behaviour)) %>% 
      dplyr::group_by(group, run_id) %>% dplyr::summarise(behaviour = unique(behaviour), 
      xend = last(x), x = first(x), y = unique(y), yend = unique(yend), 
      PANEL = unique(PANEL), colour = unique(colour), 
      .groups = "keep")

It produces "cute" behavior by leaving black the space between the last observation of one behavior and the next one

image

But I think we should be accurate and add the sampling interval to the last

matiasandina commented 1 year ago

Given #16, I think this is behaving now and we should move forward :smile: