`NA`s and `rle()` - Githubissues

matiasandina commented 2 years ago

from ?rle

Missing values are regarded as unequal to the previous value, even if that is also **missing**

This is potentially a problem with aggregating NAs. We don't see this most of the time because the default is to filter NAs out.

Potential options are:

scale_y_discrete (see here) has na.translate and na.value If na.translate = TRUE, what aesthetic value should the missing values be displayed as? Does not apply to position scales where NA is always placed at the far right. Maybe we can tap into that and coerce values into something usable
Coerce NA to something manually? Not sure this would solve the issue though or be better than ggplot internals.
Use some hack like rle(is.na(c("A", "A", NA_character_, NA_character_, "R"))) to reconstruct the instances where whe have NAs. Could solve it, but likely to add more buggy indexing
Use this rle function instead https://rdrr.io/github/earowang/mists/man/na-rle.html
- Pros: It might solve the issue with minimal problems
- Cons: Adding a dependency from a package that might not be extensively mantained
Change to dplyr
- Pros: Way more flexible. We re-write part of the code and improve readability. We could provide a function for people to calculate what they want before calling geom_ethogram(). Something like ggethos::calculate_ethogram(data) %>% ggplot(aes(...)) + geom_segment()
- Cons: A lot more dependencies.

matiasandina commented 1 year ago

This seems to address #12 and #6. Maybe there's an issue with the handling of factors/datetimes in different versions of ggplot2 (and I might be running different version in different machines). I can confirm this is working on version 3.4.0 while errors where experienced using 3.3.5.

A remaining issue in this PR is the strategy of using split on the y axis. When using anything other than wombat in the y (e.g., behaviour, trial), this would group wombats together (unless using facets).

Still, when we separate using facet_wrap(), the behaviors are bound together from min to max.

ggplot(aes(y=behaviour, 
           color=factor(behaviour), 
           behaviour=factor(behaviour))) + 
  geom_ethogram() +
  facet_wrap(wombat~trial)

I think this is tied to the way the data is split and the function calls. I will show what I think it's problematic with a simpler example from gimli dataset.

gimli <- wombats |>
  filter(wombat == "gimli")

lapply(split(gimli, gimli$behaviour),
       vctrs::vec_identify_runs)

$digging
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48
attr(,"n")
[1] 48

$pondering
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79
attr(,"n")
[1] 79

$snuffling
[1] 1 2 3 4 5
attr(,"n")
[1] 5

We later call min max, which will give us something akin to calling

lapply(split(gimli, gimli$behaviour),
       function(tt) range(vctrs::vec_identify_runs(tt)))

$digging
[1]  1 48

$pondering
[1]  1 79

$snuffling
[1] 1 5

And this is exactly what we get with the geom_ethogram() call

ggplot(gimli, aes(y=behaviour, behaviour=behaviour, color=behaviour)) + geom_ethogram()

I think we should probably consider switching to using vec_identify_runs() or something like data.table::rleid or the new dplyr::consecutive_id() with this approach.

gimli %>% 
  mutate(run_id = vctrs::vec_identify_runs(behaviour)) %>% 
  group_by(run_id) %>% 
  summarise(behaviour=unique(behaviour), x=first(frame), xend=last(frame)) %>% 
  ggplot(aes(x=x, xend=xend, y=behaviour, behaviour=behaviour)) + geom_ethogram()

# Will produce same as above
gimli %>% 
mutate(run_id = data.table::rleid(behaviour)) %>% 
group_by(run_id) %>% 
summarise(behaviour=unique(behaviour), x=first(frame), xend=last(frame)) %>% 
ggplot(aes(x=x, xend=xend, y=behaviour, behaviour=behaviour)) + geom_ethogram()

There are other issues with this approach (i.e., we have to be cautious about grouping the data correctly before calling summarise), but I think that the math being explicitly done is less prone to unexpected behavior.

I will give it a try soon to see if we could have a working alternative solution.

matiasandina commented 1 year ago

@wilkox pushed a dplyr branch, could you please check it? I know it adds dependencies but I think it might be more robust help us move forward with a release. Tests are failing because I didn't change the files, I might run it later on merge if you think the changes are OK.

matiasandina commented 1 year ago

The dplyr branch is currently missing xend = last(x) + sampling_interval to account for the period between the last observation and the first observation of the next behavior, when computing ethograms.

 data <- data %>% dplyr::mutate(x = seq_along(group)) %>% 
      dplyr::mutate(run_id = vctrs::vec_identify_runs(behaviour)) %>% 
      dplyr::group_by(group, run_id) %>% dplyr::summarise(behaviour = unique(behaviour), 
      xend = last(x), x = first(x), y = unique(y), yend = unique(yend), 
      PANEL = unique(PANEL), colour = unique(colour), 
      .groups = "keep")

It produces "cute" behavior by leaving black the space between the last observation of one behavior and the next one

But I think we should be accurate and add the sampling interval to the last

matiasandina commented 1 year ago

Given #16, I think this is behaving now and we should move forward :smile:

matiasandina / ggethos

`NA`s and `rle()` #11