markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
449 stars 33 forks source link

`group_by` + equality `filter` on column with `NA` values introduces `NA` rows #812

Closed geoffreyyip closed 2 months ago

geoffreyyip commented 2 months ago

When equality filtering on a column with NA values, doing a group_by operation prior to filtering can introduce rows with NA values. This is unexpected since tidyverse does not introduce NA values

Note: I use str() to print the output, since tidytable currently doesn't print in reprex() See Issue #810

library(tidytable, tidyverse)
#> 
#> Attaching package: 'tidytable'
#> The following objects are masked from 'package:stats':
#> 
#>     dt, filter, lag
#> The following object is masked from 'package:base':
#> 
#>     %in%

#------------------------------------------------------------------------------
# The tidyverse version GOOD:  No NAs introduced
#------------------------------------------------------------------------------
t1 <- dplyr::tribble(
  ~id, ~status,
  1, 1,
  4, 0,
  7, NA,
)

output1 <- t1 |>
  dplyr::group_by(id) |>
  dplyr::filter(status == 1)

str(output1)
#> gropd_df [1 × 2] (S3: grouped_df/tbl_df/tbl/data.frame)
#>  $ id    : num 1
#>  $ status: num 1
#>  - attr(*, "groups")= tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ id   : num 1
#>   ..$ .rows: list<int> [1:1] 
#>   .. ..$ : int 1
#>   .. ..@ ptype: int(0) 
#>   ..- attr(*, ".drop")= logi TRUE

#------------------------------------------------------------------------------
# The tidytable version BAD: An extra NA row is introduced with grouped + NA values
#------------------------------------------------------------------------------

t2 <- tidytable::tribble(
  ~id, ~status,
  1, 1,
  4, 0,
  7, NA,
)

output2 <- t2 |>
  tidytable::group_by(id) |>
  tidytable::filter(status == 1)

str(output2)
#> Classes 'grouped_tt', 'tidytable', 'data.table' and 'data.frame':    2 obs. of  2 variables:
#>  $ id    : num  1 NA
#>  $ status: num  1 NA
#>  - attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "groups")= chr "id"

#------------------------------------------------------------------------------
# The tidytable version GOOD: Without the group_by, it works fine
#------------------------------------------------------------------------------

t3 <- tidytable::tribble(
  ~id, ~status,
  1, 1,
  4, 0,
  7, NA,
)

output3 <- t3 |>
  # tidytable::group_by(id) |> 
  tidytable::filter(status == 1)

str(output3)
#> Classes 'tidytable', 'data.table' and 'data.frame':  1 obs. of  2 variables:
#>  $ id    : num 1
#>  $ status: num 1
#>  - attr(*, ".internal.selfref")=<externalptr>

#------------------------------------------------------------------------------
# The tidytable version GOOD: Without the NA in the table, it also works fine
#------------------------------------------------------------------------------

t4 <- tidytable::tribble(
  ~id, ~status,
  1, 1,
  4, 0,
  # 7, NA,
)

output4 <- t4 |>
  tidytable::group_by(id) |>
  tidytable::filter(status == 1)

str(output4)
#> Classes 'grouped_tt', 'tidytable', 'data.table' and 'data.frame':    1 obs. of  2 variables:
#>  $ id    : num 1
#>  $ status: num 1
#>  - attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "groups")= chr "id"

Created on 2024-07-01 with reprex v2.1.0

markfairbanks commented 2 months ago

Hmm that's odd I'll take a look

markfairbanks commented 2 months ago

All fixed - thanks for catching this