markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
450 stars 32 forks source link

Column types must be consistent for each group when NA #609

Closed jfdesomzee closed 2 years ago

jfdesomzee commented 2 years ago

Hello,

When summarising rows with condition, I have sometimes error because of NA condition

iris %>% 
  tidytable::as_tidytable() %>% 
  tidytable::mutate(SL=ifelse(Species=="setosa",NA,Sepal.Length)) %>% 
  tidytable::summarise(SP_Large=sum(base::ifelse(default(SL)>5,Sepal.Length,0)),
                       .by=Species)

This is returning an error while dplyr equivalent is not. image

iris %>% 
    dplyr::as_tibble() %>% 
    dplyr::mutate(SL=ifelse(Species=="setosa",NA,Sepal.Length)) %>% 
    group_by(Species) %>% 
    dplyr::summarise(SP_Large=sum(base::ifelse(default(SL)>5,Sepal.Length,0)),
                         .by=Species)

image

It's easy to adapt the code to avoid the error but I think dplyr behavior makes more sense.

markfairbanks commented 2 years ago

This is actually a quirk of data.table, and not something I can fix unfortunately. I would just wrap your output in as.double() to avoid this.

Here's a stack overflow example.

However it looks like this might be fixed in the development version of data.table as this no longer errors on my machine. I couldn't find a NEWS bullet on their github repo, but it might be worth downloading that and see if it works for you.

library(tidytable, warn.conflicts = FALSE)
#> As of tidytable v0.9.0 dotless versions of functions are exported. You can now use `arrange()`/`mutate()`/etc. directly.

iris %>% 
  as_tidytable() %>% 
  mutate(SL=ifelse(Species=="setosa",NA,Sepal.Length)) %>% 
  summarise(SP_Large=sum(base::ifelse(Sepal.Length > 5, Sepal.Length, 0L)),
            .by=Species)
#> # A tidytable: 3 × 2
#>   Species    SP_Large
#>   <fct>         <dbl>
#> 1 setosa         117.
#> 2 versicolor     282.
#> 3 virginica      324.
jfdesomzee commented 2 years ago

Well this would actually be worst for me as the value of SP_Large is not NA for setosa. In case the condition of the ifelse is NA it should not return the TRUE expression.

markfairbanks commented 2 years ago

The example I gave is different than the one you gave because I don't have your default() function.

The values should return correctly when running your example.

jfdesomzee commented 2 years ago

Oh sorry my mistake. The default function is actually the fix and should not have been in the example. With the default function tidytable and dplyr return both the same. The issue is without the default function; In that case dplyr returns NA while tidytable returns the error.

markfairbanks commented 2 years ago

Ah gotcha.

So I can reproduce your error now:

library(tidytable, warn.conflicts = FALSE)
#> As of tidytable v0.9.0 dotless versions of functions are exported. You can now use `arrange()`/`mutate()`/etc. directly.

iris %>% 
  as_tidytable() %>% 
  mutate(SL=ifelse(Species=="setosa",NA,Sepal.Length)) %>% 
  summarise(SP_Large=sum(base::ifelse(SL>5,Sepal.Length,0)),
            .by=Species)
#> Error in `[.data.table`(~.df, , .(SP_Large = sum(base::ifelse(SL > 5, : Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.

and this would be the recommended fix:

library(tidytable, warn.conflicts = FALSE)
#> As of tidytable v0.9.0 dotless versions of functions are exported. You can now use `arrange()`/`mutate()`/etc. directly.

iris %>% 
  as_tidytable() %>% 
  mutate(SL=ifelse(Species=="setosa",NA,Sepal.Length)) %>% 
  summarise(SP_Large=as.double(sum(base::ifelse(SL>5,Sepal.Length,0))),
            .by=Species)
#> # A tidytable: 3 × 2
#>   Species    SP_Large
#>   <fct>         <dbl>
#> 1 setosa          NA 
#> 2 versicolor     282.
#> 3 virginica      324.

The error is caused in the execution engine of data.table, and unfortunately there's no way for me to build in a workaround.

jfdesomzee commented 2 years ago

Thanks. This is indeed another work arround. I hope the condition NA won't happens to often in my code and I'll adpat using default or as.double depending on the cases.

markfairbanks commented 2 years ago

Also I would highly recommend using the reprex package to help with reproducible examples.

https://reprex.tidyverse.org

Basic steps: 1) Type out your example code (including any calls to library()) 2) Copy it to your clipboard (like you're going to paste it somewhere) 3) Run reprex::reprex() 4) Paste into the github issue (it will be automatically formatted to markdown with code output)

It also makes sure your examples will be reproducible on my end. For example - it would have caught that the default() function wasn't there. It runs everything in a "clean" environment.

If you run into any issues let me know.

markfairbanks commented 2 years ago

Feel free to give reprex a test run in this issue if you want.