tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org
Other
670 stars 57 forks source link

Correctly handle `NA` column names from `pivot_wider()` #395

Closed markfairbanks closed 2 years ago

markfairbanks commented 2 years ago

Closes #394

Pretty sure this is failing checks due to things updated in #393. So we can wait to merge until that is ready to go.

markfairbanks commented 2 years ago

FYI checks are passing now that #393 is merged (so we're definitely good for review).

eutwt commented 2 years ago

I don't follow the numeric substitution logic. I don't think I'd expect step_setnames(lazdt, c(4, NA), c('s', 'j')) to rename the second column toj. Is it related to the same issue that we'd want that behavior?

Do we even need to change step_setnames? It seems that pivot_wider() already attempts to replicate the data.table varnames created by dcast by creating new_vars, it's just wrong in this edge case. If we add

new_vars <- vctrs::vec_assign(new_vars, is.na(new_vars), "NA") 

right after this:

https://github.com/tidyverse/dtplyr/blob/55e929881ceef77f1a6682a064a0ee21d6f068e8/R/step-call-pivot_wider.R#L69-L75

Does that solve the issue?

markfairbanks commented 2 years ago

I don't follow the numeric substitution logic. I don't think I'd expect step_setnames(lazdt, c(4, NA), c('s', 'j')) to rename the second column toj. Is it related to the same issue that we'd want that behavior?

Hmm now that you mention it it is probably better to just ignore the numeric case in replace_setnames_na(). Not sure why I thought it was necessary to build in 😅

markfairbanks commented 2 years ago

@eutwt ready for review again.

Do we even need to change step_setnames?

I think it's a good idea to fix this in step_setnames() so that it is fixed in all cases in the future. If you still think the fix should be in pivot_wider() let me know, I'm open to discuss more.

eutwt commented 2 years ago

In my mind it would be cleaner if the step_setnames function took valid old/new names and just applied them. It feels a little unintuitive that it would be changing your old/new names for you, possibly increasing confusion if we somehow have unintentionally missing values in input vectors, see examples below.

But, this is an extremely minor point, up to opinion, and the function is internal anyway. I think this is fine to merge as-is

Behavior comparison Main: ``` r new_name <- names(ToothGrowth)[6] # missing by mistaken index step_setnames( lazy_dt(ToothGrowth), "dose", new_name, in_place = FALSE ) #> Error: NA in 'new' at positions [1] ``` ``` r old_name <- names(ToothGrowth)[6] # missing by mistaken index step_setnames( lazy_dt(ToothGrowth %>% mutate("NA" = 3)), old_name, "new", in_place = FALSE ) #> Error: NA in 'new' at positions [1] ``` This PR: ``` r new_name <- names(ToothGrowth)[6] # missing by mistaken index step_setnames( lazy_dt(ToothGrowth), "dose", new_name, in_place = FALSE ) #> Source: local data table [60 x 3] #> Call: setnames(`_DT1`, "dose", "NA") #> #> len supp `NA` #> #> 1 4.2 VC 0.5 #> 2 11.5 VC 0.5 #> 3 7.3 VC 0.5 #> 4 5.8 VC 0.5 #> 5 6.4 VC 0.5 #> 6 10 VC 0.5 #> # … with 54 more rows #> #> # Use as.data.table()/as.data.frame()/as_tibble() to access results ``` ``` r old_name <- names(ToothGrowth)[6] # missing by mistaken index step_setnames( lazy_dt(ToothGrowth %>% mutate("NA" = 3)), old_name, "new", in_place = FALSE ) #> Source: local data table [60 x 4] #> Call: setnames(`_DT1`, "NA", "new") #> #> len supp dose new #> #> 1 4.2 VC 0.5 3 #> 2 11.5 VC 0.5 3 #> 3 7.3 VC 0.5 3 #> 4 5.8 VC 0.5 3 #> 5 6.4 VC 0.5 3 #> 6 10 VC 0.5 3 #> # … with 54 more rows #> #> # Use as.data.table()/as.data.frame()/as_tibble() to access results ```
markfairbanks commented 2 years ago

It feels a little unintuitive that it would be changing your old/new names for you, possibly increasing confusion if we somehow have unintentionally missing values in input vectors

Wouldn't any workaround we build run into this issue? (like the one mentioned in here)

eutwt commented 2 years ago

Basically I see this as being due to the new_vars vector being mis-constructed, not due to some problem in how the new names are applied. Concretely, the example below still errors in this PR, because the NA->"NA" is done in step_setnames() which isn't used.

lazy_dt(tibble(x = c('a', NA), y = 2)) %>% 
  pivot_wider(names_from = 'x', values_from = 'y')
markfairbanks commented 2 years ago

Ah yep, that makes sense.