tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org
Other
670 stars 57 forks source link

By early grouping fix #443

Closed eutwt closed 1 year ago

eutwt commented 1 year ago

Closes #439

Leaving in draft status for now because I don't understand why this fixes the issue

markfairbanks commented 1 year ago

So I have no idea why this is occurring (or why step_group() fixes it), but FYI this also happens with other functions.

pacman::p_load(dtplyr, dplyr)

iris %>%
  lazy_dt() %>%
  select(Species, Sepal.Length) %>%
  filter(Sepal.Length <= mean(Sepal.Length),
         .by = Species)
#> Source: local data table [82 x 3]
#> Call:
#>   _DT2 <- `_DT1`[, .(Species, Sepal.Length), keyby = .(Species)]
#>   `_DT2`[`_DT2`[, .I[Sepal.Length <= mean(Sepal.Length)], by = .(Species)]$V1]
#> 
#>   Species Species Sepal.Length
#>   <fct>   <fct>          <dbl>
#> 1 setosa  setosa           4.9
#> 2 setosa  setosa           4.7
#> 3 setosa  setosa           4.6
#> 4 setosa  setosa           5  
#> 5 setosa  setosa           4.6
#> 6 setosa  setosa           5  
#> # ℹ 76 more rows
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
markfairbanks commented 1 year ago

This has to be something weird about R's copy-on-modify semantics, but it's way over my head what could be happening. It makes no sense to me that it would alter the object for the prior step.

markfairbanks commented 1 year ago

So basically the error is caused by something weird in R's copy-on-modify semantics. As far as I know, it shouldn't be altering the object for the prior step when using parent$group <-, but it does. No idea why.

step_group() fixes it because it uses new_step(), which is creating a "new" dtplyr_step object from scratch. So the weird copy-on-modify behavior is skipped.

That being sad - I think we just need to switch out our uses of object$group throughout the code base to ensure we're not messing this up in other places.

eutwt commented 1 year ago

That makes sense. Closing this PR since there's a larger issue

eutwt commented 1 year ago

@markfairbanks I don't think it is actually to do with copy-on-modify, the object is only modified within the scope of the function. But, within the scope of the function is all that matters, because that's what's passed to new_step, whose output determines the data.table expression. So when dt_call is recursing down the steps to create the expression, the parent which has been modified gets evaluated by dt_call.dtplyr_step_subset (in this case, since it's a select step) as a step with groups, because they were added. (But, the output of step_group is a different class and never evaluated by dt_call.dtplyr_step_subset)

markfairbanks commented 1 year ago

Ah yep I see what you mean. The issue is still a bit over my head though 😅

I might have some time to revisit in the future, but hopefully #445 covers what we need for now.

eutwt commented 1 year ago

Yeah, the fix is the same regardless

markfairbanks commented 1 year ago

Ah great, good to know.