markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
450 stars 32 forks source link

Deprecate `verb.()` syntax? #617

Closed markfairbanks closed 1 year ago

markfairbanks commented 2 years ago

Now that verb() syntax is exported, should verb.() syntax be deprecated or kept in the package?

Naturally it would be a soft deprecation with a long deprecation lifecycle so it won't break old scripts. It would also be pretty easy to update old scripts with a simple find and replace of .( with (

One noteworthy negative of keeping both - R CMD check is not a fan of both syntaxes having S3 methods exported as it confuses verb.() syntax as an "empty class" method (#612). Currently verb() functions cannot be exported as S3 generics, limiting other packages extending tidytable.

Edit: Probably the easiest thing to do would be to direct people to using box if they need to mix tidytable and dplyr syntax.

mattsams89 commented 2 years ago

I'd definitely err on the side of very slow deprecation with large warnings. Haha. One of our analysts upstairs texted me after the 0.9 release...he definitely didn't read the news file and had all of his scripts break since he p_load()s tidytable and tidyverse (not sure on the order). I think he's similar to me in that he goes back and forth on unnest_ variants, etc. based on how well data.table deals with his underlying data structure (providers love lists of lists of lists of...). I don't think he namespaces any of his stuff, which definitely compounded the issue.

I took a look at box and mentioned it to him as well. That seems like a potential solution, but going back through our code bases and checking what needs to be called and where is definitely going to be a time sink. Obviously only two people in the sea of package users, but figured I'd share our experiences regardless.

markfairbanks commented 2 years ago

@mattsams89 Yep this is all definitely worth considering.

he definitely didn't read the news file

To be honest I doubt the average R user ever reads a news file (or even knows they exist). That was one of the reasons for the new startup message.

library(tidytable)
#> As of tidytable v0.9.0 dotless versions of functions are exported.
#> You can now use `arrange()`/`mutate()`/etc. directly.

had all of his scripts break since he p_load()s tidytable and tidyverse

This is unfortunately a failure point of the update. If you use library() there is a warning message if you load tidytable after dplyr. This gets silenced if you use pacman::p_load() (which I use as well but I can't work around).

library(dplyr, warn.conflicts = FALSE)
library(tidytable, warn.conflicts = FALSE)
#> Warning: tidytable was loaded after dplyr.
#> This can lead to most dplyr functions being overwritten by tidytable functions.

I think he's similar to me in that he goes back and forth on unnest_ variants

This is also a goal of mine to update (see #631). Hopefully that eases the transition a bit.

Obviously only two people in the sea of package users, but figured I'd share our experiences regardless.

Thanks for the input as always 😄

psimm commented 2 years ago

Deprecating the verb.() syntax would wreak havoc on dozens of scripts at my organization. They use a mix of dplyr for fetching data from SQL databases via dbplyr and tidytable for manipulating data in memory.

The addition of the dotless functions already creates some headache that we plan to resolve by using conflicted::conflict_prefer("filter", "dplyr") and similar from the conflicted package.

Without the verb.() functions we'd have to always write out the package name, i.e. tidytable::filter() %>% tidytable::arrange(), which is cumbersome.

markfairbanks commented 2 years ago

@psimm thanks for the input.

I think dbplyr is one of the biggest reasons I'm leaning towards keeping the verb.() syntax.

One thing I've considered - if I do deprecate verb.(), what if I exported a function that would allow you to easily alias packages? Something like import_as()?

Edit: More robust import_as()

pacman::p_load(rlang, tidytable)

import_as <- function(package, alias) {
  package <- as_name(enexpr(package))

  funs <- set_names(getNamespaceExports(package))
  funs <- map(funs, ~ eval(call2("::", sym(package), sym(.x))))
  funs <- as_environment(funs)

  env_bind(caller_env(), {{ alias }} := funs)
}

library(dbplyr)
import_as(dplyr, dp)

df <- memdb_frame(x = 1:3, y = 1:3, z = c("a", "a", "b"))

df %>%
  dp$select(x, y, z) %>%
  dp$mutate(double_x = x * 2) %>%
  dp$collect() %>%
  summarize(mean_x = mean(x), .by = z)
#> # A tidytable: 2 × 2
#>   z     mean_x
#>   <chr>  <dbl>
#> 1 a        1.5
#> 2 b        3
psimm commented 2 years ago

One thing I've considered - if I do deprecate verb.(), what if I exported a function that would allow you to easily alias packages? Something like import_as()?

I like your import_as() because I prefer Python's import system over R's in general. It seems a little odd to add it to tidytable though, because it's a general package management function rather than a data manipulation function.

Have you seen the package import (https://import.rticulate.org)? It gives fine-grained control over which functions to import from where and makes it easy to rename functions. Perhaps it could be recommended to tidytable users.

markfairbanks commented 2 years ago

Hmm yep I agree, it would be a little bit of an odd fit for tidytable. But it also seems like a good idea to have a solution within tidytable that people can use. Not sure what the best option is.

Maybe I could export import_as() and recommend people to use import or box for more complicated use cases.

statzhero commented 2 years ago

I just saw the changelog 0.9.0. Seems major news to me. I'm slightly worried that no dot verbs will get confusing with the multiple packages using the same verb but I understand it may make more sense overall. This reminds me: what are your thoughts on tidytable versus dtplyr?

So far I've been mixing dplyr and the verb.() syntax but mostly because I was using it interactively. Second, apart from some missing functions, the main difference I found in practice is to use groups per function e.g. dplyr::mutate vs dtplyr::mutate vs tidytable::mutate(.by = var). The latter is sometimes quite handy, I even want to say some development version of dplyr::mutate can do it as well.

Another important reason I often stick with dplyr is the tidylog package. I don't know if there's anything tidytable can do to encourage similar feedback, especially for joins.

markfairbanks commented 2 years ago

I'm slightly worried that no dot verbs will get confusing with the multiple packages using the same verb but I understand it may make more sense overall.

There were a few parts to the decision: 1) tidytable now covers the vast majority of functions and function arguments that are in dplyr/tidyr. The original point of the dot was so you could mix dplyr and tidytable syntax - but there is much less of a need for that now. At this point the biggest need for dplyr syntax would be in using dbplyr. 2) tidytable now has a simple explanation - "load tidytable and your existing dplyr/tidyr code will run faster" 3) I had many people reach out saying the verb.() could get confusing because they weren't always positive what functions were base R, tidyselect, tidytable, etc. So users were trying to use things like mean.() or starts_with.() and constantly running into errors. Now users don't have to worry about that.

the main difference I found in practice is to use groups per function e.g. dplyr::mutate vs dtplyr::mutate vs tidytable::mutate(.by = var)

FYI tidytable now offers both of these options for grouping:

library(tidytable, warn.conflicts = FALSE)
#> As of tidytable v0.9.0 dotless versions of functions are exported.
#> You can now use `arrange()`/`mutate()`/etc. directly.

df <- tidytable(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3)

df %>%
  summarize(avg_z = mean(z), .by = c(x, y))
#> # A tidytable: 2 × 3
#>   x     y     avg_z
#>   <chr> <chr> <dbl>
#> 1 a     a       1.5
#> 2 b     b       3

df %>%
  group_by(x, y) %>%
  summarize(avg_z = mean(z)) %>%
  ungroup()
#> # A tidytable: 2 × 3
#>   x     y     avg_z
#>   <chr> <chr> <dbl>
#> 1 a     a       1.5
#> 2 b     b       3

This reminds me: what are your thoughts on tidytable versus dtplyr?

I'm actually a co-author of dtplyr now and have done a good amount of the updates over the last year or so.

I personally use tidytable because it has more functions/more function arguments. You also don't need to worry about using lazy_dt()/collect().

But dtplyr has some advantages too - it can show you translations and integrates much better with the broader tidyverse (e.g. dbplyr).

From there it's sort of up to you which one to choose.

Another important reason I often stick with dplyr is the tidylog package.

Extension packages will always be an advantage of the tidyverse. There is a much broader user base so more people will make packages that utilize dplyr/tidyr 🤷‍♂️

Hope this helps! If you have any other questions let me know 😄

statzhero commented 2 years ago

That’s great, thank you. So under the hood dtplyr and tidytable are the same, in general terms?

I suppose another solution is to use say ‘tidytable::complete()’ explicitly if there are only a few steps that need speeding up and where ‘tidyverse’ is particularly slow. This also happened to me.

On Oct 18, 2022, at 23:31, Mark Fairbanks @.***> wrote:

 I'm slightly worried that no dot verbs will get confusing with the multiple packages using the same verb but I understand it may make more sense overall.

There were a couple parts to the decision:

tidytable now covers the vast majority of functions and function arguments that are in dplyr/tidyr. The original point of the dot was so you could mix dplyr and tidytable syntax - but there is much less of a need for that now. At this point the biggest need for dplyr syntax would be in using dbplyr. tidytable now has a simple "load tidytable and your existing dplyr/tidyr code will run faster". I had many people reach out saying the verb.() could get confusing because they weren't always positive what functions were base R, tidyselect, tidytable, etc. So users were trying to use things like mean.() or starts_with.() and constantly running into errors. This greatly reduces that issue. the main difference I found in practice is to use groups per function e.g. dplyr::mutate vs dtplyr::mutate vs tidytable::mutate(.by = var)

FYI tidytable now offers both of these options for grouping:

library(tidytable, warn.conflicts = FALSE)

> As of tidytable v0.9.0 dotless versions of functions are exported.

> You can now use arrange()/mutate()/etc. directly.

df <- tidytable(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3)

df %>% summarize(avg_z = mean(z), .by = c(x, y))

> # A tidytable: 2 × 3

> x y avg_z

>

> 1 a a 1.5

> 2 b b 3

df %>% group_by(x, y) %>% summarize(avg_z = mean(z)) %>% ungroup()

> # A tidytable: 2 × 3

> x y avg_z

>

> 1 a a 1.5

> 2 b b 3

This reminds me: what are your thoughts on tidytable versus dtplyr?

I'm actually a co-author of dtplyr now and have done a good amount of the updates over the last year or so.

I personally use tidytable because it has more functions/more function arguments. You also don't need to worry about using lazy_dt()/collect().

But dtplyr has some advantages too - it can show you translations and integrates much better with the broader tidyverse (e.g. dbplyr).

From there it's sort of up to you which one to choose.

Another important reason I often stick with dplyr is the tidylog package.

Extension packages will always be an advantage of the tidyverse. There is a much broader user base so more people will make packages that utilize dplyr/tidyr 🤷‍♂️

Hope this helps! If you have any other questions let me know 😄

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.

markfairbanks commented 2 years ago

So under the hood dtplyr and tidytable are the same, in general terms?

Yes, for the most part. There are some differences between translations because dtplyr uses lazy evaluation and tidytable uses eager evaluation, and some places where dtplyr can't make a translation because of how lazy evaluation works.

tidytable also utilizes vctrs (a tidyverse backend package) when it's faster than data.table. dtplyr tries to stick to data.table translations only.

Here are a few examples of things that are limited in dtplyr that are already in tidytable.

statzhero commented 1 year ago

For completeness: the new ".by" dplyr feature has been known for a while but it is now available in the development version.

https://github.com/tidyverse/dplyr/blob/0a55cf597ed4e70183ea3fe47a67b11dbb914d0c/NEWS.md

markfairbanks commented 1 year ago

I'm just glad they named it .by so I don't need to recode anything in tidytable 😂