Closed markfairbanks closed 1 year ago
I'd definitely err on the side of very slow deprecation with large warnings. Haha. One of our analysts upstairs texted me after the 0.9 release...he definitely didn't read the news file and had all of his scripts break since he p_load()
s tidytable
and tidyverse
(not sure on the order). I think he's similar to me in that he goes back and forth on unnest_
variants, etc. based on how well data.table
deals with his underlying data structure (providers love lists of lists of lists of...). I don't think he namespaces any of his stuff, which definitely compounded the issue.
I took a look at box
and mentioned it to him as well. That seems like a potential solution, but going back through our code bases and checking what needs to be called and where is definitely going to be a time sink. Obviously only two people in the sea of package users, but figured I'd share our experiences regardless.
@mattsams89 Yep this is all definitely worth considering.
he definitely didn't read the news file
To be honest I doubt the average R user ever reads a news file (or even knows they exist). That was one of the reasons for the new startup message.
library(tidytable)
#> As of tidytable v0.9.0 dotless versions of functions are exported.
#> You can now use `arrange()`/`mutate()`/etc. directly.
had all of his scripts break since he
p_load()
stidytable
andtidyverse
This is unfortunately a failure point of the update. If you use library()
there is a warning message if you load tidytable
after dplyr
. This gets silenced if you use pacman::p_load()
(which I use as well but I can't work around).
library(dplyr, warn.conflicts = FALSE)
library(tidytable, warn.conflicts = FALSE)
#> Warning: tidytable was loaded after dplyr.
#> This can lead to most dplyr functions being overwritten by tidytable functions.
I think he's similar to me in that he goes back and forth on unnest_ variants
This is also a goal of mine to update (see #631). Hopefully that eases the transition a bit.
Obviously only two people in the sea of package users, but figured I'd share our experiences regardless.
Thanks for the input as always 😄
Deprecating the verb.()
syntax would wreak havoc on dozens of scripts at my organization. They use a mix of dplyr
for fetching data from SQL databases via dbplyr
and tidytable
for manipulating data in memory.
The addition of the dotless functions already creates some headache that we plan to resolve by using conflicted::conflict_prefer("filter", "dplyr")
and similar from the conflicted
package.
Without the verb.()
functions we'd have to always write out the package name, i.e. tidytable::filter() %>% tidytable::arrange()
, which is cumbersome.
@psimm thanks for the input.
I think dbplyr
is one of the biggest reasons I'm leaning towards keeping the verb.()
syntax.
One thing I've considered - if I do deprecate verb.()
, what if I exported a function that would allow you to easily alias packages? Something like import_as()
?
Edit: More robust import_as()
pacman::p_load(rlang, tidytable)
import_as <- function(package, alias) {
package <- as_name(enexpr(package))
funs <- set_names(getNamespaceExports(package))
funs <- map(funs, ~ eval(call2("::", sym(package), sym(.x))))
funs <- as_environment(funs)
env_bind(caller_env(), {{ alias }} := funs)
}
library(dbplyr)
import_as(dplyr, dp)
df <- memdb_frame(x = 1:3, y = 1:3, z = c("a", "a", "b"))
df %>%
dp$select(x, y, z) %>%
dp$mutate(double_x = x * 2) %>%
dp$collect() %>%
summarize(mean_x = mean(x), .by = z)
#> # A tidytable: 2 × 2
#> z mean_x
#> <chr> <dbl>
#> 1 a 1.5
#> 2 b 3
One thing I've considered - if I do deprecate
verb.()
, what if I exported a function that would allow you to easily alias packages? Something likeimport_as()
?
I like your import_as()
because I prefer Python's import system over R's in general. It seems a little odd to add it to tidytable
though, because it's a general package management function rather than a data manipulation function.
Have you seen the package import
(https://import.rticulate.org)? It gives fine-grained control over which functions to import from where and makes it easy to rename functions. Perhaps it could be recommended to tidytable
users.
Hmm yep I agree, it would be a little bit of an odd fit for tidytable
. But it also seems like a good idea to have a solution within tidytable
that people can use. Not sure what the best option is.
Maybe I could export import_as()
and recommend people to use import
or box
for more complicated use cases.
I just saw the changelog 0.9.0. Seems major news to me. I'm slightly worried that no dot verbs will get confusing with the multiple packages using the same verb but I understand it may make more sense overall. This reminds me: what are your thoughts on tidytable
versus dtplyr
?
So far I've been mixing dplyr
and the verb.()
syntax but mostly because I was using it interactively. Second, apart from some missing functions, the main difference I found in practice is to use groups per function e.g. dplyr::mutate
vs dtplyr::mutate
vs tidytable::mutate(.by = var)
. The latter is sometimes quite handy, I even want to say some development version of dplyr::mutate
can do it as well.
Another important reason I often stick with dplyr
is the tidylog package. I don't know if there's anything tidytable
can do to encourage similar feedback, especially for joins.
I'm slightly worried that no dot verbs will get confusing with the multiple packages using the same verb but I understand it may make more sense overall.
There were a few parts to the decision:
1) tidytable now covers the vast majority of functions and function arguments that are in dplyr/tidyr. The original point of the dot was so you could mix dplyr and tidytable syntax - but there is much less of a need for that now. At this point the biggest need for dplyr syntax would be in using dbplyr.
2) tidytable now has a simple explanation - "load tidytable and your existing dplyr/tidyr code will run faster"
3) I had many people reach out saying the verb.()
could get confusing because they weren't always positive what functions were base R, tidyselect, tidytable, etc. So users were trying to use things like mean.()
or starts_with.()
and constantly running into errors. Now users don't have to worry about that.
the main difference I found in practice is to use groups per function e.g. dplyr::mutate vs dtplyr::mutate vs tidytable::mutate(.by = var)
FYI tidytable now offers both of these options for grouping:
library(tidytable, warn.conflicts = FALSE)
#> As of tidytable v0.9.0 dotless versions of functions are exported.
#> You can now use `arrange()`/`mutate()`/etc. directly.
df <- tidytable(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3)
df %>%
summarize(avg_z = mean(z), .by = c(x, y))
#> # A tidytable: 2 × 3
#> x y avg_z
#> <chr> <chr> <dbl>
#> 1 a a 1.5
#> 2 b b 3
df %>%
group_by(x, y) %>%
summarize(avg_z = mean(z)) %>%
ungroup()
#> # A tidytable: 2 × 3
#> x y avg_z
#> <chr> <chr> <dbl>
#> 1 a a 1.5
#> 2 b b 3
This reminds me: what are your thoughts on tidytable versus dtplyr?
I'm actually a co-author of dtplyr now and have done a good amount of the updates over the last year or so.
I personally use tidytable because it has more functions/more function arguments. You also don't need to worry about using lazy_dt()
/collect()
.
But dtplyr has some advantages too - it can show you translations and integrates much better with the broader tidyverse (e.g. dbplyr).
From there it's sort of up to you which one to choose.
Another important reason I often stick with dplyr is the tidylog package.
Extension packages will always be an advantage of the tidyverse. There is a much broader user base so more people will make packages that utilize dplyr/tidyr 🤷♂️
Hope this helps! If you have any other questions let me know 😄
That’s great, thank you. So under the hood dtplyr and tidytable are the same, in general terms?
I suppose another solution is to use say ‘tidytable::complete()’ explicitly if there are only a few steps that need speeding up and where ‘tidyverse’ is particularly slow. This also happened to me.
On Oct 18, 2022, at 23:31, Mark Fairbanks @.***> wrote:
I'm slightly worried that no dot verbs will get confusing with the multiple packages using the same verb but I understand it may make more sense overall.
There were a couple parts to the decision:
tidytable now covers the vast majority of functions and function arguments that are in dplyr/tidyr. The original point of the dot was so you could mix dplyr and tidytable syntax - but there is much less of a need for that now. At this point the biggest need for dplyr syntax would be in using dbplyr. tidytable now has a simple "load tidytable and your existing dplyr/tidyr code will run faster". I had many people reach out saying the verb.() could get confusing because they weren't always positive what functions were base R, tidyselect, tidytable, etc. So users were trying to use things like mean.() or starts_with.() and constantly running into errors. This greatly reduces that issue. the main difference I found in practice is to use groups per function e.g. dplyr::mutate vs dtplyr::mutate vs tidytable::mutate(.by = var)
FYI tidytable now offers both of these options for grouping:
library(tidytable, warn.conflicts = FALSE)
> As of tidytable v0.9.0 dotless versions of functions are exported.
> You can now use
arrange()
/mutate()
/etc. directly.df <- tidytable(x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3)
df %>% summarize(avg_z = mean(z), .by = c(x, y))
> # A tidytable: 2 × 3
> x y avg_z
>
> 1 a a 1.5
> 2 b b 3
df %>% group_by(x, y) %>% summarize(avg_z = mean(z)) %>% ungroup()
> # A tidytable: 2 × 3
> x y avg_z
>
> 1 a a 1.5
> 2 b b 3
This reminds me: what are your thoughts on tidytable versus dtplyr?
I'm actually a co-author of dtplyr now and have done a good amount of the updates over the last year or so.
I personally use tidytable because it has more functions/more function arguments. You also don't need to worry about using lazy_dt()/collect().
But dtplyr has some advantages too - it can show you translations and integrates much better with the broader tidyverse (e.g. dbplyr).
From there it's sort of up to you which one to choose.
Another important reason I often stick with dplyr is the tidylog package.
Extension packages will always be an advantage of the tidyverse. There is a much broader user base so more people will make packages that utilize dplyr/tidyr 🤷♂️
Hope this helps! If you have any other questions let me know 😄
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.
So under the hood dtplyr and tidytable are the same, in general terms?
Yes, for the most part. There are some differences between translations because dtplyr uses lazy evaluation and tidytable uses eager evaluation, and some places where dtplyr can't make a translation because of how lazy evaluation works.
tidytable also utilizes vctrs (a tidyverse backend package) when it's faster than data.table. dtplyr tries to stick to data.table translations only.
Here are a few examples of things that are limited in dtplyr
that are already in tidytable
.
Things that can't be implemented in dtplyr
where()
- you can't know the types of the columns in lazy evaluation as the columns don't actually existbind_rows()
/_cols()
- we can't build an S3 method since that's not how they work in dplyr
crossing()
/expand_grid()
/etc. - things made to work outside of a data frame contextThings that can be implemented eventually in dtplyr
once data.table
translations become available (or better).
fill()
on character/factor/logical columns. In tidytable
I use vctrs::vec_fill_missing()
in the background for these cases. data.table::nafill()
doesn't support these types.unnest()
: data.table
support for unnesting is a bit limited at the moment. In dtplyr
's lazy workflow we can't identify if a list column contains data frames or vectors. tidytable
doesn't have to worry about this since it uses eager evaluation.For completeness: the new ".by" dplyr feature has been known for a while but it is now available in the development version.
https://github.com/tidyverse/dplyr/blob/0a55cf597ed4e70183ea3fe47a67b11dbb914d0c/NEWS.md
I'm just glad they named it .by
so I don't need to recode anything in tidytable
😂
Now that
verb()
syntax is exported, shouldverb.()
syntax be deprecated or kept in the package?Naturally it would be a soft deprecation with a long deprecation lifecycle so it won't break old scripts. It would also be pretty easy to update old scripts with a simple find and replace of
.(
with(
One noteworthy negative of keeping both -
R CMD check
is not a fan of both syntaxes having S3 methods exported as it confusesverb.()
syntax as an "empty class" method (#612). Currentlyverb()
functions cannot be exported as S3 generics, limiting other packages extendingtidytable
.Edit: Probably the easiest thing to do would be to direct people to using box if they need to mix
tidytable
anddplyr
syntax.