tb vs nakedpipe - GitHub issues

moodymudskipper commented 4 years ago

I'm now realising that nakedpipe and tb share a significant overlap.

https://github.com/moodymudskipper/tb

The shortcuts to mutate and filter are the clearest examples:

https://github.com/moodymudskipper/nakedpipe/issues/18

Things that would not be straightforward to convert :

self referencing in mutating/summarizing calls couldn't use ., we could use ..
anything using the by argument (to, summarize more than once, spread etc)
mutate by ("along" in tb-speak)
transmute, select
slice
the functional feature in tb uses a ~~ syntax; we can't use it for different things, but we could use +~

Some of those things can be worked around BUT it might be easier to be able to support tb easily :

iris %.% {
  ...
  .tb[...]
  ...

}

it would be unambiguous, and analogous to .[]

We could support data.table too :

iris %.% {
  ...
  .dt[...]
  ...

}

both .tb[...] and .dt[...] will return an object of the original class, but they'll convert to tb or data.table for the time of the operation.

Right now I think .tb[...] and .dt[...] are good ideas, but .tb[] shouldn't be advertised or documented until tb and nakedpipe behave in more similar ways, with syntax that doesn't clash.

We can already implement them though, it won't break anything.

Expanding nakedpipe with features inspired by tb, like renaming, splicing with unary + or vectorizing with +~, is a separate thing, more complicated and more confusing, and if we just need .tb[...] to get those anyway, the value of the shortcuts is limited.

moodymudskipper commented 4 years ago

.tb[...] and .dt[...] are implemented.

I'm not sure about the rest. It could be interesting, but it might lead to situations where data manipulation shorthands are hard to distinguish from regular piped function calls, because the steps become too sophisticated. Also, tb is still a mess and might be forever, so we cannot commit to being consistent yet; we're safe if we stay as we are for now.

Some ideas for later :

using "self" in mutate calls by having function on rhs

iris %.% {Species = toupper} # same as iris %.% {Species = toupper(Species)}
iris %.% {Species = ~substr(., 1, 2)} # same as iris %.% {Species = substr(Species, 1 , 2)}

This is unambiguous because data.frame columns cannot be functions nor formulas
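A minimal base R sketch of how such a step could be resolved, assuming a hypothetical helper `mutate_self()` that is not part of nakedpipe: since a column can be neither a function nor a formula, the type of the rhs is enough to decide what to do.

```r
# Hypothetical helper (not nakedpipe's implementation): resolve `col = rhs`.
mutate_self <- function(data, col, rhs) {
  if (is.function(rhs)) {
    # function on the rhs: apply it to the column itself
    data[[col]] <- rhs(data[[col]])
  } else if (inherits(rhs, "formula")) {
    # one-sided formula: evaluate its body with `.` bound to the column
    data[[col]] <- eval(rhs[[2]], list(. = data[[col]]), environment(rhs))
  } else {
    # anything else is a regular value, assigned as usual
    data[[col]] <- rhs
  }
  data
}

out_fun     <- mutate_self(head(iris), "Species", toupper)
out_formula <- mutate_self(head(iris), "Species", ~substr(., 1, 2))
```

Here `out_fun` mirrors `iris %.% {Species = toupper}` and `out_formula` mirrors the formula variant, on the first rows of iris only.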

`?` expressions to mutate_at / if

iris %.% {?"^P" = sqrt} # apply sqrt on cols starting with "P"
iris %.% {?is.numeric = sqrt} # apply on num cols
iris %.% {?~is.numeric(.) = ~sqrt(.)} # same

I think it's possible only since they fixed the precedence of ? in R 4
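The selection logic behind these `?` steps could look roughly like the sketch below. `mutate_where()` is a hypothetical name used for illustration; it only covers the regex and plain-function selectors, not the formula form.

```r
# Hypothetical helper: apply `fun` to the columns picked by `selector`,
# which is either a regular expression (matched against column names)
# or a predicate function (applied to each column).
mutate_where <- function(data, selector, fun) {
  cols <- if (is.character(selector)) {
    grep(selector, names(data), value = TRUE)
  } else {
    names(data)[vapply(data, selector, logical(1))]
  }
  data[cols] <- lapply(data[cols], fun)
  data
}

by_regex     <- mutate_where(head(iris), "^P", sqrt)       # cols starting with "P"
by_predicate <- mutate_where(head(iris), is.numeric, sqrt) # numeric cols
```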

`:=` to evaluate lhs and use it as "at"

iris %.% {c("Sepal.Length", "Sepal.Width") := sqrt}
iris %.% {1:3 := sqrt} # by numeric index
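R's parser accepts `:=` as an operator even though base R defines no function for it, so a step like this can be captured unevaluated and taken apart. A rough sketch, assuming a hypothetical `mutate_at_step()` helper:

```r
# Hypothetical helper: `step` is a quoted `lhs := fun` expression.
mutate_at_step <- function(data, step) {
  stopifnot(identical(as.character(step[[1]]), ":="))
  at  <- eval(step[[2]])  # evaluated lhs: column names or numeric indices
  fun <- eval(step[[3]])  # function to apply to each selected column
  data[at] <- lapply(data[at], fun)
  data
}

by_name  <- mutate_at_step(head(iris), quote(c("Sepal.Length", "Sepal.Width") := sqrt))
by_index <- mutate_at_step(head(iris), quote(1:3 := sqrt))
```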

moodymudskipper commented 4 years ago

About aggregation, this doesn't look too bad :

test <- iris %.% {
  data.frame(mean_sl = mean(Sepal.Length), mean_pl = mean(Petal.Length)) ~ Species
}

The idea is that you're stacking the created data frames by Species, but it's long so we can easily confuse it with regular pipe steps, i.e. we want to insert a dot.

This is not bad either :

test <- iris %.% {
  summarize(mean_sl = mean(Sepal.Length), mean_pl = mean(Petal.Length)) ~ Species
}

In this case the suffix ~ Species is really just a way to group_by for a single step, but it depends on dplyr while challenging dplyr's idiomatic syntax, so not perfect either.

We could summon ? to the rescue and do :

test <- iris %.% {
  ?data.frame(mean_sl = mean(Sepal.Length), mean_pl = mean(Petal.Length)) ~ Species
}

but ? doesn't convey the right intuition, we have also ++ or -- which we can use.

We can have our own verb :

test <- iris %.% {
  agg(mean_sl = mean(Sepal.Length), mean_pl = mean(Petal.Length)) ~ Species
}

Or we can use .() to keep using the dot :

test <- iris %.% {
  .(mean_sl = mean(Sepal.Length), mean_pl = mean(Petal.Length)) ~ Species
}

This is really compact and nice, and doesn't look like other calls, but between the .() of bquote that we use in tb, and the .() of data.table, we might be confusing too

moodymudskipper commented 4 years ago

Or we keep current behavior but add :

test <- iris %.% {
  {
    mean_sl = mean(Sepal.Length)
    mean_pl = mean(Petal.Length)
  } ~ Species
}

This keeps the syntax similar to the transform shorthand, and this would still work as it does now :

test <- iris %.% {
  mean_sl = mean(Sepal.Length) ~ Species
}

The main issue here is that {} is already used to mean "don't insert dots", as in magrittr. It's not ambiguous because here we'd pipe to ~ anyway, but maybe confusing ?

We could prefix it by something, and this something might indicate if we mutate by or summarize by. This parses ok:

test <- iris %.% {
  agg:{
    mean_sl = mean(Sepal.Length)
    mean_pl = mean(Petal.Length)
  } ~ Species
}

test <- iris %.% {
  along:{
    mean_sl = mean(Sepal.Length)
    mean_pl = mean(Petal.Length)
  } ~ Species
}

moodymudskipper commented 4 years ago

It might be worth showing the alternatives here to see whether it's all worth it:

test <- iris %.% {
  group_by(Species)
  summarize(
    mean_sl = mean(Sepal.Length),
    mean_pl = mean(Petal.Length),
  )
}

test <- iris %.% {
  .dt[, .(
    mean_sl = mean(Sepal.Length),
    mean_pl = mean(Petal.Length)
  ), by = Species]
}

test <- iris %.% {
  .tb[
    mean_sl = mean(Sepal.Length),
    mean_pl = mean(Petal.Length),
    .by = Species]
}

I think tb wins and .dt is quite good already, so maybe better wait for tb and not introduce more exotic stuff
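For reference, a dependency-free baseline for the same aggregation can be written with base R's `aggregate()`. This is only a point of comparison, not something proposed in the thread:

```r
# Base R baseline: naming the columns inside cbind() names the outputs.
test <- aggregate(
  cbind(mean_sl = Sepal.Length, mean_pl = Petal.Length) ~ Species,
  data = iris,
  FUN = mean
)
```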

moodymudskipper commented 3 years ago

A few months later I like this one :

test <- iris %.% {
  summarize(mean_sl = mean(Sepal.Length), mean_pl = mean(Petal.Length)) ~ Species
}

I said above:

> In this case the suffix ~ Species is really just a way to group_by for a single step, but it depends on dplyr while challenging dplyr's idiomatic syntax, so not perfect either.

But it's not true, it doesn't have to depend on dplyr.

Also it doesn't have to be limited to summarize().

The way we can do it is that when piping to ~, we split the input (fail if it's not a data.frame) by the vars we find in the rhs of ~, then apply the lhs of ~ to each piece, then bind the results back together. So we can also "mutate by" using transform(), for instance, and then there is no dependency.
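The split / apply / bind idea can be sketched in base R as follows. `apply_by()` is a hypothetical name, and the step is passed as a plain function rather than being parsed from the lhs of `~`, so this only illustrates the mechanism:

```r
# Hypothetical helper: split `data` by the `by` columns, apply `step`
# to each piece, then bind the pieces back together.
apply_by <- function(data, by, step) {
  stopifnot(is.data.frame(data))
  pieces <- split(data, data[by], drop = TRUE)
  out <- do.call(rbind, lapply(pieces, step))
  rownames(out) <- NULL
  out
}

# "mutate by" with transform(), no dplyr needed
centered <- apply_by(iris, "Species", function(d) {
  transform(d, sl_centered = Sepal.Length - mean(Sepal.Length))
})

# "summarize by": each piece collapses to one row
sums <- apply_by(iris, "Species", function(d) {
  data.frame(Species = d$Species[1], ssw = sum(d$Sepal.Width))
})
```

The same helper covers both cases because the only difference is how many rows the step returns per group.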

This case which works now but is undocumented could still work :

iris %.% {sum(Sepal.Width) ~ Species}
#>      Species sum(Sepal.Width)
#> 1     setosa            171.4
#> 2 versicolor            138.5
#> 3  virginica            148.7

And we could have those work too as summarize calls :

iris %.% {ssw = sum(Sepal.Width) ~ Species}
#>      Species              ssw
#> 1     setosa            171.4
#> 2 versicolor            138.5
#> 3  virginica            148.7

iris %.% {
  { 
    ssw = sum(Sepal.Width) 
    sum(Sepal.Length) 
  } ~ Species
}

The last two might make the user think we're doing a "mutate by", so I'm not so sure about those, but "mutate by" is not used as frequently and I think it should be intuitive enough.

moodymudskipper commented 3 years ago

From the SO post comparing data.table and dplyr :

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]

Our equivalent :

diamonds %.% {
  cut != "Fair"
  {
    AvgPrice = mean(price)
    MedianPrice = as.numeric(median(price))
    Count = nrow(.)
  } ~ cut
  .[order(-Count),]
}

moodymudskipper commented 3 years ago

Let's close this and brainstorm in more specific threads. I'm leaving the .tb feature as an Easter egg for now, but I think I can do everything, or most of it, in nakedpipe, and I don't have to try to be consistent with tb because their functionalities overlap and I might not work on it any more.

github-actions[bot] commented 2 years ago

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

moodymudskipper / nakedpipe

tb vs nakedpipe #21
