Open machow opened 4 years ago
@machow, have you taken a look at the sspipe module? It is a good project and might serve you as a reference.
Ah, I hadn't--thanks, this looks perfect! I think before I was hesitant to bake the pipe behavior into _, but seeing those examples, it def seems worth it :o.
I see sspipe and siuba as groundbreaking packages and I believe the combination of both can be super powerful, so I'm glad it was helpful, @machow
What about not overloading an operator at all, but using a pipe function?
There's already one in the toolz package (in its functoolz module): pipe
For instance:
from siuba.data import mtcars
from siuba import select
from toolz.functoolz import pipe
pipe(
    mtcars,
    select("mpg", "cyl", "disp"),
    lambda _: _.columns.values,
)
I like this in particular because it doesn't require everything to be a pipeable. For instance I can directly pipe into the lambda without using siuba's pipe verb. I don't find it less explicit or harder to read, and it doesn't "abuse" the >> operator.
EDIT: the examples from above work, too:
pipe(
    mtcars,
    mutate(res=_.hp + 2),
    _.set_index("res"),
    _.assign(a="b"),
)
pipe(
    ["a", "b"],
    "".join,
    _.upper(),
)
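For readers who have not used toolz, the idea behind a pipe function can be sketched in a few lines. This is a minimal stand-in written for illustration (the name simple_pipe is hypothetical, and this is not the toolz source): it threads a value through a sequence of one-argument callables, left to right.

```python
from functools import reduce


def simple_pipe(data, *funcs):
    """Pass data through each function in turn, left to right.

    Hypothetical stand-in for illustration; toolz provides its own pipe.
    """
    return reduce(lambda acc, func: func(acc), funcs, data)


# Plain callables work without any wrapper: join the list, then upper-case it.
result = simple_pipe(["a", "b"], "".join, str.upper)
print(result)
```

Because each step only needs to be callable, lambdas, bound methods, and library verbs can all be mixed freely, which is the property the comment above is pointing at.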
Hmm.. yeah--so the naming may be unfortunate, since siuba's pipe is meant to be analogous to the DataFrame.pipe method, but I think your suggestion makes sense.
I'm thinking about for now adding a function, call, to do what the pandas' pipe method does, while also allowing chaining with >>. So to begin with...
from siuba import *
from siuba.data import mtcars
def some_func(a, data):
    print(a)
    return data
# using new call function
mtcars >> call(some_func, 1, data=_)
# equivalent to
pipe(mtcars, lambda _: some_func(1, data=_))
# longwinded support for method chaining / whatnot
mtcars >> call(_[_.gear < 4])
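To make the proposal concrete, here is a rough sketch of how a call-style helper could behave. Everything in it is an assumption for illustration: the Call class and the DATA placeholder are hypothetical names standing in for call and _, and this is not siuba's implementation. It stores a function with its arguments, then substitutes the piped value wherever the placeholder appears.

```python
class Placeholder:
    """Marks the argument slot that the piped data should fill."""


DATA = Placeholder()  # hypothetical stand-in for siuba's _


class Call:
    """Hypothetical sketch of a call-like verb, not siuba's implementation."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __call__(self, data):
        # Swap the placeholder for the piped data, positionally or by keyword.
        args = [data if a is DATA else a for a in self.args]
        kwargs = {k: (data if v is DATA else v) for k, v in self.kwargs.items()}
        return self.func(*args, **kwargs)

    def __rrshift__(self, data):
        # Runs for `data >> Call(...)` when data does not handle >> itself.
        return self(data)


def some_func(a, data):
    print(a)
    return data


result = [1, 2, 3] >> Call(some_func, 1, data=DATA)
```

The sketch relies on the left operand not defining >> itself; an object that does handle >> (a NumPy array, for example) would take over the operator before __rrshift__ is tried.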
Then, once I finish a big refactor of siuba's internals (that's almost done?! 😬), I think it will be easier to add in the sspipe-like behavior...
# with sspipe behavior
(mtcars
    >> _[_.gear < 4]
    >> call(some_func, 1, data=_)
)
# with pipe
pipe(
    mtcars,
    _[_.gear < 4],
    lambda d: some_func(1, data=d),
)
I think that intuitively, there's something that feels simpler to me about reading code with the overloaded operator. It seems like knowing that things execute piece by piece (since the pipe w/ >> is all binary operations) feels simpler than an outer function (even if it's a dead simple function). But there are also a lot of places a pipe function would be useful (and I could be totally wrong with preferring an operator ;).
It seems like knowing that things execute piece by piece (since the pipe w/ >> is all binary operations) feels simpler than an outer function (even if it's a dead simple function).
pipe() can be useful indeed, but I agree with the above comment. Also, the statements in pipe() are separated by , which is visually confusing.
It seems like knowing that things execute piece by piece (since the pipe w/ >> is all binary operations) feels simpler than an outer function (even if it's a dead simple function).
I know what you mean, I'm just wondering if a pure python developer who has never seen dplyr in action would not find the overloaded >> more confusing.
In any case, your proposal from above sounds great! It also seems that the sspipe behaviour will "just work" with the functoolz.pipe function without additional effort. I also like the idea of renaming siuba.pipe to siuba.call. Even if it does not match the pandas API, I find it has the clearer semantics.
btw, the macropy solution here also looks intriguing. but probably you've already seen it at some point.
Related issues
246
Below I lay out three challenges for piping. As a precursor to them it's worth noting that there is a delicate balance between a piping strategy that is...
Ideally a pipe should
f = some_pipe; f(data)
data >> some_pipe
some_pipe >> pipe(_.method1().method2())
some_pipe >> _.method1().method2()
(all examples for illustration, not saying that's how it should happen)
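As a rough illustration of the first two forms listed above, a pipe object can be made both callable and usable on the right of >>. The Pipeable class below is a hypothetical sketch under those assumptions, not siuba's actual class: data >> pipe evaluates eagerly, while pipe >> pipe composes lazily into a new pipe.

```python
class Pipeable:
    """Hypothetical sketch: a pipe that is callable and composable with >>."""

    def __init__(self, func):
        self.func = func

    def __call__(self, data):
        # f = some_pipe; f(data)
        return self.func(data)

    def __rrshift__(self, data):
        # data >> some_pipe: evaluate eagerly.
        return self(data)

    def __rshift__(self, other):
        # some_pipe >> other: build a new lazy pipe that runs both in order.
        return Pipeable(lambda data: other(self(data)))


double_all = Pipeable(lambda xs: [x * 2 for x in xs])
total = Pipeable(sum)

some_pipe = double_all >> total      # lazy: nothing runs yet
as_function = some_pipe([1, 2, 3])   # called like a function
as_operator = [1, 2, 3] >> some_pipe # same pipe, operator form
```

Both forms go through the same __call__, so the function-style and operator-style results agree.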
With method chains
Suppose we want to do a mutate, and then set it as the index...
If we wanted to keep chaining data frame methods, we would have to either...
This is because the dot operator has higher precedence than >>, so gets evaluated first.

One solution could be adding a syntax where method chaining off a pipe, like pipe(...).method1().method2(), produced a pipe. But I don't think we need more getattr magic.

A final option is presented in the "Simplifying piping to a Symbol" section. This would be to make a Symbol's default behavior for >> to be to produce a Pipeable.

Eager piping with two starting non-pipe funcs
the conditions below are useful for a lazy, but not an eager pipe
this is because...
>> is left associative

If the reverse case << were supported, it would be fine...

But using both approaches would create a dumb decision for users.
Another workaround would be putting the lazy pipe part in parentheses...
But this removes an advantage of the normal eager pipe, which is it evaluates line-by-line. That is...
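The two language rules this discussion leans on (attribute access and calls bind tighter than >>, and >> groups from the left) can be checked with a small tracing class. Traced is a hypothetical helper written only for this illustration; it records each operation in the order Python performs it.

```python
class Traced:
    """Hypothetical helper that logs the order and grouping of operations."""

    log = []  # shared record of operations, in evaluation order

    def __init__(self, label):
        self.label = label

    def method(self):
        new = Traced(f"{self.label}.method()")
        Traced.log.append(new.label)
        return new

    def __rshift__(self, other):
        new = Traced(f"({self.label} >> {other.label})")
        Traced.log.append(new.label)
        return new


# The method call on the right-hand side runs before >> is applied.
Traced.log.clear()
Traced("data") >> Traced("f").method()
precedence_order = list(Traced.log)

# A chain of >> is grouped from the left: (a >> b) >> c.
grouped = Traced("a") >> Traced("b") >> Traced("c")
print(precedence_order)
print(grouped.label)
```

Wrapping part of a chain in parentheses changes the grouping, which is the workaround mentioned above.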
Simplifying piping to a Symbol or Call
If we wanted to declutter piping, we could change the way >> worked on a symbol to go from...
to
This would add an extra caveat to siu expressions (_), which right now basically have very few rules to learn.