machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License

Composing pipes #207

Open machow opened 4 years ago

machow commented 4 years ago

Related issues

Below I lay out three challenges for piping. As a precursor to them it's worth noting that there is a delicate balance between a piping strategy that is...

Ideally a pipe should

(all examples for illustration, not saying that's how it should happen)

With method chains

Suppose we want to do a mutate, and then set it as the index...

from siuba.data import mtcars
from siuba import _, mutate, pipe

(
    mtcars 
    >> mutate(res = _.hp + 2)
    >> pipe(_.set_index("res"))
)

If we wanted to keep chaining data frame methods, we would have to either...

  1. add a pipe for every method in the chain
  2. add a single pipe, with the whole chain

# approach 1
(
    mtcars 
    >> mutate(res = _.hp + 2)
    >> pipe(_.set_index("res"))
    >> pipe(_.assign(a = "b"))
)

# approach 2
(
    mtcars 
    >> mutate(res = _.hp + 2)
    >> pipe(_
      .set_index("res")
      .assign(a = "b")
    )
   # or  pipe(_.set_index(...).assign(...))
)

This is because the dot operator has higher precedence than >>, so it gets evaluated first.
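To make the precedence point concrete, here is a small sketch that only uses the standard library's ast module to show how Python parses such an expression. The names data, pipe and _ inside the string are just text for the parser, not real siuba objects.

```python
import ast

# Parse (but do not run) an expression that chains a method off pipe(...).
source = 'data >> pipe(_.set_index("res")).assign(a="b")'
top = ast.parse(source, mode="eval").body

# The outermost node is the >> operation...
is_rshift = isinstance(top, ast.BinOp) and isinstance(top.op, ast.RShift)

# ...and its right operand is the whole `pipe(...).assign(...)` call,
# so `.assign` is applied to the pipe object, not to the piped result.
right_is_call = isinstance(top.right, ast.Call)
right_method = top.right.func.attr

print(is_rshift, right_is_call, right_method)
```

In other words, the method call is attached to whatever pipe(...) returns before >> ever sees the data on the left.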

One solution could be adding a syntax where method chaining off a pipe, like pipe(...).method1().method2() produced a pipe. But I don't think we need more getattr magic.

A final option is presented in the "Simplifying piping to a Symbol or Call" section: make the default behavior of >> on a Symbol be to produce a Pipeable.

Eager piping with two starting non-pipe funcs

The case below works for a lazy pipe, but not for an eager one.

# lazy case works
f = "".join >> pipe(_.upper())
f(['a', 'b'])

# eager case raises error
['a', 'b'] >> "".join >> pipe(_.upper())

this is because...

If the reverse case << were supported, it would be fine...

pipe(_.upper()) << "".join << ['a', 'b']

But supporting both operators would leave users with a needless decision about which one to use.

Another workaround would be putting the lazy pipe part in parentheses...

['a', 'b'] >> ("".join >> pipe(_.upper()))

But this removes an advantage of the normal eager pipe, which is that it evaluates line by line. That is...

# very explicit, reference implementation
(
    ['a', 'b']
    >> pipe("".join)                # runs in python first
    >> pipe(_.uppsldkfjer())        # runs in python second, and raises here
    >> pipe(_.upper())              # never runs: the line above already errored
)
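The lazy, eager and parenthesized cases from this section can be reproduced with a toy class. This is a minimal sketch, not siuba's implementation: Pipeable here is a hypothetical stand-in that treats a callable on the left of >> as something to compose with, and anything else as data to evaluate on.

```python
class Pipeable:
    """Toy pipe object (hypothetical, not siuba's Pipeable)."""

    def __init__(self, func):
        self.func = func

    def __call__(self, x):
        return self.func(x)

    def __rshift__(self, right):
        # Pipeable >> callable: compose lazily
        return Pipeable(lambda x: right(self.func(x)))

    def __rrshift__(self, left):
        # callable >> Pipeable: compose lazily
        if callable(left):
            return Pipeable(lambda x: self.func(left(x)))
        # data >> Pipeable: evaluate eagerly
        return self.func(left)


pipe = Pipeable
upper = pipe(str.upper)

# Lazy case: "".join has no >>, so Python falls back to Pipeable.__rrshift__
lazy = "".join >> upper
lazy_result = lazy(["a", "b"])

# Eager case: >> is left-associative, so list >> "".join is tried first,
# and neither a list nor a bound method knows what >> means
try:
    ["a", "b"] >> "".join >> upper
    eager_error = None
except TypeError as exc:
    eager_error = exc

# Parenthesized workaround: build the lazy pipe first, then feed it data
grouped_result = ["a", "b"] >> ("".join >> upper)
```

The sketch also shows the cost of the callable check: data that happens to be callable would be composed rather than evaluated.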

Simplifying piping to a Symbol or Call

If we wanted to declutter piping, we could change the way >> worked on a symbol to go from...

"".join >> pipe(_.upper())

to

"".join >> _.upper()

This would add an extra caveat to siu expressions _, which right now basically have very few rules to learn.

References

GitHunter0 commented 3 years ago

@machow, have you taken a look at the sspipe module? It is a good project and might serve as a reference.

machow commented 3 years ago

Ah, I hadn't--thanks, this looks perfect! I think before I was hesitant to bake the pipe behavior into _, but seeing those examples, it def seems worth it :o.

GitHunter0 commented 3 years ago

I see sspipe and siuba as groundbreaking packages and I believe the combination of both can be super powerful, so I'm glad it was helpful, @machow

grst commented 3 years ago

What about not overloading an operator at all, but using a pipe function? There's already one in the toolz package (toolz.functoolz): pipe

For instance:

from siuba.data import mtcars
from siuba import select
from toolz.functoolz import pipe
pipe(
    mtcars,
    select("mpg", "cyl", "disp"),
    lambda _: _.columns.values,
)

I like this in particular because it doesn't require everything to be a pipeable. For instance I can directly pipe into the lambda without using siuba's pipe verb. I don't find it less explicit or harder to read, and it doesn't "abuse" the >> operator.
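For readers without toolz installed, the idea of a pipe function can be approximated with the standard library. This is only a sketch of the concept, not toolz's source; pipe_all is a hypothetical name chosen to avoid clashing with either library's pipe.

```python
from functools import reduce


def pipe_all(data, *funcs):
    """Thread data through funcs from left to right."""
    return reduce(lambda value, func: func(value), funcs, data)


# Any plain callable works as a step, with no Pipeable wrapper required
result = pipe_all(["a", "b"], "".join, str.upper)
```

Because each step is just called in turn, evaluation is still step by step, but the steps are arguments of one outer call rather than operands of >>.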


EDIT: the examples from above work, too:

pipe(
    mtcars,
    mutate(res=_.hp + 2),
    _.set_index("res"),
    _.assign(a="b"),
)

pipe(
    ["a", "b"],
    "".join,
    _.upper(),
)

machow commented 3 years ago

Hmm.. yeah--so the naming may be unfortunate, since siuba's pipe is meant to be analogous to the DataFrame.pipe method, but I think your suggestion makes sense.

For now, I'm thinking about adding a function, call, that does what pandas' pipe method does, while also allowing chaining with >>. So to begin with...

from siuba import *
from siuba.data import mtcars

def some_func(a, data):
    print(a)
    return data

# using new call function
mtcars >> call(some_func, 1, data=_)

# equivalent to (using the toolz pipe function, not siuba's pipe)
pipe(mtcars, lambda _: some_func(1, data=_))

# longwinded support for method chaining / whatnot
mtcars >> call(_[_.gear < 4])

Then, once I finish a big refactor of siuba's internals (that's almost done?! 😬), I think it will be easier to add in the sspipe-like behavior...

# with sspipe behavior
(mtcars
    >> _[_.gear < 4]
    >> call(some_func, 1, data=_)
)

# with pipe
pipe(
    mtcars,
    _[_.gear < 4],
    lambda d: some_func(1, data=d),
)

I think that intuitively, there's something that feels simpler to me about reading code with the overloaded operator. It seems like knowing that things execute piece by piece (since the pipe w/ >> is all binary operations) feels simpler than an outer function (even if it's a dead simple function). But there are also a lot of places a pipe function would be useful (and I could be totally wrong with preferring an operator ;).

GitHunter0 commented 3 years ago

It seems like knowing that things execute piece by piece (since the pipe w/ >> is all binary operations) feels simpler than an outer function (even if it's a dead simple function).

pipe() can be useful indeed, but I agree with the above comment. Also, the statements in pipe() are separated by commas, which is visually confusing.

grst commented 3 years ago

It seems like knowing that things execute piece by piece (since the pipe w/ >> is all binary operations) feels simpler than an outer function (even if it's a dead simple function).

I know what you mean, I'm just wondering if a pure python developer who has never seen dplyr in action would not find the overloaded >> more confusing.

In any case, your proposal from above sounds great! It also seems that the sspipe behaviour will "just work" with the functoolz.pipe function without additional effort. I also like the idea of renaming siuba.pipe to siuba.call. Even if it does not match the pandas API, I find it has clearer semantics.


Btw, the macropy solution here also looks intriguing, but you've probably already seen it at some point.