pytoolz / toolz

A functional standard library for Python.
http://toolz.readthedocs.org/

Variadic argument conventions #91

Open eriknw opened 11 years ago

eriknw commented 11 years ago

I have been thinking about variadic arguments with a goal of making the itertoolz API "just work" in a way that is user-friendly and intuitive. I am proposing a uniform API that allows functions to be used variadically and non-variadically by accepting a sequence container of iterables.

In dicttoolz, merge and merge_with accept dicts variadically as *dicts. This may be used variadically and non-variadically. For example, these are equivalent: merge(d1, d2) and merge([d1, d2]). This is very convenient. Furthermore, there can be no ambiguity over what is intended, because it checks input for dict type.
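A minimal sketch of the kind of dispatch this describes (not toolz's actual implementation): a dict-type check on the first argument removes the ambiguity between the variadic and non-variadic forms.

```python
def merge(*dicts):
    # If the lone argument is not itself a dict, it must be a
    # collection of dicts: the non-variadic form.
    if len(dicts) == 1 and not isinstance(dicts[0], dict):
        dicts = dicts[0]
    result = {}
    for d in dicts:
        result.update(d)  # later dicts win on key conflicts
    return result

d1 = {'a': 1}
d2 = {'b': 2, 'a': 10}
assert merge(d1, d2) == merge([d1, d2]) == {'a': 10, 'b': 2}
```

Because arguments are type-checked, no input can be mistaken for the other form.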

In itertoolz, we cannot check for types as in dicttoolz, so the same strategy of allowing variadic and non-variadic input can result in ambiguities. Ambiguities can lead to incorrect edge cases that stymie novices and experts alike. This is bad... right?

An API that follows several competing conventions is also bad. It makes the library harder to learn and use. When does a function accept multiple inputs variadically? Which ones require a sequence of iterables? Should there be duplicates of functions that handle inputs differently? Why does X do this and not Y? A unified flexible API cannot completely avoid edge cases under all uses, but it can control them, and I believe the benefits are worth the costs.

First, a summary. Here is a list of itertoolz (and itertools) functions that accept variadic input of iterables: merge_sorted, intersection, concatv (chain, map, zip, product). Here are the functions that accept a sequence of iterables: interleave, concat, mapcat (chain.from_iterable, starmap).

TL;DR: Read Here

The proposal: if the variadic argument is a single sequence container of iterables, then unpack the container and treat the iterables as arguments. (If we document this somewhere, somebody else should come up with better wording!). For example, f([[1, 2], [3, 4]]) is the same as f([1, 2], [3, 4]). The simplest way to show this condition programmatically is:

def f(*seqs):
    try:
        # A lone argument whose first element is iterable is taken to be
        # a sequence container of iterables, so unpack it.
        if len(seqs) == 1 and iter(seqs[0][0]):
            seqs = seqs[0]
    except TypeError:
        pass

Note that we can add more conditionals to the above in order to avoid the performance penalty of raising and catching exceptions.

Behaviors:

  1. always works if iterators over data are used
  2. always works if non-variadic form is used
  3. always works with more than one input (for variadic and non-variadic)
  4. always works if variadic form is used with one input that is not a nested sequence

Hence, the only failure is when the variadic form is attempted on a single nested container. For example, f([1, 2]) will work, but f([(1, 2)]) will not, because a single nested sequence will be interpreted as a sequence of iterables.
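Filling in the earlier sketch as a complete `concat` makes the single failure mode concrete (this is the proposal applied to `itertools.chain.from_iterable`, not current toolz behavior):

```python
import itertools

def concat(*seqs):
    # Proposed dispatch: unpack a lone sequence container of iterables.
    try:
        if len(seqs) == 1 and iter(seqs[0][0]):
            seqs = seqs[0]
    except TypeError:
        pass
    return itertools.chain.from_iterable(seqs)

assert list(concat([1, 2])) == [1, 2]              # flat input: fine
assert list(concat([1, 2], [3, 4])) == [1, 2, 3, 4]
# The one ambiguous case: a lone nested container is unpacked.
assert list(concat([(1, 2)])) == [1, 2]            # not [(1, 2)]
```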


Hence, failures can easily be avoided by not expanding the arguments (i.e., the *seqs operation).

There is one final ambiguity: f([]). Is this an empty list, or no input? We can make the two cases equivalent.

We lose one more thing by using this convention: seqs cannot generally be an iterator of iterables, because we require it to be a sequence container of iterables. This is the same convention used in dicttoolz.merge, which accepts a sequence--but not an iterator--of dicts. Fortunately, an iterator of iterables is easily handled as f(list(seqs)) or f(*list(seqs)), although the latter version with unpacking has the same conditions as stated earlier. The reason for not supporting an iterator of iterables is simple: nested data is common, and we should support it as painlessly as possible without introducing weird edge cases. As long as one works with iterators over data (or does not use the variadic form, or explicitly uses the variadic form with more than one input), this convention is guaranteed to work regardless of data type.
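A sketch of this limitation, using the proposed dispatch from above (illustrative, not current toolz behavior): an iterator cannot be indexed, so the dispatch cannot peek at its first element and leaves it as a single variadic argument.

```python
import itertools

def concat(*seqs):
    # Proposed dispatch: unpack a lone sequence container of iterables.
    try:
        if len(seqs) == 1 and iter(seqs[0][0]):
            seqs = seqs[0]
    except TypeError:
        pass
    return itertools.chain.from_iterable(seqs)

# An *iterator* of iterables: seqs[0][0] raises TypeError, so the
# iterator is treated as one argument and only flattened one level.
assert list(concat(iter([[1, 2], [3, 4]]))) == [[1, 2], [3, 4]]

# Materializing it with list() restores the non-variadic form.
assert list(concat(list(iter([[1, 2], [3, 4]])))) == [1, 2, 3, 4]
```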

I don't know about you guys, but this convention would "just work" for virtually everything I do. However, I don't doubt that edge cases can arise, and my work flow is not necessarily the same as yours or other users. I don't claim to have thought about everything. I welcome discussion and counter-examples.

mrocklin commented 11 years ago

This is a nice writeup and brings up the central issues.

Another failing example

    # Straightforward
    >>> data = [['abc', 'def'], ['hij', 'lmn']]
    >>> list(concat(data))
    ['abc', 'def', 'hij', 'lmn']

    # Issue arises
    >>> data = [['abc', 'def']]
    >>> list(concat(data))
    ['a', 'b', 'c', 'd', 'e', 'f']  # Got
    ['abc', 'def']  # Wanted

The alternative taken by concat and concatv is to separate each function into two functions, one that takes a single iterable (def f(seqs)) and one that takes several iterables (def fv(*seqs)). This is a more explicit but less convenient solution.
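A minimal sketch of this two-function convention (toolz's actual implementations may differ):

```python
import itertools

def concat(seqs):
    """Non-variadic: takes a single iterable of iterables."""
    return itertools.chain.from_iterable(seqs)

def concatv(*seqs):
    """Variadic: takes each iterable as a separate argument."""
    return concat(seqs)

# No ambiguity: each function has exactly one calling convention.
assert list(concat([['abc', 'def']])) == ['abc', 'def']
assert list(concatv('abc', 'def')) == ['a', 'b', 'c', 'd', 'e', 'f']
```

The cost is namespace clutter (two names per operation); the benefit is that neither function ever has to guess what the caller meant.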

Do you have any thoughts on this @eigenhombre ?

eigenhombre commented 11 years ago

In Matt's example, concat(data) could also just be [['abc', 'def'], ['hij', 'lmn']] if concat's args were variadic, à la Clojure. It seems reasonable to me to call concat on a single, nested data structure and to want it to be a NOP (e.g., the terminating case of a recursive function).

My gut instinct is that we're trying to make things easier[1] here, at the cost of adding complexity. And don't forget the Zen of Python: "explicit is better than implicit."

These things are also confusing in the Clojure API. See, e.g., this discussion[2] about a queue creation function. I think in general, you have to know what arguments the function expects. Consistency in the API certainly helps.

[1] http://www.infoq.com/presentations/Simple-Made-Easy [2] http://dev.clojure.org/jira/browse/CLJ-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#issue-tabs

eriknw commented 11 years ago

@mrocklin, I don't think your failing example actually fails. I'll demonstrate:

>>> import itertools
>>> 
>>> def concat(*seqs):
...     try:
...         if len(seqs) == 1 and iter(seqs[0][0]):
...             seqs = seqs[0]
...     except TypeError:
...         pass
...     return itertools.chain.from_iterable(seqs)
... 
>>> data1 = [['abc', 'def'], ['hij', 'lmn']]
>>> data2 = [['abc', 'def']]
>>> list(concat(data1))
['abc', 'def', 'hij', 'lmn']
>>> list(concat(data2))
['abc', 'def']
eriknw commented 11 years ago

Continuing the above, it only fails if one unpacks the arguments, which is unnecessary:

>>> list(concat(*data1))
['abc', 'def', 'hij', 'lmn']
>>> list(concat(*data2))
['a', 'b', 'c', 'd', 'e', 'f']

This will still work if we provide iterators instead of containers:

>>> data3 = [iter(['abc', 'def'])]  # like `data2` above, but with an iterator
>>> list(concat(*data3))
['abc', 'def']
eriknw commented 11 years ago

Zen: "Complex is better than complicated."

The current API is complicated. Virtually all of the possible conventions are used somewhere, and that's not even counting functoolz or itertools.

I think it is easier to learn and use a single convention instead of learning the specifics of a dozen functions. Plus, more functions will certainly be added to toolz. How does one choose which of the above conventions to use? Why was the current API chosen in the first place?

Zen: "Although practicality beats purity."

A pure convention would be for all functions to be either variadic or non-variadic. But this is not always practical.

I am afraid the extent of my analysis above may give the impression that the proposed convention is too complex. It is not. I did, however, want to be complete and honest in my first presentation of it. There is no urgency to this issue; please take your time to think it over and please don't rely on knee-jerk reactions.

I am still eager to hear of counter-examples where this convention fails for reasonable use cases.

Rule 1: avoid unpacking arguments, such as f(*seqs). If you have seqs, then just pass it directly as f(seqs).

mrocklin commented 11 years ago

Rule 1: avoid unpacking arguments, such as f(*seqs). If you have seqs, then just pass it directly as f(seqs).

Agreed. But I think that this will be non-intuitive for novice users. I remember being frustrated by numpy.ones((n, m)) rather than numpy.ones(n, m).

eriknw commented 11 years ago

Speaking as a recent novice user, I was confused why there were so many different conventions. It made learning the library more difficult. Heh, even though I listed the functions for the different conventions above, I still don't know them.

The issue I have with offering two versions of every function is it clutters the namespace. A clean namespace is actually really, really nice.

eriknw commented 11 years ago

Agreed. But I think that this will be non-intuitive for novice users. I remember being frustrated by numpy.ones((n, m)) rather than numpy.ones(n, m)

I don't intend to be dense, but I don't see how this example is relevant. You are not unpacking any values. Per the proposed convention, both f((n, m)) and f(n, m) would work exactly as intended. Heh, and as you said, you were frustrated that this didn't work in numpy.
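To illustrate the point, here is a hypothetical shape-taking `ones` (not numpy's API) under a dispatch in the spirit of the proposed convention, where both forms work:

```python
def ones(*shape):
    # A lone list/tuple argument is taken to be the whole shape:
    # the non-variadic form.
    if len(shape) == 1 and isinstance(shape[0], (list, tuple)):
        shape = shape[0]
    if not shape:
        return 1.0
    n, *rest = shape
    return [ones(*rest) for _ in range(n)]

# Both calling styles give the same result, unlike numpy.ones.
assert ones(2, 3) == ones((2, 3)) == [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
```

(For numbers, unlike iterables, an isinstance check suffices, so the dispatch here is simpler than the itertoolz case.)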

eriknw commented 11 years ago

There is actually only one rule: A single sequence of iterables may be passed in place of the variadic arguments.

I bet we could make this even shorter and clearer, and it is becoming short enough to add to docstrings.

eriknw commented 10 years ago

Two more negatives would occur by using this convention:

  1. it would become more difficult for users to mimic the API of toolz
  2. it would become more difficult for users to extract functions from toolz

(1) is particularly important, because I believe users (including us) are expected to construct their own functions that are narrowly suitable for their specific problems.

eriknw commented 10 years ago

The metafunc package (here and #88) is progressing nicely (except for the documentation, which will be my next focus). I will soon create a proof-of-concept branch for toolz that will generate a tooled package from which higher-order functions may be chosen--and chained together--using import syntax. For example, from tooled.curried.traced import *. It will also be easy for users to add their own higher- and first-order functions.

Regarding variadic conventions, I intend to have higher-order functions that specify which variadic convention to use. I don't know how well this will play out, but I'm curious to try. One possible outcome is a separate tooled project that depends on toolz, metafunc, and any other required or optional package used for higher-order functions. If you have thoughts on any of this, please share (and know that I don't offend easily; you guys haven't come close yet to offending me!).

mrocklin commented 10 years ago

I'm not particularly concerned about offending you, both because I lack tact and because I don't view this idea as offensive.

I like the idea behind this work. I like that it is happening outside of toolz (it feels orthogonal and magical). I also like that it is happening (it feels useful).

Making a separate package seems a little unscalable. Is there a way to include the package-name as an input to the import process?

from metafunc.toolz.curry.trace import *

I have no experience behind import magic so I'm not sure how helpful I'll be here.

eriknw commented 10 years ago

I'm not particularly concerned about offending you, both because I lack tact and because I don't view this idea as offensive.

Great!

Is there a way to include the package-name as an input to the import process?

As in metafunc doesn't depend on toolz and vice versa, but the import statement you gave would "just work"? It would be possible to add this functionality to metafunc by requiring a package to have private variables that define which functions to use as higher- and first-order functions. I'll consider this, but my gut reaction is that I prefer explicit to implicit.

Actually, your example doesn't separate hofs and fofs, because it just chains together existing functions, so no private variables would be necessary. metafunc currently has a strict distinction between hofs and fofs such that the hof packages are not callable. For example, metafunc.toolz.curry(f) is not currently allowed. It would be easy to allow this behavior, but my current thinking has been to separate hofs and fofs, and to make the so-called magic as straightforward and well-behaved as possible.

Thanks for the question; it is definitely providing food for thought.