Open eriknw opened 11 years ago
This is a nice writeup and brings up the central issues.
Another failing example
# Straightforward
>>> data = [['abc', 'def'], ['hij', 'lmn']]
>>> concat(data)
['abc', 'def', 'hij', 'lmn']
# Issue arises
>>> data = [['abc', 'def']]
>>> concat(data)
['a', 'b', 'c', 'd', 'e', 'f'] # Got
['abc', 'def'] # Wanted
The alternative taken by `concat` and `concatv` is to separate each function into two functions, one that takes a single iterable (`def f(seqs)`) and one that takes several iterables (`def fv(*seqs)`). This is a more explicit but less convenient solution.
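For concreteness, here is a minimal sketch of that two-function split (simplified; the real `toolz` implementations may differ in details):

```python
import itertools

def concat(seqs):
    """Non-variadic: expects a single iterable of iterables."""
    return itertools.chain.from_iterable(seqs)

def concatv(*seqs):
    """Variadic: each iterable is passed as a separate argument."""
    return concat(seqs)

# list(concat([['abc', 'def']]))   -> ['abc', 'def']
# list(concatv('abc', 'def'))      -> ['a', 'b', 'c', 'd', 'e', 'f']
```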
Do you have any thoughts on this @eigenhombre ?
In Matt's example, `concat(data)` could also just be `[['abc', 'def'], ['hij', 'lmn']]` if `concat`'s args were variadic, à la Clojure. It seems reasonable to me to call `concat` on a single, nested data structure and to want it to be a NOP (e.g., the terminating case of a recursive function).
My gut instinct is that we're trying to make things easier[1] here, at the cost of adding complexity. And don't forget the Zen of Python: "explicit is better than implicit."
These things are also confusing in the Clojure API. See, e.g., this discussion[2] about a queue creation function. I think in general, you have to know what arguments the function expects. Consistency in the API certainly helps.
[1] http://www.infoq.com/presentations/Simple-Made-Easy
[2] http://dev.clojure.org/jira/browse/CLJ-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#issue-tabs
@mrocklin, I don't think your failing example actually fails. I'll demonstrate:
>>> import itertools
>>>
>>> def concat(*seqs):
...     try:
...         if len(seqs) == 1 and iter(seqs[0][0]):
...             seqs = seqs[0]
...     except TypeError:
...         pass
...     return itertools.chain.from_iterable(seqs)
...
>>> data1 = [['abc', 'def'], ['hij', 'lmn']]
>>> data2 = [['abc', 'def']]
>>> list(concat(data1))
['abc', 'def', 'hij', 'lmn']
>>> list(concat(data2))
['abc', 'def']
Continuing the above, it only fails if one unpacks the arguments, which is unnecessary:
>>> list(concat(*data1))
['abc', 'def', 'hij', 'lmn']
>>> list(concat(*data2))
['a', 'b', 'c', 'd', 'e', 'f']
This will still work if we provide iterators instead of containers:
>>> data3 = [iter(['abc', 'def'])] # like `data2` above, but with an iterator
>>> list(concat(*data3))
['abc', 'def']
The current API is complicated. Virtually all types of conventions are used:
Oh, the above doesn't include `functoolz` or `itertools`.
I think it is easier to learn and use a single convention instead of learning the specifics of a dozen functions. Plus, more functions will certainly be added to `toolz`. How does one choose which of the above conventions to use? Why was the current API chosen in the first place?
A pure convention would be for all functions to be either variadic or non-variadic. But this is not always practical.
I am afraid the extent of my analysis above may give the impression that the proposed convention is too complex. It is not. I did, however, want to be complete and honest in my first presentation of it. There is no urgency to this issue; please take your time to think it over and please don't rely on knee-jerk reactions.
I am still eager to hear of counter-examples where this convention fails for reasonable use cases.
Rule 1: avoid unpacking arguments, such as `f(*seqs)`. If you have `seqs`, then just pass it directly as `f(seqs)`.
> Rule 1: avoid unpacking arguments, such as `f(*seqs)`. If you have `seqs`, then just pass it directly as `f(seqs)`.
Agreed. But I think that this will be non-intuitive for novice users. I remember being frustrated by `numpy.ones((n, m))` rather than `numpy.ones(n, m)`.
Speaking as a recently novice user, I was confused why there were so many different conventions. It made learning the library more difficult. Heh, even though I listed above the functions for different conventions, I still don't know them.
The issue I have with offering two versions of every function is it clutters the namespace. A clean namespace is actually really, really nice.
> Agreed. But I think that this will be non-intuitive for novice users. I remember being frustrated by `numpy.ones((n, m))` rather than `numpy.ones(n, m)`.
I don't intend to be dense, but I don't see how this example is relevant. You are not unpacking any values. Per the proposed convention, both `f((n, m))` and `f(n, m)` would work exactly as intended. Heh, and as you said, you were frustrated that this didn't work in numpy.
There is actually only one rule: A single sequence of iterables may be passed in place of the variadic arguments.
I bet we could make this even shorter and more clear, and it is becoming short enough to add to docstrings.
Two more negatives would occur by using this convention:
toolz
toolz
(1) is particularly important, because I believe users (including us) are expected to construct their own functions that are narrowly suitable for their specific problems.
The `metafunc` package (here and #88) is progressing nicely (except for the documentation, which will be my next focus). I will soon create a proof-of-concept branch for `toolz` that will generate a `tooled` package from which higher-order functions may be chosen--and chained together--using import syntax. For example, `from tooled.curried.traced import *`. It will also be easy for users to add their own higher- and first-order functions.
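To illustrate the kind of chaining that import syntax is meant to express (all names here are hypothetical; `tooled` doesn't exist yet, and `traced` is just a toy stand-in):

```python
from functools import wraps
from toolz import curry

def traced(func):
    """Toy 'traced' higher-order function: print each call before delegating."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        print("calling", getattr(func, "__name__", func), args, kwargs)
        return func(*args, **kwargs)
    return wrapper

def add(x, y):
    return x + y

# Conceptually, `from tooled.curried.traced import *` would rebind every
# first-order function the way `add` is rebound here:
add = traced(curry(add))
add(1)(2)  # prints the call, then returns 3
```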
Regarding variadic conventions, I intend to have higher-order functions that specify which variadic convention to use. I don't know how well this will play out, but I'm curious to try. One possible outcome is a separate `tooled` project that depends on `toolz`, `metafunc`, and any other required or optional package used for higher-order functions. If you have thoughts on any of this, please share (and know that I don't offend easily; you guys haven't come close yet to offending me!).
I'm not particularly concerned about offending you, both because I lack tact and because I don't view this idea as offensive.
I like the idea behind this work. I like that it is happening outside of toolz (it feels orthogonal and magical). I also like that it is happening (it feels useful).
Making a separate package seems a little unscalable. Is there a way to include the package-name as an input to the import process?
from metafunc.toolz.curry.trace import *
I have no experience behind import magic so I'm not sure how helpful I'll be here.
> I'm not particularly concerned about offending you, both because I lack tact and because I don't view this idea as offensive.
Great!
> Is there a way to include the package-name as an input to the import process?
As in `metafunc` doesn't depend on `toolz` and vice versa, but the import statement you gave would "just work"? It would be possible to add this functionality to `metafunc` by requiring a package to have private variables to define which functions to use as higher- and first-order functions. I'll consider this, but my gut reaction is that I prefer explicit to implicit. Actually, your example doesn't separate hofs and fofs, because it just chains together existing functions, so no private variables would be necessary.

`metafunc` currently has a strict distinction between hofs and fofs such that the hof packages are not callable. For example, `metafunc.toolz.curry(f)` is not currently allowed. It would be easy to allow this behavior, but my current thinking has been to separate hofs and fofs, and to make the so-called magic as straightforward and well-behaved as possible. Thanks for the question; it is definitely providing food for thought.
I have been thinking about variadic arguments with a goal of making the `itertoolz` API "just work" in a way that is user-friendly and intuitive. I am proposing a uniform API that allows functions to be used variadically and non-variadically by accepting a sequence container of iterables.

In `dicttoolz`, `merge` and `merge_with` accept dicts variadically as `*dicts`. This may be used variadically and non-variadically. For example, these are equivalent: `merge(d1, d2)` and `merge([d1, d2])`. This is very convenient. Furthermore, there can be no ambiguity over what is intended, because it checks input for `dict` type.

In `itertoolz`, we cannot check for types as in `dicttoolz`, so the same strategy of allowing variadic and non-variadic input can result in ambiguities. Ambiguities can lead to incorrect edge cases that stymie novices and experts alike. This is bad... right?

An API that follows several competing conventions is also bad. It makes the library harder to learn and use. When does a function accept multiple inputs variadically? Which ones require a sequence of iterables? Should there be duplicates of functions that handle inputs differently? Why does X do this and not Y? A unified, flexible API cannot completely avoid edge cases under all uses, but it can control them, and I believe the benefits are worth the costs.
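A minimal sketch of how that `dict` type check removes the ambiguity (an illustration, not the actual `dicttoolz` source):

```python
def merge(*dicts):
    # If the single argument is not itself a dict, assume it is a
    # collection of dicts and unpack it.
    if len(dicts) == 1 and not isinstance(dicts[0], dict):
        dicts = dicts[0]
    rv = {}
    for d in dicts:
        rv.update(d)
    return rv

# These are equivalent:
#   merge({'a': 1}, {'b': 2})    -> {'a': 1, 'b': 2}
#   merge([{'a': 1}, {'b': 2}])  -> {'a': 1, 'b': 2}
```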
First, a summary. Here is a list of `itertoolz` (and `itertools`) functions that accept variadic input of iterables: `merge_sorted`, `intersection`, `concatv` (`chain`, `map`, `zip`, `product`). Here are the functions that accept a sequence of iterables: `interleave`, `concat`, `mapcat` (`chain.from_iterable`, `starmap`).
TL;DNR: Read Here

The proposal: if the variadic argument is a single sequence container of iterables, then unpack the container and treat the iterables as arguments. (If we document this somewhere, somebody else should come up with better wording!) For example, `f([[1, 2], [3, 4]])` is the same as `f([1, 2], [3, 4])`. The simplest way to show this condition programmatically is with a short `try`/`except` check (sketched below). Note that we can add more conditionals to it in order to avoid the performance penalty of raising and catching exceptions.
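A sketch of that check, assuming it mirrors the `concat` demonstration earlier in this thread:

```python
import itertools

def concat(*seqs):
    try:
        # If exactly one argument was given and its first element is itself
        # iterable, treat that argument as a sequence container of iterables.
        if len(seqs) == 1 and iter(seqs[0][0]):
            seqs = seqs[0]
    except TypeError:
        # The argument (or its first element) was not indexable/iterable,
        # so leave the variadic arguments untouched.
        pass
    return itertools.chain.from_iterable(seqs)
```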
Behaviors:

Hence, the only failure is when the variadic form is attempted on a single nested container. For example, `f([1, 2])` will work, but `f([(1, 2)])` will not, because a single nested sequence will be interpreted as a sequence of iterables.

Here are more examples of sequences with a nested data type:

- `seqs1 = [[[1, 2]]]`: `f(seqs1)` works (2); `f(*seqs1)` fails
- `seqs2 = [iter([[1, 2]])]`: `f(seqs2)` works (1, 2); `f(*seqs2)` works (1)
- `seqs3 = [[[1, 2]], [[3, 4]]]`: `f(seqs3)` works (2, 3); `f(*seqs3)` works (3)

Hence, failures can easily be avoided by not expanding the arguments (i.e., the `*seqs` operation).
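To make those cases concrete, here is a doctest-style check using the `concat` sketch from above (results assume that sketch; `f` in the list stands for any function that follows the convention):

```python
>>> import itertools
>>> def concat(*seqs):
...     try:
...         if len(seqs) == 1 and iter(seqs[0][0]):
...             seqs = seqs[0]
...     except TypeError:
...         pass
...     return itertools.chain.from_iterable(seqs)
...
>>> seqs1 = [[[1, 2]]]
>>> list(concat(seqs1))     # works: one iterable whose only element is [1, 2]
[[1, 2]]
>>> list(concat(*seqs1))    # fails: [[1, 2]] is itself taken as a sequence of iterables
[1, 2]
>>> seqs2 = [iter([[1, 2]])]
>>> list(concat(seqs2))     # works
[[1, 2]]
>>> seqs2 = [iter([[1, 2]])] # recreate the iterator consumed above
>>> list(concat(*seqs2))    # works
[[1, 2]]
>>> seqs3 = [[[1, 2]], [[3, 4]]]
>>> list(concat(seqs3))     # works
[[1, 2], [3, 4]]
>>> list(concat(*seqs3))    # works
[[1, 2], [3, 4]]
```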
There is one final ambiguity: `f([])`. Is this an empty list, or no input? We can make the two cases equivalent.

We lose one more thing by using this convention: `seqs` cannot generally be an iterator of iterables, because we require it to be a sequence container of iterables. This is the same convention used in `dicttoolz.merge`, which accepts a sequence--but not an iterator--of dicts. Fortunately, an iterator of iterables can be easily handled as `f(list(seqs))` or `f(*seqs)`, although the latter version with unpacking has the same conditions as stated earlier. The reason for not supporting an iterator of iterables is simple: nested data is common, and we should support it as painlessly as possible without introducing weird edge cases. As long as one works with iterators over data (or does not use the variadic form, or explicitly uses the variadic form with more than one input), this convention is guaranteed to work regardless of data type.

I don't know about you guys, but this convention would "just work" for virtually everything I do. However, I don't doubt that edge cases can arise, and my workflow is not necessarily the same as yours or other users'. I don't claim to have thought about everything. I welcome discussion and counter-examples.