numpy / numpy

The fundamental package for scientific computing with Python.
https://numpy.org

ENH: equivalent to MATLAB's colon operator #16124

Open allefeld opened 4 years ago

allefeld commented 4 years ago

Coming to Python/NumPy from Matlab, I'm missing something like Matlab's three-argument colon-operator:

>> 0:0.1:1
ans =
  Columns 1 through 3
                         0                       0.1                       0.2
  Columns 4 through 6
                       0.3                       0.4                       0.5
  Columns 7 through 9
                       0.6                       0.7                       0.8
  Columns 10 through 11
                       0.9                         1

In Matlab, I used this operator routinely. It is especially useful when creating a range of values for plotting a function, e.g. x = -1:0.1:1; plot(x, x.^2).

NumPy's alternatives for this are linspace and arange. However, the former makes it hard to hit specific values (in the plotting example, I have to figure out that there are 21 values), and the latter excludes the stop value. I get that arange follows the behavior of range, but for the mentioned application, which has nothing to do with indexing, it is just impractical.
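To make the two trade-offs concrete, a small illustration of the current options (nothing beyond stock NumPy here):

```python
import numpy as np

# arange excludes the stop value, so stepping by 0.1 over [0, 1]
# stops at 0.9:
a = np.arange(0, 1, 0.1)
print(len(a), a[-1])        # 10 points, last one below 1

# linspace includes the endpoint, but the step has to be converted
# into a point count by hand: 21 points for [-1, 1] in steps of 0.1
x = np.linspace(-1, 1, 21)
print(x[0], x[-1])          # endpoints hit exactly
```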

I also understand that a naive implementation of the colon operator runs into floating point problems. However, in many years of using Matlab I never ran into a situation where it didn't do the expected thing. I now managed to find an archived blog post that explains part of their implementation, with an attached complete implementation. I tested it, and as far as I can see it does exactly what the colon operator does.

Would you be open to including an implementation of this algorithm in NumPy? It could be called colon to make it easy for people switching, or irange where the i stands for "inclusive". For example with a signature irange(start, stop, step=1).

rossbar commented 4 years ago

There are mgrid and ogrid that create arrays using "slice-like" syntax, i.e. [start:end:step], though they still exclude the endpoint, so they are not quite the same:

>>> np.mgrid[0:1:0.1]
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

AFAIK there is no option to change whether the endpoint is included, nor any special attention paid to floating-point error, so it's not exactly the same, but it exists, just FYI.

allefeld commented 4 years ago

There are mgrid and ogrid

Yes, they are basically multidimensional equivalents of arange, so they suffer from the same problem.

rossbar commented 4 years ago

so they suffer from the same problem.

Just to clarify - the problem you're referring to in this case is the fact that the endpoint is not included, correct? Or is it related to the syntax?

allefeld commented 4 years ago

Yes, exactly.

seberg commented 4 years ago

There is a fun, rarely known and maybe a bit ugly, hack: np.ogrid[1:2:10j] will give you linspace behaviour. Of course using that without a hardcoded step would be evil IMO, but if you must use shorthands... The other option is using np.ix_ as a shorthand for adding dimensions like ogrid, etc.
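Spelled out, the hack works because mgrid/ogrid interpret a complex "step" as a point count, with the stop value included:

```python
import numpy as np

# A complex step in ogrid/mgrid is read as a number of points, and
# the stop value becomes inclusive -- i.e. linspace behaviour:
g = np.ogrid[1:2:10j]
print(np.allclose(g, np.linspace(1, 2, 10)))   # True
```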

To be honest, I am not sure I like any of these options (mainly, whether I would want to teach them to new users). I think it would be nice to start discussing such an addition, or the proliferation of such syntax, on the mailing list though.

I did once have a PR to add np.linspace(..., step=0.1), but... Either that ends up super unsafe, or you need to be super strict about getting the step right, at which point it is only reasonable to use with a hardcoded step, not a calculated one. So I buried it, because I was not convinced it's worth the trouble and that enough need exists...

rossbar commented 4 years ago

Another option might be to add an endpoint kwarg to arange (there is one in linspace) to achieve the desired behavior, though I seem to remember comments from other discussions expressing dislike for using arange for this purpose anyway.
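For comparison, linspace's existing endpoint kwarg already gives the reverse direction, i.e. an inclusive-by-default function made exclusive:

```python
import numpy as np

# linspace(..., endpoint=False) mimics arange's exclusive behaviour:
# 10 points over [0, 1), step 0.1, stop excluded
a = np.linspace(0, 1, 10, endpoint=False)
b = np.arange(0, 1, 0.1)
print(np.allclose(a, b))   # True
```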

allefeld commented 4 years ago

@seberg

but if you must use shorthands

I would like to quote the Zen of Python: "explicit is better than implicit". It's not about a shorthand, but about writing what I mean. I don't really want 101 points, I want steps of 0.01.

Either that ends up super unsafe, or you need to be super strict about getting the step right, at which point it is only reasonable to use with a hardcoded step, not a calculated one.

I meant to have addressed that in the first comment. I agree it is hard to get right, but the good folks at Mathworks seem to have successfully done so.

The ogrid trick is interesting to know, but doesn't really address my request, because it is still "number of points"-based.

I don't like to subscribe to mailing lists because in my experience they are hard to manage (filters never work 100%); the issues interface is so much nicer. But if that is the way things proceed in NumPy, ok.

@rossbar, I would prefer not to add this as an option to arange or some other existing function, because internally it is not an extension of its functionality (even though it looks like that). The values from arange will not in general be numerically identical to the first n – 1 values from the quoted algorithm.

seberg commented 4 years ago

@allefeld, I am not too worried about getting the points right once you have figured out the number of steps (I assume linspace does that). But what do you do if I use 1:2:0.13, which just does not work from a linspace perspective, where the endpoint should be included (or things should be symmetric, e.g. for -1:1:0.13)?

allefeld commented 4 years ago

@seberg the intended behavior of the algorithm – in a world of infinite precision – is to include all 1 + 0.13 * i which are smaller than or equal to 2. Comment from the code:

%   Ideally, in exact arithmetic, for b > a and d > 0,
%   v = a:d:b should be the vector of length n+1 generated by
%   v = a + (0:n)*d where n = floor((b-a)/d).

In this case, the intended result is

1, 1.13, 1.26, 1.39, 1.52, 1.65, 1.78, 1.91

That means that if the step is an integer fraction of the interval length, the endpoint is one of the values, and otherwise it isn't.
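The exact-arithmetic rule quoted above can be checked with rational arithmetic (just an illustration of the quoted comment, not part of any proposal):

```python
from fractions import Fraction

# v = a:d:b should be a + (0..n)*d with n = floor((b-a)/d);
# here for 1:0.13:2, computed exactly with rationals:
a, d, b = Fraction(1), Fraction(13, 100), Fraction(2)
n = (b - a) // d                        # floor division, exact: 7
v = [a + i * d for i in range(n + 1)]
print([float(x) for x in v])
# [1.0, 1.13, 1.26, 1.39, 1.52, 1.65, 1.78, 1.91]
```

Since 0.13 is not an integer fraction of the interval length (100/13 is not an integer), the endpoint 2 is indeed not among the values.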


With floating-point arithmetic the problem is that, if the step is (intended as) an integer fraction of the interval length, e.g. 0.1, the value closest to 2 might end up slightly larger or slightly smaller than 2 due to rounding error. That is fixed by transforming the step into a number of points, n = round((b-a)/d).

The tricky bit is to figure out whether "the step is intended as an integer fraction of the interval length". The Mathworks algorithm decides that on the basis of whether the error at the end is smaller than

tol = 2.0*eps*max(abs(a),abs(b))

where eps is the difference between 1 and the next larger number that can be represented as a double precision floating point number.

We then still have the issue that the last value, in this case 1 + 0.1 * n, will not be exactly 2. The Mathworks algorithm fixes that by creating the first half of the values upwards from 1, and the second half downwards from 2. This makes the floating point error about half as large on average, and ensures that the values are symmetrically distributed over the interval.
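Putting the described pieces together, here is a rough Python sketch. The name irange comes from the proposal above; the body paraphrases the ideas described in this thread, not MathWorks' actual code:

```python
import numpy as np

def irange(start, stop, step=1.0):
    """Inclusive range, a sketch of the colon-style algorithm
    described above. Assumes stop > start and step > 0."""
    eps = np.finfo(float).eps
    tol = 2.0 * eps * max(abs(start), abs(stop))
    # Is the step intended as an integer fraction of the interval?
    n = int(round((stop - start) / step))
    if abs(start + n * step - stop) < tol:
        # Yes: include the endpoint. Build the first half upward from
        # start and the second half downward from stop, which halves
        # the average rounding error and keeps the values symmetric.
        k = n // 2
        lower = start + step * np.arange(k + 1)
        upper = stop - step * np.arange(n - k - 1, -1, -1)
        return np.concatenate([lower, upper])
    # No: plain a + (0:n)*d with n = floor((b-a)/d), endpoint excluded.
    n = int(np.floor((stop - start) / step))
    return start + step * np.arange(n + 1)
```

This is only a sketch to make the discussion concrete; a real implementation would also need to handle negative steps, a zero step, and empty ranges.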


Actually, I used the "non-integer fraction" behavior as a hack in Matlab when I didn't want to include the endpoint, e.g. 1:0.1:1.99. That is kind of the reverse ugliness, and something that in NumPy is as straightforward as it should be. In my mind, having arange and my proposed irange would be the best of both worlds.

seberg commented 4 years ago

Well, maybe @eric-wieser has a nice proposal (I think he was arguing this stuff before); this must have been discussed many times over the years.

I just do not think we can fix arange, so yes, irange is maybe the best of both worlds. It is not immediately clear to me that the best of both worlds is actually very good, but I tend to be pessimistic :). I.e., it likely has worse properties than linspace? Long story short: likely others will be convinced, and then I will be happy even if I am not, but I expect someone will have to flesh out the proposal a bit more and in either case write to the mailing list to get some support and get it rolling...

eric-wieser commented 4 years ago

I think we need to spend some effort collecting old bug reports on this, because we keep repeating ourselves in addressing them.

@allefeld, I think a lot of the responses you got here ended up being about the syntactic sugar of the matlab : operator, when really all you cared about was the behavior of the function colon(a, b, c). Sorry about that.