rougier / from-python-to-numpy

An open-access book on numpy vectorization techniques, Nicolas P. Rougier, 2017
http://www.labri.fr/perso/nrougier/from-python-to-numpy
Other
2.05k stars 339 forks source link

2.1 Introduction - Simple Example, procedural example #43

Open tweigel-vaisala opened 7 years ago

tweigel-vaisala commented 7 years ago

The random_walk function is not a direct equivalent to the RandomWalker class method. A strict equivalent would be this:

def random_walk(n):
    position = 0
    for i in range(n):
        yield position
        position += 2*random.randint(0,1)-1

It is still not much faster, but it's a more fair comparison.

rougier commented 7 years ago

Yes, you're right but it does as you noticed it does not make a real difference.

datajanko commented 7 years ago

additionally, in def random_walk_faster(n=1000): from itertools import accumulate steps = random.sample((-1,+1)*n, k=n) return [0]+list(accumulate(steps)) ` the random numbers are not drawn from the same distribution as in the other variants. And, in the timing you also measure the import costs of accumulate. However, there is no huge difference in computation time

rougier commented 7 years ago

Yes, it has be noticed in #37 but the proposed solution require Python 3.6. However, from the documentation, the proposed sampling method should do a shuffle.

datajanko commented 7 years ago

Sorry, I didn't read the closed threads!

A list comprehension like

steps = [2* random.randint(0,1)-1 for _ in range(n)]

won't require python 3.6. You may even want to use a generator expression, which might be more efficient.

rougier commented 7 years ago

Oh nice, I like it. We need to benchmark it and if it faster or approximately the same, we can include it. Could you make a PR ?

datajanko commented 7 years ago

Benchmarks

def random_walk(n):
    position = 0
    walk = [position]
    for i in range(n):
        position += 2*random.randint(0, 1)-1
        walk.append(position)
    return walk
def random_walk_faster(n=1000):
    steps = random.sample((-1,+1)*n, k=n)
    return [0]+list(accumulate(steps))
def random_walk_faster_2(n=1000):
    steps = [2*random.randint(0,1)-1 for _ in range(n)]
    return [0]+list(accumulate(steps))
def random_walk_faster_3(n=1000):
    steps = (2*random.randint(0,1)-1 for _ in range(n))
    return [0]+list(accumulate(steps))

%%timeit
random_walk(100000)
10 loops, best of 3: 141 ms per loop

%%timeit 
random_walk_faster(100000)
10 loops, best of 3: 78.1 ms per loop

%%timeit
random_walk_faster_2(100000)
10 loops, best of 3: 135 ms per loop

%%timeit
random_walk_faster_3(100000)
10 loops, best of 3: 139 ms per loop

So it looks like, that using random.sample from a false distribution yields a huge speedup. The naive approach can compete with the list comprehension version.

So the problem here is essentially, that numpy improves two things. Sampling and aggregating. Maybe an example, that used the numpy sampling, but accumulate in aggregating might be a better solution. Unfortunately, the random.choices function only has been introduced in python 3.6. If it is faster than the list comprehension, using choices probably would be the best solution in terms of didactics

rougier commented 7 years ago

Sorry, in the end I see your PR but the problem now is that the proposed solution does not correspond to the text so it is not good. What exactly is wrong with the random.sample method ? I think documentation says in some circumstances (k=len(x)) it is equivalent to a shuffle, no ?

datajanko commented 7 years ago

Hey, that's what I feared. So why might the use of random.sample be a bad choice? Firstly, it is absolutely not clear why using sample will provide the same algorithm.

Secondly, you are reducing the variance/standard deviation by a huge amount. Though you approximate the same mean for large repetions, the distribution you'll use will be too narrow. This solution penalizes solutions that are far out too much.

I think, this is easiest seen for the case n=2. Using your sampling method: you find after the first draw a probability of 2/3 to move in the other direction. With the usual sampling the probability is still 1/2. Hence, your method, will have a probability of 2/3 to end in the center and probability 1/6 to end at two places left or right off the center. The other approach gives probabilities of 1/4, 1/2 and 1/4.

This difference in variance carries over to higher number of repetitions (and it is significant!)

So, one way would be to use np.randint() in all of the functions, maybe the best way would just be to comment on the issues with the variance and keep your solution.