neonwatty / machine_learning_refined

Notes, examples, and Python demos for the 2nd edition of the textbook "Machine Learning Refined" (published by Cambridge University Press).

Exercise 2.1 - Uniform sampling #29

Closed theundergroundsorcerer closed 1 year ago

theundergroundsorcerer commented 1 year ago

I couldn't find any reference in the errata, so maybe it is not an error - but how does one sample $K$ points from $[-1,1]^N$ when $K$ is not an $N$-th power? From reading the text and general knowledge, I thought that uniform sampling meant creating an $N$-dimensional grid of evenly spaced points - but this is obviously not possible for arbitrary $K$ once the dimension is larger than $1$ or $2$, let alone when $N=100$. Does it mean something else? If it does - could it be specified explicitly in the errata?

neonwatty commented 1 year ago

Great question! Let me know if I'm tracing out your thinking correctly.

In this context sampling on [-1,1] means choosing any real point on the interval starting at -1 and ending at 1, and the 'uniform' part means producing this sample with respect to the uniform distribution on that interval. A (randomly) chosen point could be any real value between -1 and 1 inclusive (e.g., -0.23218, 0.38482, ...).

The N dimensional hypercube

[-1,1] X [-1,1] X ... X [-1,1]

is the higher-dimensional analog of the interval [-1,1]. Choosing a point in the cube means selecting a real-valued N-dimensional vector like

[0.2322, -0.7327,...,0.998]

To do this uniformly just means to sample points like this according to a uniform distribution. Because the desired distribution is uniform over the cube, you can just sample each coordinate independently according to a uniform distribution on [-1,1]. In other words, to produce an N-dimensional point chosen uniformly from the hypercube described above, you can create an N-length array where each coordinate is sampled uniformly on the interval [-1,1].

Here a phrase like "sample $K$ points uniformly from $[-1,1]^N$" means to do this $K$ times - producing $K$ such vectors.
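For concreteness, here's a minimal NumPy sketch of this procedure (the values of K and N below are just illustrative):

import numpy as np

K = 5    # number of points to sample (illustrative value)
N = 100  # input dimension (illustrative value)

# each row is one point drawn uniformly at random from the hypercube [-1,1]^N
samples = np.random.uniform(low=-1, high=1, size=(K, N))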

Does that make sense / did I interpret you correctly?

theundergroundsorcerer commented 1 year ago

Thank you for your reply.

I am not sure that I accept your explanation, because it doesn't match the description of the uniform sampling strategy in the text, and it is reasonable to assume that the definitions in the text and in the exercises coincide unless explicitly stated otherwise.

What do you mean here by "random", then? Other than that, on page 24, when discussing global optimization, the last paragraph says:

We can take two approaches here to choosing our (finite) set of input points to test: either sample (i.e., guess) them uniformly over an evenly spaced grid (uniform sampling), or pick the same number of input points at random (random sampling).

It doesn't mention a uniform distribution; rather, creating a grid of appropriate dimension (see the discussion in Section 2.3.1 - The curse of dimensionality) seems to match this interpretation of "uniform sampling" (unless I've got something wrong when reading it).

Other than that, if we accept your explanation, how should part (c) of the question be interpreted?

Part (c) of the question says:

(c) Repeat parts (a) and (b), this time replacing uniformly chosen samples with randomly selected ones.

The procedure you've just described for the "uniform" sampling of parts (a) and (b) is just what one would do when trying to generate a random vector in $[-1, 1]^N$. (I think that's what np.random.rand(1, dimension) does - it generates a vector whose $N$ coordinates are independently uniformly distributed on $[0, 1)$.)
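For reference, rescaling such samples from $[0, 1)$ to $[-1, 1)$ is a one-liner (N here is an illustrative dimension):

import numpy as np

N = 100  # illustrative input dimension
w = 2 * np.random.rand(1, N) - 1  # each coordinate uniform on [-1, 1)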

neonwatty commented 1 year ago

You're absolutely right - I apologize. The wording of this problem is confusing - after our conversation has converged I'll be updating the text and errata with better vocabulary.

In particular - we've used the phrase "uniform" too loosely here. We look at uniform grids (e.g., Figure 2.3) as a simple set of points over which we can evaluate a function in search of its minimum. The point we want to communicate with its usage here is that evaluating a function systematically - e.g., on a uniform grid - in search of minima / maxima very quickly becomes inefficient as the input dimension N increases.

We then try to contrast that systematic approach with a random one - sampling uniformly at random from the input domain of a function and evaluating said points, in search of minima / maxima. Our point here is that this search strategy also fails as the input dimension N increases.

So in short we've played too fast and loose with the phrase "uniform" and it's confusing. We have two naive optimization strategies:

  1. input sampling via a uniformly spaced discrete grid --> function evaluation (in search of minima/maxima)
  2. input sampling at random via a uniform distribution --> function evaluation (in search of minima/maxima)

In both cases the strategy of simply sampling and evaluating a function to echo-locate its minima / maxima fails very quickly as the input dimension N increases.
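For a sense of scale: with spacing d there are (2/d + 1) values per coordinate, so the full grid over [-1,1]^N contains (2/d + 1)^N points. A quick illustrative computation:

# number of grid points on [-1,1]^N with spacing d grows exponentially in N
d = 0.2
points_per_axis = int(round(2 / d)) + 1  # 11 values per coordinate for d = 0.2
for N in [1, 2, 5, 10, 100]:
    print(f"N = {N}: {points_per_axis ** N} grid points")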

OK - now onto the problem itself - 2.1 a)

Here we would like you to show yourself that strategy 1 above - sampling points off a uniform grid (even when the global minimum of the function lies on the grid itself) - is a poor strategy.

To do this we must first define a grid on [-1,1]^N - which involves choosing a distance parameter d (as illustrated in Figure 2.3). This is just the spacing between adjacent coordinate-wise gridlines. Such a distance parameter d implies a corresponding set of possible coordinate values. For example, if d = 0.2 the possible coordinate values (for each coordinate) are

[-1. , -0.8, -0.6, -0.4, -0.2, 0. , 0.2, 0.4, 0.6, 0.8, 1. ]

Of course, any point on the uniform grid defined by d = 0.2, in any input dimension N, has coordinates drawn from these values.

One way to sample an N-dimensional point from this grid is to place a discrete uniform distribution over these coordinate values and sample N times, constructing the entire input vector coordinate by coordinate, as illustrated below.

### for a given N, generate a random point on the uniform grid over [-1,1]^N where d = 0.2 ###
import numpy as np

N = 100  # input dimension (example value)

## we first set up a discrete distribution for sampling possible coordinate values
# create possible coordinate values - for the given settings --> [-1. , -0.8, -0.6, -0.4, -0.2,  0. ,  0.2,  0.4,  0.6,  0.8,  1. ]
d = 0.2
uniform_grid_vals = np.linspace(-1, 1, int((1 - (-1)) / d) + 1)

# create uniform discrete probabilities for selecting these coordinate grid values
probabilities = [1 / len(uniform_grid_vals) for i in range(len(uniform_grid_vals))]

# choose a random point on the N dimensional uniform grid
w = np.random.choice(uniform_grid_vals, N, p=probabilities)[:, np.newaxis]  # np.newaxis ensures our output is shaped Nx1
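(As an aside: np.random.choice samples uniformly by default when p is omitted, so the explicit probabilities above are just for clarity.)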

If we define our quadratic as follows

import numpy as np

def g(w):
    """Evaluate the quadratic g(w) = w^T w; w is an Nx1 numpy array, the output is a 1x1 numpy array."""
    return np.dot(w.T, w)

then question 2.1 a) involves evaluating this function for various values of N - for each N sampling and evaluating P times - and, for each N, reporting the minimum function value found. A rough end-to-end sketch is given below.
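Here P, the tested values of N, and d are illustrative choices rather than the exercise's prescribed settings:

import numpy as np

def g(w):
    """Quadratic g(w) = w^T w for an Nx1 numpy array w."""
    return np.dot(w.T, w)

d = 0.2   # grid spacing (illustrative)
P = 100   # number of samples per dimension (illustrative)
uniform_grid_vals = np.linspace(-1, 1, int(round(2 / d)) + 1)

for N in [1, 2, 5, 10, 100]:  # illustrative input dimensions
    evals = []
    for _ in range(P):
        # sample a random point from the uniform grid over [-1,1]^N
        w = np.random.choice(uniform_grid_vals, N)[:, np.newaxis]
        evals.append(g(w)[0, 0])
    print(f"N = {N}: minimum evaluation found = {min(evals):.4f}")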

Does this make more sense?

theundergroundsorcerer commented 1 year ago

It makes more sense now.

Thanks.