open-connectome-classes / StatConn-Spring-2015-Info

introductory material

Calculating the gradient #67


SandyaS72 commented 9 years ago

The principle of how gradient descent works makes sense, but it still relies on calculating a "gradient" at any point. When there's a closed-form representation of the function this is doable, but what about cases where the gradient is too complicated to calculate, or where you can't take a derivative at a particular location? Or when the original data doesn't have a nice functional representation? Basically: what are ways of numerically approximating the gradient that are sufficiently accurate and reliable, and what other methods are there for finding a max/min besides gradient descent in those cases?
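
(To make the question concrete: the naive thing I can imagine is a central-difference approximation like the sketch below, where the helper name is just for illustration — but when is something like this accurate and reliable enough?)

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of a black-box f at x.
    A standard fallback when no closed-form derivative is available."""
    x = np.asarray(x, dtype=float)
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h  # perturb one coordinate at a time
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Example: the gradient of f(x, y) = x^2 + 3y at (1, 2) is (2, 3).
print(fd_gradient(lambda v: v[0]**2 + 3 * v[1], [1.0, 2.0]))
```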

imichaelnorris commented 9 years ago

I worked in a lab that used an algorithm called Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [1] for calculating protein structures. We wanted an algorithm we could use when the gradient wasn't easy to calculate. Lots of labs use gradient descent for this because it's fast, but we wanted a more general-purpose method available for more complicated structures. We were able to get this algorithm to converge on what we deemed an "optimal solution": that is, it found the global minimum in our test dataset, just as gradient descent could when the data had a "pretty gradient."

In general, an objective may not have a nice functional representation, let alone its gradient. How best to approximate the gradient depends, in my opinion, on the function itself and on how close you need the approximation to be.

[1] http://en.wikipedia.org/wiki/CMA-ES
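
For a flavor of the idea, here is a toy evolution-strategy sketch in Python/NumPy: each generation it samples candidates around the current mean, keeps the best ones, and re-estimates the sampling covariance from them. This is a deliberately simplified stand-in for CMA-ES (no step-size control or evolution paths), and all names in it are illustrative:

```python
import numpy as np

def simple_es(f, x0, sigma=0.5, pop=20, elite=5, iters=200, seed=0):
    """Toy (mu, lambda)-style evolution strategy with the covariance
    re-estimated from the elite samples. A simplified sketch of the idea
    behind CMA-ES, not the full algorithm."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(x0, dtype=float)
    cov = np.eye(mean.size) * sigma**2
    for _ in range(iters):
        # Sample a population around the current mean.
        pts = rng.multivariate_normal(mean, cov, size=pop)
        # Rank by objective value; keep the best `elite` candidates.
        best = pts[np.argsort([f(p) for p in pts])[:elite]]
        # Move the mean toward the elites and re-estimate the covariance.
        mean = best.mean(axis=0)
        cov = np.cov(best, rowvar=False) + 1e-8 * np.eye(mean.size)
    return mean

# Example: minimize the Rosenbrock function without any gradient.
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
print(simple_es(rosen, x0=[-1.0, 1.0]))  # converges near (1, 1)
```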

imichaelnorris commented 9 years ago

Also, if you can collect enough data, you can always calculate a numerical gradient [1] of the data expressed as an array. This is useful if you don't know the closed form representation of the data (if there even is one). Then you can evaluate the numerical gradient at any point (within the range of your data).

[1] http://www.mathworks.com/help/matlab/ref/gradient.html
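
For instance, NumPy's `np.gradient` (the analogue of the MATLAB `gradient` linked above) computes central differences over an array. A minimal sketch, assuming the data were sampled on a known grid:

```python
import numpy as np

# Hypothetical sampled data: f(x) = sin(x) observed on a uniform grid.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Central-difference numerical gradient; passing the grid handles spacing.
dy = np.gradient(y, x)

# Compare against the true derivative cos(x): only small discretization error.
print(np.max(np.abs(dy - np.cos(x))))
```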

akim1 commented 9 years ago

For noisy data, one way of obtaining a derivative is to fit your data to an n-degree polynomial and determine its coefficients. This amounts to a simple matrix multiplication with some "weightings" (look up Savitzky-Golay filtering if you're interested).

Then you can calculate the derivative just by differentiating the fitted polynomial, whose coefficients you now know.
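
A minimal sketch using SciPy's `savgol_filter`, which performs this windowed polynomial fit and can return the fitted polynomial's derivative directly (the noise level and window size here are made up for illustration):

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical noisy samples of f(x) = sin(x) on a uniform grid.
x = np.linspace(0, 2 * np.pi, 200)
dx = x[1] - x[0]
y = np.sin(x) + 0.05 * np.random.default_rng(0).normal(size=x.size)

# Savitzky-Golay: fit a cubic to each 21-point window and evaluate the
# fitted polynomial's first derivative at the window's center point.
dy = savgol_filter(y, window_length=21, polyorder=3, deriv=1, delta=dx)

print(np.max(np.abs(dy - np.cos(x))))  # close to the true derivative
```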

maxcollard commented 9 years ago

It's worth noting that for a polynomial fit, the complexity of the problem is profoundly influenced by the dimensionality d of the input to your objective function: even a quadratic fit about a point has O(d^2) coefficients.

Now consider the problem of sampling the gradient in many directions about a point. We want enough samples to have at least one per e units of area on the unit sphere in R^d; i.e., we need k = A(d) / e samples, where A(d) is the surface area of the unit sphere in R^d. Since A(d) = d pi^(d/2) / Gamma(d/2 + 1), which tends to 0 as d -> infinity [1], for high-dimensional inputs it seems as though it would be much more efficient to sample the gradient than to do even a low-order polynomial fit.

[1] http://www.wolframalpha.com/input/?i=plot+d*pi^%28d%2F2%29+%2F+Gamma%28d%2F2+%2B+1%29+d+from+0+to+30
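
A quick back-of-envelope check of that comparison in Python (purely illustrative; `scipy.special.gamma` evaluates the Gamma function):

```python
import numpy as np
from scipy.special import gamma

# Compare the O(d^2) coefficient count of a quadratic fit with the
# surface area A(d) = d * pi^(d/2) / Gamma(d/2 + 1) of the unit sphere
# in R^d, which controls how many directional samples are needed at a
# fixed sampling density e.
for d in [2, 5, 10, 20, 30]:
    area = d * np.pi ** (d / 2) / gamma(d / 2 + 1)
    quad_coeffs = (d + 1) * (d + 2) // 2  # 1 + d + d(d+1)/2 terms
    print(f"d={d:2d}  A(d)={area:10.3e}  quadratic coefficients={quad_coeffs}")
```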

akim1 commented 9 years ago

I was oversimplifying the Savitzky-Golay filter.

In general, the polynomial isn't fit to the whole dataset but to a sliding window: a single point and a handful of its neighbors, and that point is adjusted based on the local fit. Secondly, polynomials have the nice property of preserving higher-order derivative information.

The simple gradient sampling you're talking about is essentially polynomial fitting in the linear case.

maxcollard commented 9 years ago

True: if you have higher-order derivatives, you can do Newton-Raphson on the estimated gradient, which is probably "nice" in some more precise sense.
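
For concreteness, a minimal sketch of that idea: Newton-Raphson applied to the gradient, where `grad` and `hess` are hypothetical stand-ins that in practice would come from local polynomial fits or finite differences:

```python
import numpy as np

def newton_minimize(grad, hess, x0, iters=50, tol=1e-10):
    """Newton-Raphson on the gradient: solve grad(x) = 0 using
    second-order (Hessian) information."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Newton step: x <- x - H^{-1} g
        x = x - np.linalg.solve(hess(x), g)
    return x

# Hypothetical example: f(x, y) = (x - 1)^2 + 4 * (y + 2)^2.
grad = lambda x: np.array([2 * (x[0] - 1), 8 * (x[1] + 2)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 8.0]])
print(newton_minimize(grad, hess, [5.0, 5.0]))  # -> approx [1, -2]
```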