jaamestaay / M2R-Group-29

This repository complements the MATH50002 (M2R) research report and presentation, a compulsory second-year Mathematics module at Imperial College London. We introduced and explored neural networks and the techniques required to train them. We then discussed neural ODEs and their improved accuracy, before extending to neural CDEs.

Theory: Automatic Differentiation and Gradient Descent Techniques #2

Closed: jaamestaay closed this issue 3 months ago

jaamestaay commented 3 months ago

I covered two broad areas during my research this week:

  1. Automatic Differentiation

As we've seen in this year's Numerical Analysis module, dual numbers are a straightforward and computationally friendly way to calculate derivatives. The ideas covered in that module, however, barely scratch the surface of Automatic Differentiation.

1.1 Motivation for Automatic Differentiation

In a large neural network, the final mapping from the input vector to the output vector is a composition of many intermediate functions (linear maps, ReLU activations, etc.). Gradient descent techniques (elaborated below) require gradients across all layers, so it is paramount for algorithms to calculate derivatives efficiently and with a high level of accuracy. Divided differences incur discretisation and rounding errors that compound at scale, while computing symbolic expressions tends to be memory- and time-consuming. Automatic Differentiation provides an efficient and accurate method for calculating derivatives of algorithmically complex functions.

1.2 Automatic Differentiation Techniques

Many of these techniques were covered in the SANUM2024 labs provided. Without elaborating too much on the theory, we can find Julia implementations of the required functions (which use automatic differentiation under the hood) to aid our project (see Zygote.jl, etc.). Pullbacks play an especially important role in computing these gradients efficiently. There is a great paper by Michael Innes detailing how Static Single Assignment form is used to generalise gradient calculations and make them more efficient in the Zygote package.
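As a rough illustration (a minimal sketch, assuming Zygote.jl is installed; the function `f` below is a toy example, not something from our project), `gradient` returns derivatives directly, while `pullback` exposes the reverse pass that makes this efficient:

```julia
# Minimal sketch of computing gradients with Zygote.jl (assumes the package is installed).
using Zygote

f(x) = sin(x)^2 + 3x          # toy scalar function, purely illustrative

# gradient returns a tuple with one entry per argument of f.
df, = gradient(f, 1.0)        # ≈ 2*sin(1)*cos(1) + 3

# pullback exposes the reverse pass explicitly: `back` maps a cotangent of the
# output back to cotangents of the inputs.
y, back = pullback(f, 1.0)
df2, = back(1.0)              # same value as df
```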

1.3 Applications of Automatic Differentiation

I worked on two main applications of Automatic Differentiation. The first one (which unfortunately turned out to be a dead end) was the application of AD in solving ODEs. This involved expanding the proposed solution and the RHS of the differential equation as Taylor series about the initial value $t_0$, and equating terms to determine the coefficients of the Taylor series of the solution. I attempted to create a class of dual numbers to calculate the partial derivatives of the RHS, but ran into difficulties with both the implementation and the accuracy of the solution. This process is documented in autodiff.ipynb.
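To illustrate the underlying idea (a minimal sketch only, not the implementation attempted in autodiff.ipynb): a dual number stores a value together with a derivative, and overloaded arithmetic propagates both through a computation.

```julia
# Minimal dual-number sketch (illustrative; not the implementation in autodiff.ipynb).
struct Dual
    val::Float64   # function value
    der::Float64   # derivative carried alongside the value
end

import Base: +, *, sin

+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

# Differentiate g(t) = t * sin(t) at t = 2 by seeding der = 1.
g(t) = t * sin(t)
d = g(Dual(2.0, 1.0))
# d.val ≈ 2*sin(2), d.der ≈ sin(2) + 2*cos(2)
```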

The (perhaps obvious) other application is the calculation of gradients in the deep learning process. As mentioned in 1.1, we require efficient methods of calculating derivatives of expressions that are difficult to find symbolically. Automatic differentiation therefore proves very useful in gradient descent, as elaborated in 2 below. In hindsight, this is something I should have spent more time on, and I'll look into it more closely in the coming week.

  2. Gradient Descent

Given data $\{(x_i, y_i)\}_{i=1}^n$, a model $f_\theta(x)$ and a loss function $l(u, v)$, we want to find $\theta$ that minimises $L(\theta) = \sum_{i=1}^n l(f_\theta(x_i), y_i)$. We start with the basic form of gradient descent: iteratively updating $\theta_t$ via $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$, where $\eta$ is the learning rate. The calculation of $\nabla L(\theta_t)$ is the second application of AD mentioned above.
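A minimal sketch of this basic update rule (illustrative only; the linear model, toy data and learning rate below are made up, and Zygote.jl is assumed for the gradient):

```julia
# Minimal sketch of vanilla gradient descent on a least-squares loss.
# Model, data and hyperparameters are illustrative, not from the project.
using Zygote

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.1, 4.9, 7.2]                  # roughly y = 2x + 1

L(θ) = sum(abs2, θ[1] .* xs .+ θ[2] .- ys) # L(θ) = Σᵢ (f_θ(xᵢ) - yᵢ)²

θ = [0.0, 0.0]                             # θ_0
η = 0.01                                   # learning rate
for t in 1:500
    ∇L, = gradient(L, θ)                   # reverse-mode AD gives ∇L(θ_t)
    θ .-= η .* ∇L                          # θ_{t+1} = θ_t - η ∇L(θ_t)
end
# θ converges to roughly [2.04, 0.99] (slope ≈ 2, intercept ≈ 1)
```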

There are some inherent issues with this basic version of gradient descent, and several different methods have been developed to improve on it:

2.1 Adaptive Gradient Methods (AdaGrad, RMSProp, Adam, etc.)

Unfortunately I wasn't able to cover this in depth. Most adaptive gradient methods adapt the step size for each parameter using running statistics of past gradients, aiming to improve on the fixed learning rate of basic gradient descent while keeping memory usage and computational cost manageable. The literature suggests that these methods work better for networks with many hidden layers.
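For reference, a minimal sketch of an AdaGrad-style update (illustrative; the learning rate and ε below are common defaults, not values from the project): each parameter's step is scaled by the accumulated sum of its squared gradients.

```julia
# Minimal sketch of an AdaGrad-style parameter update (illustrative).
# g is the current gradient ∇L(θ); G accumulates squared gradients per parameter.
function adagrad_step!(θ, G, g; η = 0.1, ε = 1e-8)
    G .+= g .^ 2                      # accumulate squared gradients
    θ .-= η .* g ./ (sqrt.(G) .+ ε)   # per-parameter scaled step
    return θ
end

# Usage with the loss L and parameters θ from the sketch above:
# G = zeros(length(θ))
# for t in 1:500
#     g, = gradient(L, θ)
#     adagrad_step!(θ, G, g)
# end
```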