vsavram / AM205-Project


Understand FeedForward.fit() #1

Closed msbutler closed 3 years ago

msbutler commented 3 years ago

@vsavram @jscuds My commit contains a scratch.ipynb file with a FeedForward class, which is used to fit a feed-forward neural network. The first few cells in the notebook, before the NLM class definition, define the FeedForward class, instantiate a neural network, and fit it to dummy data. We'll use FeedForward to calculate the weights for the n-1 layers in the LUNA model.

As we discussed, you two will be investigating ways to modify the weight-fitting algorithm by changing 1) the optimization algorithm and 2) the gradient calculation. The only code you may need to modify is the FeedForward.fit() method, which executes the weight fitting. The default optimizer is adam and the default gradient function is an autograd instantiation, self.gradient. You can modify both the optimizer and the gradient function through the params dictionary, an input to the fit function.
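
For reference, calling fit with a custom params dictionary looks roughly like the sketch below; the key names are illustrative, so check FeedForward.fit() in scratch.ipynb for the exact keys it reads.

```python
# Rough sketch only -- the params key names are illustrative; see FeedForward.fit()
# in scratch.ipynb for the exact keys it actually reads.
import autograd.numpy as np

x_train = np.linspace(-3, 3, 50).reshape(1, -1)   # dummy data, shape 1 x N
y_train = np.sin(x_train)

params = {
    'step_size': 1e-2,       # learning rate handed to the optimizer
    'max_iteration': 5000,   # number of optimization steps
    'optimizer': 'adam',     # swap this to change the optimization algorithm
    'gradient': None,        # None -> default autograd gradient (self.gradient)
}

# nn = FeedForward(...)                # instantiate as in the first cells of scratch.ipynb
# nn.fit(x_train, y_train, params)     # fit() reads the optimizer/gradient from params
```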

As for next steps:

  1. Try to understand at a high level how the fit function works.

  2. Take a look at the adam package and investigate how one could pass a numerical gradient function instead of an autograd object (a rough sketch is below).

  3. I believe the adam optimizer is state of the art for fitting neural networks, but it could be worth exploring what other optimization algorithms people use to fit NNs.
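
For point 2, something roughly like the central-difference gradient below would work with autograd.misc.optimizers.adam, which calls the gradient function as gradient(weights, iteration). The objective here is just a stand-in for the FeedForward objective.

```python
# Central-difference numerical gradient with the (weights, iteration) call
# signature that autograd.misc.optimizers.adam expects.
import autograd.numpy as np
from autograd.misc.optimizers import adam

def make_numerical_gradient(objective, h=1e-5):
    def gradient(w, i):
        g = np.zeros_like(w)
        for j in range(w.size):
            e = np.zeros_like(w)
            e.flat[j] = h
            # central-difference approximation of d objective / d w_j
            g.flat[j] = (objective(w + e, i) - objective(w - e, i)) / (2 * h)
        return g
    return gradient

# toy usage: minimize a simple quadratic, objective(w, i) -> scalar
objective = lambda w, i: np.sum((w - 3.0) ** 2)
w_opt = adam(make_numerical_gradient(objective), np.zeros(5),
             step_size=0.1, num_iters=500)
print(w_opt)   # should be close to [3, 3, 3, 3, 3]
```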

General project management proposals:

vsavram commented 3 years ago

We went through the code that was submitted and it looks great! We have a few questions pertaining to the body of the FeedForward class, given below:

msbutler commented 3 years ago

@vsavram sorry for slow response here! These are all good questions. My answers:

  1. We reshape the 2D input array into a 3D array to speed things up in the forward method (i.e. it is easier to parallelize training). The forward method merely calculates the output given a set of inputs and weights. You will never need to change this method, so there's no need to understand its details. The only things you need to know are the dimensions of the weights (1 by D), x (1 by N), and the output (1 by 1 by N; you can always flatten the output).

  2. np.sum and np.mean are used to please autograd, which automatically differentiates the objective function. autograd likes to differentiate functions built from autograd.numpy operations (note that this code uses autograd's wrapper around numpy, not numpy itself). Sometimes you can get away with using non-np operations, but it is safest to stick to np operations when designing objective functions (see the short example after this list).

  3. I'm not totally confident in my answer, but my hunch is that this is another speed-up step. Suppose that after a random restart the optimizer converges to a local min in 4000 steps. To actually find the local min, we don't really need to consider the first 3900 steps, because we know with 99.99% probability that the local min was reached in the last 100 steps, which are logged in self.objective_trace[-100:]. Why not just evaluate the last step in the trace? Adam is stochastic (as are all the other optimization algorithms for NNs), so it's not guaranteed that the last point in the trace is the min. @mecunha additional thoughts here?
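
To make point 2 concrete, here is a minimal, self-contained toy example; the objective is a generic squared-error stand-in, not the exact one used in FeedForward.

```python
# Build the objective from autograd.numpy operations so grad() can trace it.
import autograd.numpy as np
from autograd import grad

x_train = np.linspace(-1, 1, 20).reshape(1, -1)
y_train = 2.0 * x_train

def objective(w, i=0):
    pred = w * x_train                                   # "forward pass" of a 1-parameter model
    return np.mean(np.sum((y_train - pred) ** 2, axis=0))  # sum over output dims, mean over data

gradient = grad(objective)        # the autograd instantiation, like self.gradient
print(gradient(0.5))              # exact derivative of the objective at w = 0.5
```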

mecunha commented 3 years ago

Adding on to @msbutler's response:

  1. The squared_error line does return an array because the norm is taken along an axis:

     squared_error = np.linalg.norm(y_train - self.forward(W, x_train), axis=1)**2

     The axis=1 argument to np.linalg.norm means it takes the norm over the second dimension of the array. Another way to think about it: if you have a matrix and you take the sum along axis=1 (the column dimension), you end up with an array where each element is the sum across the columns of the corresponding row. Here's an example:

     m = np.array([1,1,1,1,2,2,2,2]).reshape(2,4)
     print('m: ', m)
     print('sum: ', np.sum(m, axis=1))

     returns:

     m:  [[1 1 1 1]
      [2 2 2 2]]
     sum:  [4 8]
  2. I'm also not 100% sure, but I agree with Michael's intuition: the minimum of the trace is very likely to be in the last 100 or so elements, so it's more efficient to calculate the min across those 100 elements than across the entire trace. Even if the true minimum is outside of this range, we'll likely obtain something very, very close to it in the last 100 elements, assuming convergence (a rough sketch of the idea is below). @msbutler perhaps we can ask Weiwei to confirm this understanding?
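
Roughly, the idea looks like this; the objective_trace / weight_trace names are just for illustration, not the exact attributes in FeedForward.

```python
# Pick the best weights from the tail of the optimization trace rather than
# blindly taking the final step (adam is stochastic, so the last step is not
# guaranteed to be the lowest point of the trace).
import numpy as np

def best_from_tail(objective_trace, weight_trace, tail=100):
    tail_objectives = np.asarray(objective_trace[-tail:])
    tail_weights = weight_trace[-tail:]
    best_idx = np.argmin(tail_objectives)     # lowest objective in the last `tail` steps
    return tail_weights[best_idx], tail_objectives[best_idx]
```
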
msbutler commented 3 years ago

We confirmed that my initial answer is correct. Closing this issue.