msbutler closed this issue 3 years ago
We went through the code that was submitted and it looks great! We have a few questions about the body of the Feedforward class given below:
[ ] We're assuming that the input is strictly 2D (multiple samples from multiple variables) or 1D (just one variable).
```python
if len(x.shape) == 2:
    assert x.shape[0] == D_in
    x = x.reshape((1, D_in, -1))
else:
    assert x.shape[1] == D_in
```
In the above code block, why do we index the second element of `x.shape` in the else branch? If `x` isn't 2D, then I assume it must be 1D.
[ ] In the definition of `objective()` within `make_objective()`, is `squared_error` just a scalar and not an array? Taking the 2-norm of our true y values and predicted y values should just give a scalar. Are we missing something, given that `np.sum` and `np.mean` are used in the if/else statement?
[ ] In the `fit()` function, specifically in the for loop over the number of random restarts, why do we take the minimum of only the last 100 elements of the objective trace and not over the entire trace?
@vsavram sorry for slow response here! These are all good questions. My answers:
We reshape a 2D input array into a 3D array to speed things up in the forward method (i.e., it's easier to parallelize training). The forward method merely calculates the output given a set of inputs and weights. You will never need to change this method, so there's no need to understand its details. The only things you should know are the dimensions of `weights` (1 by D), `x` (1 by N), and `output` (1 by 1 by N; you can always flatten the output).
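To make the reshape step concrete, here is a minimal sketch with made-up values of `D_in` and `N` (the values are illustrative, not taken from the class):

```python
import numpy as np

D_in = 2   # number of input variables (illustrative)
N = 4      # number of samples (illustrative)

# A 2D input: D_in variables by N samples
x = np.arange(D_in * N).reshape(D_in, N)

if len(x.shape) == 2:
    assert x.shape[0] == D_in
    x = x.reshape((1, D_in, -1))   # add a leading axis for vectorized forward passes

print(x.shape)   # (1, 2, 4)
```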
`np.sum` and `np.mean` are used to please autograd, which automatically differentiates the objective function. autograd differentiates functions built from `autograd.numpy` operations (note that this code imports autograd's wrapper for numpy, not numpy itself). Sometimes you can get away with non-np operators, but it is safest to use np operations when designing objective functions.
I'm not totally confident in my answer, but my hunch is that this is another speed-up step. Suppose that after a random restart, the optimizer converges to a local min in 4000 steps. To actually calculate the local min, we don't really need to consider the first 3900 steps, because we know with very high probability that the local min was reached in the last 100 steps, logged in `self.objective_trace[-100:]`. Why not just evaluate the last step in the trace? Adam is stochastic (as are all other optimization algorithms for NNs), so it's not guaranteed that the last point in the trace is the min. @mecunha additional thoughts here?
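A toy sketch of the tail-of-trace minimum (the trace values and the standalone `objective_trace` variable here are fabricated for illustration; in the class the trace lives in `self.objective_trace`):

```python
import numpy as np

# Hypothetical objective trace from one random restart: 3900 steps of descent,
# then 100 steps bouncing around near the local min
objective_trace = np.concatenate([np.linspace(10.0, 1.0, 3900),
                                  np.linspace(1.0, 0.5, 100)])

# Only the tail matters once the optimizer has converged; scanning the last
# 100 entries is cheaper than scanning all 4000 and finds the same min
local_min = np.min(objective_trace[-100:])
print(local_min)   # 0.5
```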
Adding on to @msbutler's response:
```python
squared_error = np.linalg.norm(y_train - self.forward(W, x_train), axis=1)**2
```
The `axis=1` argument to `np.linalg.norm` means it takes the norm over the second dimension of the array. Another way to think about it: if you have a matrix and you take the sum along axis=1 (the column dimension), you end up with an array where each element is the sum across the columns of one row. Here's an example:
```python
m = np.array([1, 1, 1, 1, 2, 2, 2, 2]).reshape(2, 4)
print('m: ', m)
print('sum: ', np.sum(m, axis=1))
```

returns:

```
m:  [[1 1 1 1]
 [2 2 2 2]]
sum:  [4 8]
```

We confirmed that my initial answer is correct. Closing this issue.
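Tying this back to the `squared_error` line: the same axis=1 idea with `np.linalg.norm` yields one squared error per row, so the result is an array that `np.sum` or `np.mean` then collapses to a scalar (the residual matrix below is made up for illustration):

```python
import numpy as np

# Pretend these are residuals y_train - forward(W, x_train): 2 rows, 4 columns
resid = np.array([[1.0, 1.0, 1.0, 1.0],
                  [2.0, 2.0, 2.0, 2.0]])

# norm along axis=1, then squared: one squared error per row, not a scalar
squared_error = np.linalg.norm(resid, axis=1) ** 2
print(squared_error)           # [ 4. 16.]
print(np.sum(squared_error))   # 20.0 — the scalar the objective actually uses
```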
@vsavram @jscuds My commit contains a scratch.ipynb file with a FeedForward class, used to fit a feed forward neural network. The first few cells in the notebook, before the NLM class definition, define the FeedForward class, instantiate a Neural Network, and fit it to dummy data. We'll use FF to calculate the weights for the n-1 layers in the LUNA model.
As we discussed, you two will be investigating ways to modify the weight-fitting algorithm by changing 1) the optimization algorithm and 2) the gradient calculation. The only code you may need to modify is the `FeedForward.fit()` method, which executes the weight fitting. The default optimizer is `adam` and the default gradient function is an autograd instantiation, `self.gradient`. You can modify the optimizer and the gradient function in the `params` dictionary, an input to the fit function.

As for next steps: 1) try to understand at a high level how the fit function works; 2) take a look at the adam package and investigate how one could pass a numerical gradient function instead of an autograd object; 3) I believe the adam optimizer is state of the art for fitting neural networks, but it could be worth exploring what other optimization algorithms people use to fit NNs.
General project management proposals: