jnez71 / kalmaNN

Extended Kalman filter for training neural-networks
MIT License

How is the Jacobian Calculated? #9

Closed Melkeydev closed 4 years ago

Melkeydev commented 4 years ago

Hi,

I have been working with this for quite some time, trying to replicate it for my own thesis, but I am having some trouble understanding how the Jacobian is calculated.

Can someone share some insight, or point me to where I can read/learn about this? It seems the H Jacobian is built from some combination of stacks?

Lost - any help would be appreciated

jnez71 commented 4 years ago

Back when I implemented the computation of H, it was 4AM during finals week at university, my brain went into overdrive and I somehow banged out the needed tensor-like arithmetic. Sadly, the paper skips detailing this computation. The full derivation is lost forever on some scratch paper in a landfill now probably :(

I advise against trying to understand this line lol

There is hope though. See that H is the Jacobian of the network's output ("sensor") with respect to the parameters ("state"), and know that we now live in an era where there is no need to ever compute derivatives by hand again, thanks to automatic differentiation. Use a library like autograd to compute the needed Jacobian.
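For example, here is a minimal sketch (not the code in this repo) of using autograd to get H for a single-hidden-layer network. The layer sizes, tanh activation, and parameter packing are just illustrative assumptions:

```python
# Minimal sketch (not this repo's code): use autograd to get H, the Jacobian of
# a single-hidden-layer network's output ("sensor") w.r.t. its flattened
# parameters ("state"). Layer sizes, tanh activation, and the packing scheme
# are assumptions for illustration.
import autograd.numpy as np
from autograd import jacobian

n_in, n_hid, n_out = 3, 8, 2

def unpack(w):
    # Split the flat parameter vector into layer weights and biases
    i = 0
    W1 = w[i:i + n_hid*n_in].reshape(n_hid, n_in); i += n_hid*n_in
    b1 = w[i:i + n_hid]; i += n_hid
    W2 = w[i:i + n_out*n_hid].reshape(n_out, n_hid); i += n_out*n_hid
    b2 = w[i:i + n_out]
    return W1, b1, W2, b2

def net(w, x):
    # Single hidden layer: tanh hidden activation, linear output
    W1, b1, W2, b2 = unpack(w)
    return np.dot(W2, np.tanh(np.dot(W1, x) + b1)) + b2

w = np.random.randn(n_hid*n_in + n_hid + n_out*n_hid + n_out)
x = np.random.randn(n_in)

H = jacobian(net)(w, x)   # differentiates w.r.t. the first argument by default
print(H.shape)            # (n_out, len(w))
```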

You will notice that kalmaNN is limited to a single-hidden-layer architecture. That is because of how much my head was going to explode trying to figure out H for anything more complicated. If I had known about automatic differentiation at the time, I could have made this work for any architecture and turned it into a class for training any regression model via EKF. At that point, it's kind of like adding "EKF" (a glorified way of saying "Newton's method") to the list of optimizers in an automatic differentiation framework like TensorFlow.

Melkeydev commented 4 years ago

@jnez71 That was one great story! I appreciate it haha!

It almost comes full circle, as I am now a Master's student and am attempting to do exactly what you describe in your last paragraph. I want to create an EKF optimizer class (or even framework, if time permits) and explore different ways of combining estimation theory with neural networks. First, multi-class classification and regression with the EKF.

Is there a way we can get in touch so I can talk through / bounce some ideas off you? I know this isn't necessarily the right place for this, but I have been analyzing your kalmaNN for quite some time now, and I have even implemented my own NN from scratch with multiple hidden layers.

Let me know!

jnez71 commented 4 years ago

Ah! Actually I was able to dig up the document I wrote when I turned this in as a school project. It has some details about the Jacobian!

(attached image: excerpt from the write-up)

Here is the full document if it helps (doubt it). school.pdf

Melkeydev commented 4 years ago

@jnez71 Not 100% sure if you saw my reply, as your comment was posted maybe a minute later by coincidence, but would you be available for some consultation on my thesis project? Just to ask you a few questions, since you have experience on this topic.

jnez71 commented 4 years ago

@Amokstakov Whoops just saw your last comment after closing the issue. Sorry!

The reason higher-order optimization routines like the EKF are not used in deep learning (DL) is that DL is more concerned with generalization than with fitting. DL prefers to use models with immense capacity and train "inaccurately" with things like stochastic gradient descent. They literally add noise to the training process itself just to shake things up, and apply stochastic regularization like dropout to keep the model using its full capacity. For DL, an EKF update is kind of a waste of time, just like other trust-region methods or line-searches - they would rather spend the time one Newton / EKF step takes to compute on 20 SGD steps.

Meanwhile, the pictures in my README would lead one to think SGD is a failure compared to EKF! Well, it's not exactly a fair battle. In my examples, the data is subject to legit noise, and the EKF gets to know the covariances while SGD doesn't. So of course the EKF performs better. Further, the architecture is very shallow, and I'm fitting very low dimensional data (1-3 dimensions is tinyyy). Basically, I show the EKF (covariance-weighted Newton's method) outperforms SGD for small-scale nonlinear regression subject to noise, not for deep learning.
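For reference, here is a rough numpy sketch of what one such EKF weight update looks like, with the weights as the "state" and the network output as the "measurement". It's the textbook measurement update rather than the exact code in this repo, and h, jac_h, Q, and R are placeholders:

```python
# Rough numpy sketch of one EKF weight update (the "covariance-weighted Newton
# step"). Textbook measurement update, not this repo's exact code; h, jac_h,
# Q, and R are placeholders.
import numpy as np

def ekf_step(w, P, x, y, h, jac_h, Q, R):
    # w: (n,) flattened weights (the "state"),  P: (n,n) weight covariance
    # x: input, y: observed target,  h(w, x): network output ("sensor")
    # jac_h(w, x): (m, n) Jacobian H of h w.r.t. w
    # Q: (n,n) process noise (random-walk weights),  R: (m,m) measurement noise
    H = jac_h(w, x)
    S = H @ P @ H.T + R                   # innovation covariance
    K = np.linalg.solve(S, H @ P).T       # Kalman gain P H^T S^-1 (P, S symmetric)
    w = w + K @ (y - h(w, x))             # correct the weights
    P = (np.eye(len(w)) - K @ H) @ P + Q  # update the weight covariance
    return w, P
```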

Now, you never said you meant to make your thesis about using the EKF for DL, of course; I just wanted to get that out there in case this repo was misleading. I do think it's cool that you're exploring this further.

Anyway, as far as reaching out goes, I have been pretty busy lately, so I don't think I'll be able to be too responsive, but I don't mind answering a few more short questions here on this thread (even though it's "closed"). It might provide insight to other people who wander in here wondering how the heck H is computed.

Melkeydev commented 4 years ago

Thank you for the in-depth answer - obviously the biggest takeaway is understanding how the Jacobian was calculated. In terms of my own thesis, that is pretty upsetting news. I was hoping to create shallow MLPs from scratch, integrate them with different estimation-theory optimizers, and compare the results with the standard SGD and ADAM optimizers currently used in DL.

With that said, do you have any suggestions/thoughts on other potential ideas that I could explore/demonstrate in a Master's thesis regarding DL? (It does not necessarily need to incorporate estimation theory, but I personally think there are some possibilities there.)

Let me know!

Melkeydev commented 4 years ago

@jnez71 Apologies for the double post, but I was also exploring using the EKF/KF for human tracking?

jnez71 commented 4 years ago

In terms of training accuracy, higher-order optimizers and those rooted in stochastic estimation theory (whenever it is actually relevant to the problem) will probably always beat out SGD / ADAM. But in DL the important comparison is validation accuracy (generalization), and that's where the performance difference won't be so striking, while SGD / ADAM will continue being faster. Essentially, the more domain knowledge / structure (fewer parameters) your model has, the more useful an advanced optimizer will be, because you really want that global cost minimum. The examples in this repo are like that - low-dimensional, highly regular data sets with well-characterized noise. Basically, this repo doesn't demonstrate DL, it demonstrates nonlinear regression. I don't even make validation sets lol

In DL, we typically have little to no domain knowledge and use completely unstructured models with immense capacity like deep neural networks. In that setting, there isn't really a "true" correct set of parameters that estimation theory even applies to. Many different local optima will all provide great training accuracy, differing from each other only in how they generalize, because they learned different but equally discriminatory patterns in the training data. The global optimum might not even generalize the best! So if SGD converges faster and matches the validation accuracy an EKF gets us, why bother waiting longer for the EKF? That is kind of the DL mindset right now regarding optimizers, and again the reason why we don't see DL using trust-region or line-search methods.

So what would someone wanting to work on estimation theory focus on then? Well, the real-world modeling problems that have reasonable amounts of structure of course! Maybe it's not a good idea to use an EKF to train ImageNet, but it is certainly a good idea to use an EKF for human pose tracking! Classical mechanics is highly structured and estimating mechanical state variables like joint angles is an application where dynamical systems estimation theory shines. In other words, human tracking isn't really a DL application.

Now maybe anomaly detection in human movement data is a DL application, because what makes an anomaly is a very unstructured subjective thing. I would expect, for example, an EKF used to reconstruct human pose trajectory from camera and IMU data (a very structured estimation problem), and then an ADAM-trained neural network used to infer things like anomalies or demeanor, etc., from that trajectory data.

Now, to clarify, this has all been about EKF vs common optimizers (SGD / ADAM) for training a model. We've been calling the EKF an application of "estimation theory" and discussing using the EKF to train an NN as "combining estimation theory with DL." But "combining estimation theory with DL" goes way beyond training methods. Estimation theory / statistics has become the foundational theoretical understanding of DL in the last decade. We think now of DL architectures as variational statistical inference, where the NN (with softmax output) is just a parameterized probability distribution, and the whole DL process is rephrased as learning the joint probability distribution that generated the data set. So indeed estimation theory and DL are very much intertwined, just not with regards to the actual optimization routines used to do the actual parameter fitting. In many ways, DL is a subset of estimation theory!
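To make that last connection concrete, one standard way to write it down (generic notation, nothing specific to this repo): the softmax network defines a parameterized conditional distribution, and cross-entropy training is just maximum-likelihood estimation of its parameters.

```latex
p_\theta(y = k \mid x) = \mathrm{softmax}\big(f_\theta(x)\big)_k,
\qquad
\hat\theta = \arg\min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[ -\log p_\theta(y \mid x) \big]
```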

I'm sure wherever your thesis takes you, it will be interesting! Good luck!