iancovert / Neural-GC

Granger causality discovery for neural networks.

NaN loss and all 1 GC #8

Closed: SwapnilDreams100 closed this issue 2 years ago

SwapnilDreams100 commented 2 years ago

Hi, I am using this for non-linear GC on some brain time series ([time series plot attached]). This is what training looks like with the params: cRNN, context=10, lam=10.0, lam_ridge=1e-2, lr=1e-3, max_iter=20

----------Iter = 50----------
Loss = 151.557373
Variable usage = 99.95%
----------Iter = 100----------
Loss = nan
Variable usage = 57.95%
----------Iter = 150----------
Loss = nan
Variable usage = 50.97%
----------Iter = 200----------
Loss = nan
Variable usage = 42.81%
----------Iter = 250----------
Loss = nan
Variable usage = 36.39%
----------Iter = 300----------
Loss = nan
Variable usage = 38.50%
Stopping early

The estimated GC matrix is also all 1s. Any intuition would be helpful!

SwapnilDreams100 commented 2 years ago

Hi, I think I solved it by setting context to 1, but I would still appreciate some intuition.

iancovert commented 2 years ago

Hi Swapnil,

Encountering NaN loss typically means that there was some instability during training that caused the loss to blow up. This can happen for a couple of reasons, and it's a bit hard to diagnose without poking around. But in any case, it means that training didn't happen properly, so the GC results are meaningless.

Note that the GC matrix starts out as all 1s, and 0s only appear as the sparse penalties encourage some parameters to become very small. This won't happen if training malfunctions early on. In this case, we're seeing the GC evolve despite the NaNs, but that's most likely because only a small number of the networks encountered them (there's a separate network forecasting each time series).
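If you want to check which ones were affected, a quick diagnostic along these lines should work (a sketch that assumes the cRNN keeps its per-series sub-networks in a networks attribute, as in the repo's model code):

```python
# Sketch: check which per-series networks hold NaN parameters after a bad run.
# Assumes crnn is the trained cRNN model and that its sub-networks live in
# crnn.networks (as in the repo's model code).
import torch

for i, net in enumerate(crnn.networks):
    if any(torch.isnan(p).any() for p in net.parameters()):
        print(f'Network forecasting series {i} contains NaN parameters')
```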

It's good to see that setting context = 1 fixed the problem, but I think we shouldn't be happy with that solution. It means that the RNN is functioning as an MLP and only looking at one previous time point. Let's see if we can make training stable with a more reasonable value like context = 10.

My first recommendation is to lower the learning rate. Your time series has some relatively large values, and the correspondingly large gradient steps could be problematic. So maybe try lr = 1e-4, something like the sketch below. And if that doesn't help, try 1e-5. Let me know how that goes.
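For concreteness, a minimal sketch of the adjusted run, assuming the cRNN / train_model_ista interface from the repo's notebooks; max_iter, check_every, and hidden here are illustrative placeholders, not recommendations:

```python
# Sketch: re-run training with a lower learning rate.
# Assumes the cRNN / train_model_ista interface from the repo's notebooks,
# with X a (1, T, p) float tensor of the time series.
import torch
from models.crnn import cRNN, train_model_ista

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
X = X.to(device)
crnn = cRNN(X.shape[-1], hidden=100).to(device)

train_loss_list = train_model_ista(
    crnn, X, context=10, lam=10.0, lam_ridge=1e-2,
    lr=1e-4,          # lowered from 1e-3 to stabilize training
    max_iter=1000,    # placeholder training budget
    check_every=50)
```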

SwapnilDreams100 commented 2 years ago

Hi Ian, I really appreciate the great explanation. Reducing the LR to 1e-4 solved the problem; it's now working nicely for context=10, with the loss stably reducing to 5.8. Actually, I want to compare the GC values from the non-linear neural methods with standard GC values based on p-values. What is the interpretation of the estimated GC scores and the variable usage percentage? Also, the context parameter is synonymous with the lag in the standard approach, right? Thanks a lot!

iancovert commented 2 years ago

Great, glad to hear that lowering the learning rate stabilized training.

The variable usage percentage represents the fraction of entries in the GC matrix that are 1s. That should be helpful for finding a good lam value, because you want one that gives you a reasonable amount of sparsity: you can tell lam is too low if the variable usage stays near 100%, and too high if it drops to near 0%.
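If you want to compute it yourself from the estimated GC matrix, something like this works (a sketch assuming the GC(threshold=True) accessor used in the notebooks):

```python
# Sketch: variable usage is just the fraction of 1s in the thresholded GC matrix.
# Assumes crnn.GC(threshold=True) returns a (p, p) 0/1 tensor, as in the notebooks.
GC_est = crnn.GC(threshold=True).cpu().numpy()
print(f'Variable usage = {100 * GC_est.mean():.2f}%')
```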

Once training is done, you can get the GC matrix (the notebooks should have examples of this) to see which time series Granger cause which other ones. These results can be compared to other methods, e.g., a VAR model fit with lasso or group lasso penalties should return a similar result. Unlike some other methods though, our models don't return p-values, they just return a best guess for the GC matrix (which depends on the chosen lam value). It may be helpful to find a couple lam values that give increasing levels of sparsity.
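For the comparison, here's a rough sketch of the kind of penalized VAR baseline I mean. It's not part of this repo; it's a simplified scikit-learn version using a per-target lasso rather than a group lasso, and the lag and alpha values are just placeholders:

```python
# Rough baseline sketch (not from this repo): a lagged linear model with a
# per-target lasso penalty; GC[i, j] = 1 means series j looks useful for
# predicting series i. The lag and alpha values are placeholders.
import numpy as np
from sklearn.linear_model import Lasso

def var_lasso_gc(X, lag=10, alpha=0.1):
    """X: (T, p) array of time series. Returns a (p, p) 0/1 matrix."""
    T, p = X.shape
    # Design matrix: each row stacks the previous `lag` time points.
    Z = np.hstack([X[lag - k - 1:T - k - 1] for k in range(lag)])  # (T - lag, p * lag)
    Y = X[lag:]
    GC = np.zeros((p, p), dtype=int)
    for i in range(p):
        coef = Lasso(alpha=alpha).fit(Z, Y[:, i]).coef_.reshape(lag, p)
        GC[i] = (np.abs(coef).max(axis=0) > 0).astype(int)
    return GC
```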

Let me know if that makes sense.

SwapnilDreams100 commented 2 years ago

Hi Ian, thanks for the explanation, it makes a lot of sense! My current usage is around 3-4% on a few runs, so I should play around more to get it higher, right? Is aiming for 30-40% usage a good benchmark for lam values?

iancovert commented 2 years ago

It depends on the dataset and how much sparsity you want/expect. This is the case for most Granger causality methods: we rarely know the level of sparsity we want, and there's a parameter we can tune that can give us any amount of sparsity between 0-100%.

One way to deal with it is to try a range of lam values and see what the sparsity looks like at different levels. That's basically what we did in our paper: we swept a range of lam values and compared the GC at each level to the ground truth when calculating AUROC/AUPR scores. If you have no ground truth though, I think seeing the results for 40%, 30%, 20% usage can be helpful (so you would want to lower lam); see the sketch below.
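Continuing from the earlier training sketch (same assumed cRNN / train_model_ista / GC interface; the lam grid, max_iter, and check_every are illustrative):

```python
# Sketch: sweep lam and record the resulting sparsity level.
from models.crnn import cRNN, train_model_ista

results = {}
for lam in [10.0, 5.0, 1.0, 0.5, 0.1]:
    crnn = cRNN(X.shape[-1], hidden=100).to(X.device)
    train_model_ista(crnn, X, context=10, lam=lam, lam_ridge=1e-2,
                     lr=1e-4, max_iter=1000, check_every=50)
    GC_est = crnn.GC(threshold=True).cpu().numpy()
    results[lam] = GC_est
    print(f'lam={lam}: variable usage = {100 * GC_est.mean():.1f}%')
```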

SwapnilDreams100 commented 2 years ago

Got it, thanks so much for your advice! Happy to close this issue.