iancovert / Neural-GC

Granger causality discovery for neural networks.
MIT License

Variable usage #12

Closed: zeyevv closed this issue 2 months ago

zeyevv commented 2 months ago

Dear developer, thanks for sharing this repository for the Neural Granger Causality project. I was deeply inspired by your paper. I have a question about the 'Variable Usage' number you display during training: I know it is calculated as the average of the first-layer weights, but why did you choose it as an indicator of the model's training status? I would be very grateful if you could answer my question.

iancovert commented 2 months ago

Variable usage isn't quite the average of the first-layer weights; it's the average of the estimated Granger causality matrix (see here for example). The matrix is calculated using the norm of the first-layer weights along dim=0, and an input is said to be Granger causal if the weights that touch it are non-zero (see here for example).

Anyway, the variable usage number isn't reported in our results, but I found it useful to monitor during training. At initialization, the network uses all the features because no weights have been sparsified, but over the course of training the number shrinks and eventually converges to a stable value. For example, you can quickly tell that you set the regularization strength too high if the variable usage immediately drops to 0.
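For concreteness, here's a minimal sketch of that calculation in PyTorch. The structure and names (`networks`, `fc1`) are illustrative, not the repository's exact code:

```python
import torch

def gc_matrix(networks, threshold=True):
    # Each networks[i] is the network that predicts series i, with a first
    # layer `fc1` whose weight has shape (hidden, p). Illustrative names.
    rows = []
    for net in networks:
        W = net.fc1.weight                 # (hidden, p)
        rows.append(torch.norm(W, dim=0))  # one norm per input series
    GC = torch.stack(rows)                 # (p, p); GC[i, j] > 0 means input j
                                           # is estimated to Granger-cause series i
    return (GC > 0).float() if threshold else GC

# Variable usage is then the mean of the thresholded matrix: the fraction of
# (target, input) pairs whose first-layer weights are still non-zero.
# usage = 100 * gc_matrix(networks).mean().item()
```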

zeyevv commented 2 months ago

Thanks a lot for your explanations. I have another question: when the input data contains multiple slices (like the DREAM dataset), is there any difference in learning the prediction model if I input the data as a three-dimensional tensor (variable × time × number of slices) rather than a two-dimensional tensor (variable × time, concatenating all slices)? I would be very grateful if you could answer my question.

iancovert commented 2 months ago

Just to make sure I understand: it sounds like you have multiple time series (let's say n), each of length T with p dimensions. I would recommend organizing these into a 3-dimensional tensor with shape (n, T, p). That's what our models expect as input (see here). If you instead organized them as a tensor of shape (1, n * T, p) by concatenating all the time series, I think the model would struggle at the border of each time series and not train properly.
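As a quick sketch (the variable names here are placeholders, and I'm assuming your slices are stored as a list of (T, p) arrays):

```python
import numpy as np
import torch

# `slices` stands in for your n replicate series, each shaped (T, p).
slices = [np.random.randn(100, 10) for _ in range(5)]

# Recommended: stack along a new leading dimension.
X = torch.stack([torch.tensor(s, dtype=torch.float32) for s in slices])
print(X.shape)  # torch.Size([5, 100, 10]), i.e. (n, T, p)

# Not recommended: concatenating in time gives shape (1, n * T, p) and creates
# artificial transitions where one slice ends and the next begins.
X_bad = torch.cat([torch.tensor(s, dtype=torch.float32) for s in slices]).unsqueeze(0)
```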

zeyevv commented 2 months ago

Thanks a lot for your explanations.

iancovert commented 2 months ago

Happy to help, let me know if you have any other questions.

zeyevv commented 2 months ago

Dear developer, I encountered a problem with selecting the regularization parameter 'lam' when testing on the DREAM dataset. Can you explain how to use cross-validation to select an appropriate λ?

iancovert commented 2 months ago

Yes, people often ask us about this. In our experiments we didn't focus on selecting a single best $\lambda$ value for each dataset; instead, we fit our models with a range of values and used the learned sparsity patterns to calculate the AUROC. I believe this is the best approach to evaluation, because comparisons between different learning approaches are unfair if they target different sparsity levels (I've seen several papers make this mistake). When determining a range of $\lambda$ values, we manually found values that resulted in complete sparsity and in no sparsity, and then fit with an evenly spaced range of values between those extremes.
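For illustration, here's roughly what such a sweep could look like. `fit_model`, `estimated_gc`, `lam_none`, `lam_full`, and `GC_true` are hypothetical placeholders for your training routine, the learned GC matrix, the two manually found endpoints, and the ground-truth adjacency matrix; averaging the binary patterns across the sweep is just one plausible way to turn them into scores for the AUROC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Evenly spaced grid between the no-sparsity and full-sparsity endpoints.
lam_grid = np.linspace(lam_none, lam_full, num=20)

patterns = []
for lam in lam_grid:
    model = fit_model(X, lam=lam)          # hypothetical training routine
    GC_est = estimated_gc(model)           # (p, p) first-layer weight norms
    patterns.append((GC_est > 0).astype(float))

# Average the binary sparsity patterns so inputs that survive heavier
# regularization receive higher scores, then compute the AUROC.
GC_score = np.mean(patterns, axis=0)
auroc = roc_auc_score(np.asarray(GC_true).flatten(), GC_score.flatten())
```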

If you wanted to select a single best $\lambda$ value for your dataset, cross-validation seems like a reasonable approach. You'll risk over-selecting variables, because keeping extra variables typically doesn't hurt predictive accuracy, but that's a common issue when tuning regularization for sparse models.
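A hypothetical sketch of that cross-validation, holding out the end of each series in time (`fit_model` and `prediction_error` are again placeholders):

```python
import numpy as np

# Split each series in time: train on the first 80%, validate on the rest.
T = X.shape[1]
split = int(0.8 * T)
X_train, X_val = X[:, :split], X[:, split:]

val_errors = []
for lam in lam_grid:
    model = fit_model(X_train, lam=lam)
    val_errors.append(prediction_error(model, X_val))

best_lam = lam_grid[int(np.argmin(val_errors))]
# Per the caveat above, expect this to err toward smaller lambda values,
# i.e. toward keeping more variables than strictly necessary.
```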