informatics-lab / precip_rediagnosis

Project to use ML to re-diagnose precipitation fields from ensemble model fields

Adding notebook with permutation importance examples #27

Closed hannahbrown7 closed 2 years ago

hannahbrown7 commented 2 years ago

This notebook contains examples of how to calculate the Breiman (2001) and Lakshmanan (2015) interpretations of permutation importance. Breiman PI works both with models trained only on vertical profile features and with models trained on a mix of vertical profile and single-level features. Lakshmanan PI currently only works with models trained on vertical profile features.
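For readers unfamiliar with the Breiman-style approach, here is a minimal numpy sketch of single-pass permutation importance: shuffle one feature column at a time and measure the increase in error. This is an illustrative standalone example, not the notebook's actual code; the `predict` callable, feature shapes, and MSE scoring are assumptions for the demo.

```python
import numpy as np

def breiman_permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Breiman-style permutation importance: the increase in error when a
    single feature column is shuffled, averaged over several repeats."""
    rng = np.random.default_rng(seed)
    base_error = np.mean((predict(X) - y) ** 2)  # baseline MSE on unshuffled data
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        errors = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle column j only, breaking its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            errors.append(np.mean((predict(X_perm) - y) ** 2))
        importances[j] = np.mean(errors) - base_error  # error increase = importance
    return importances

# Toy check: y depends only on feature 0, so feature 0 should dominate.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=500)
imp = breiman_permutation_importance(lambda X: 3.0 * X[:, 0], X, y)
```

The Lakshmanan (2015) variant extends this with a multi-pass (sequential) selection loop, permuting the most important feature found so far and repeating on the remainder, which this single-pass sketch does not show.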

review-notebook-app[bot] commented 2 years ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


stevehadd commented 2 years ago

That looks good to me. Some thoughts/questions:

hannahbrown7 commented 2 years ago

Having looked it up, the general advice is to use tf.keras rather than standalone keras, so I have updated the code in cell 24 to use tf.keras.

hannahbrown7 commented 2 years ago

Regarding how multi-level features are treated during permutation importance: currently, multi-level features are permuted based on their order, so the structure of the height levels within the profile is maintained. However, I agree it would be really interesting to dig down and assess the impact of the different height levels. From the data exploration we saw quite a lot of correlation between neighbouring height levels, and one limitation of permutation importance is that if the permuted feature is highly correlated with another feature, permuting it appears to have little impact, because the model can recover the information from the correlated feature; this makes the permuted feature look less important than it perhaps actually is. So it may need some consideration as to whether to take a different approach, or just careful interpretation.
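Permuting a whole vertical profile as a unit can be sketched as shuffling which sample each profile belongs to, so that the correlations between adjacent height levels within a profile stay intact. This is a hypothetical illustration (the column layout and shapes are invented), not the notebook's implementation:

```python
import numpy as np

def permute_profile_block(X, cols, rng):
    """Permute a multi-level feature as one block: reassign whole profiles
    to other samples, preserving the height structure within each profile."""
    X_perm = X.copy()
    idx = rng.permutation(X.shape[0])
    X_perm[:, cols] = X[idx][:, cols]  # intact profiles, shuffled across samples
    return X_perm

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))       # assumed layout: 6 samples, 8 feature columns
profile_cols = np.arange(0, 4)    # columns holding one 4-level vertical profile
Xp = permute_profile_block(X, profile_cols, rng)

# Every permuted row should still contain an intact profile from the original data
rows_match = [
    any(np.allclose(Xp[i, profile_cols], X[k, profile_cols]) for k in range(6))
    for i in range(6)
]
```

Permuting individual height levels independently would instead destroy the intra-profile correlation structure, which is part of why level-by-level importance needs careful interpretation.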

hannahbrown7 commented 2 years ago

Thank you for flagging that different permutation importance plots are being produced! A random seed is set for splitting the data, so I don't think that is a factor. It may be partly due to the lack of a random seed in the ML model, which, as you say, results in different network weight initialisations on each run. Also, as it is a small dataset and the features are being randomly permuted, this may cause some variation; it is probably a combination of factors. It does seem to vary quite a lot, though, and it is concerning that the features are not always ranked in the same order.
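One common way to make the permutation-driven part of this variability visible is to repeat the permutation many times and report the mean and spread of the error increase, rather than a single value. A minimal numpy sketch (illustrative only; the model, scoring, and repeat count are assumptions):

```python
import numpy as np

def permutation_importance_with_spread(predict, X, y, j, n_repeats=30, seed=0):
    """Repeat the permutation of feature j many times and return the mean and
    standard deviation of the error increase, exposing run-to-run variability."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)  # baseline MSE
    deltas = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        deltas.append(np.mean((predict(Xp) - y) ** 2) - base)
    return np.mean(deltas), np.std(deltas)

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
y = X[:, 0] + X[:, 1]
mean_d, std_d = permutation_importance_with_spread(lambda X: X[:, 0] + X[:, 1], X, y, j=0)
```

A large spread relative to the mean would suggest that feature-ranking differences between runs are expected noise; the remaining variation would then point at the unseeded weight initialisation.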

hannahbrown7 commented 2 years ago

Have removed the cells at the bottom of the notebook which are not relevant to the feature importance assessment.

hannahbrown7 commented 2 years ago

Agree assessments of different regimes and/or weather events will be interesting