luigibonati / mlcolvar

A unified framework for machine learning collective variables for enhanced sampling simulations
MIT License

Add sensitivity analysis #108

Closed luigibonati closed 8 months ago

luigibonati commented 8 months ago

Description

Add a sensitivity analysis as done in the DeepLDA/DeepTICA paper, as requested in #105.

Todos

For now I created a new module inside utils called explain, although we might also move it outside utils and create an explain module which contains this together with other things, such as the sparse classification with Lasso (@pietronvll)

codecov[bot] commented 8 months ago

Codecov Report

Merging #108 (d0399d3) into main (4fc7301) will increase coverage by 0.36%. Report is 2 commits behind head on main. The diff coverage is 92.35%.

Additional details and impacted files
luigibonati commented 8 months ago

In order to automatically retrieve the input descriptor names, I added a feature_names attribute to DictDataset (both in the constructor and as a property) and changed the function create_dataset_from_file to set it automatically when creating a dataset from a dataframe. Let me know if you have comments.
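As a hedged sketch of the idea (DictDataset and create_dataset_from_file come from this PR, but the dataframe handling below is only illustrative, not the actual mlcolvar implementation), feature names can be inferred from a dataframe's columns when building a dataset:

```python
import pandas as pd

# Illustrative dataframe standing in for one loaded from a COLVAR file;
# column names and the "time" column are assumptions for this sketch.
df = pd.DataFrame({"d_1": [0.1, 0.2], "d_2": [0.3, 0.4], "time": [0, 1]})

# Drop non-descriptor columns (e.g. time) and keep the rest as feature names,
# mimicking how a dataset created from a dataframe can record its descriptors.
feature_names = [c for c in df.columns if c != "time"]
print(feature_names)  # ['d_1', 'd_2']
```

This way, downstream analysis tools can label features by name instead of by index.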

EnricoTrizio commented 8 months ago

I would prefer to have the plots horizontally rather than vertically, please fix this @luigibonati :hugs:

jinmei-1 commented 8 months ago

Hello! Thank you for providing the sensitivity analysis code. We have tested it and found its performance to be excellent. However, we encountered an issue while simplifying our system's 4536 descriptors: the image size is too large. Specifically, we received the following error message: ValueError: Image size of 500x113400 pixels is too large. It must be less than 2^16 in each direction. We would like your guidance on how to modify the code to address this issue.

Additionally, while reading your article "Deep learning the slow modes for rare events sampling," we were interested in the method you mentioned for reducing the descriptor set in the Chignolin folding case. In your research, you reduced the number of descriptors from 4278 to 210 by selecting the most relevant ones through sensitivity analysis of the primary CVs. We would like to learn more about the criteria you used to select these 210 descriptors: why 210 instead of 220 or 200?

We appreciate your assistance and look forward to your guidance. Meanwhile, we are sharing our test code with you for a better understanding of our issues. Thank you for your time and support. 12-21-code and COVARL.zip

luigibonati commented 8 months ago


Hi, thanks for the feedback. I added a tutorial in the documentation where I show how to customize the analysis. It should work now, let me know!

luigibonati commented 8 months ago

This also closes #105 and #106.

jinmei-1 commented 8 months ago

Hello! Thank you for providing your tutorial and the new code. We adapted your DeepLDA tutorial to run DeepTICA on our protein-ligand system. We successfully plotted the sensitivity, but encountered an issue when attempting to create a new dataset using only the first N features, as indicated below: KeyError: "None of [Index(['2003', '2094', '2007', '2099', '2005', '2008', '2046', '2004', '2100', '2006'], dtype='object')] are in the [columns]"


Could you please guide us on how to modify it to suit DeepTICA data? Additionally, we have a minor query regarding this line of code: relevant_features = results['feature_names'][-n_features:]. Why are the last n_features elements taken from the end of the feature_names list, rather than the first n_features in order of sensitivity ranking? We've attached our test code for your review to provide better insight into our challenges. Your time and support are greatly appreciated. We look forward to your response. 12-27-code and COVARL.zip

luigibonati commented 8 months ago

The features are sorted in ascending order of sensitivity, which is why we slice the last N features instead of the first ones.
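A minimal sketch of that slicing logic (the dictionary and feature names below are made up for illustration, not taken from the actual results object):

```python
# Hypothetical per-feature sensitivity scores.
sensitivities = {"d_a": 0.1, "d_c": 0.9, "d_b": 0.5}

# Sort feature names by sensitivity in ASCENDING order, as in the analysis.
names_sorted = sorted(sensitivities, key=sensitivities.get)

# Because the order is ascending, the MOST sensitive features sit at the end,
# so the top-N are obtained by slicing from the back of the list.
n_features = 2
relevant = names_sorted[-n_features:]
print(relevant)  # ['d_b', 'd_c']
```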

Regarding the error in creating the new dataset: in the case of a time-lagged dataset (unlike those created with create_dataset_from_file) there is no information about the feature names, so they have to be specified by hand. Indeed, in the sensitivity plot the features are displayed as numbers rather than with their correct names.

let me know if this works


# assuming the sensitivity utilities live in mlcolvar.utils.explain, the module added in this PR
from mlcolvar.utils.explain import sensitivity_analysis

# take the descriptor names from the columns of the COLVAR dataframe
feature_names = colvar.filter(regex='d_').columns

results = sensitivity_analysis(model,
                               dataset,
                               metric="mean_abs_val",       # metric used to compute the per-feature sensitivity (e.g. mean absolute value or root mean square)
                               feature_names=feature_names, # by default, they will be taken from `dataset.feature_names`
                               per_class=False,             # whether to compute per-class statistics
                               plot_mode=None)              # plot mode (see below)
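On the KeyError when building the reduced dataset: a hedged pandas-only sketch (the colvar dataframe and the d_ column prefix are assumptions mirroring the snippet above). The key point is to index the dataframe with labels that actually exist in its columns; the error above arose because string labels like '2003' were not among them.

```python
import pandas as pd

# Illustrative stand-in for a dataframe loaded from a COLVAR file.
colvar = pd.DataFrame({"time": [0, 1], "d_1": [0.1, 0.2], "d_2": [0.3, 0.4]})

# Match the descriptor columns by their real labels (here via a regex on the
# assumed 'd_' prefix), then select them to get a reduced dataframe.
relevant = colvar.filter(regex="d_").columns.tolist()
reduced = colvar[relevant]
print(reduced.columns.tolist())  # ['d_1', 'd_2']
```

Selecting with labels taken directly from colvar.columns guarantees they are present, whereas labels built by hand (or carried over from a dataset without feature names) may not match.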

I will add a warning that the feature names could not be retrieved if they are not present in the dataset.

Luigi

jinmei-1 commented 7 months ago

Thank you for providing the new code; it works and I obtained the new Deep-TICA (.tpc) model. I have successfully installed GROMACS 2022.5 patched with PLUMED 2.9.0 on our server. Additionally, I have configured LibTorch and the PyTorch module, with both "plumed config has libtorch" and "plumed config module pytorch" returning on. However, when attempting enhanced sampling simulations, I encountered an error: GROMACS displayed "free(): invalid pointer." This issue persisted when I tried to run the Deep-TICA example for alanine, as outlined in your paper "Deep learning the slow modes for rare events sampling" (Luigi Bonati, GiovanniMaria Piccini, and Michele Parrinello, Proceedings of the National Academy of Sciences, 118(44), 2021).

I find this error perplexing, and I was hoping you could assist me in troubleshooting or provide any guidance on resolving this matter. I have attached relevant files such as plumed.dat, .tpr, .tpc, descriptors.dat, and a log for your reference. Thank you very much for your time and consideration. I appreciate any insights or assistance you can provide.

Best regards, jinmei ref.zip