These scripts are used to analyze a series of trajectories, look at their scores, determine when each score has equilibrated, and select models for further analysis. Sample scrips are in the example folder. These scripts assume that you are a series of IMP runs named run_0, run_1, ...
AT = AnalysisTrajectories(out_dirs,
dir_name='run_',
analysis_dir=analysis_dir,
detect_equilibration=True,
burn_in_fraction=0.02,
nskip=20,
nproc=1,
plot_fmt='pdf')
The minimum set of arguments to specify are dir_name
: the prefix of the run folders, analysis_dir
: the directory where all analysis output will be written, and nproc
: the number of processors that are used to do parallel I/O. Additional options are: detect_equilibration
which when set to False
turns off rigorous statistical determination of the burn-in or equilibration phase and instead just discards a burn_in_fraction
(default 2% of the trajectory size) fraction of frames from the beginning of each independent run, to calculate statistics. nskip
is used to thin out the samples collected for analysis, i.e. every nskip
^th frame is selected. Finally, plot_fmt
specifies the file extension of all matplotlib figures generated. You should only set detect_equilibration=False
if you are testing your analysis script and need it to run fast without waiting for calculating the exact length of the monte carlo burn-in phase.
XLs_cutoffs = {'DSSO': 30.0}
AT.set_analyze_XLs_restraint(XLs_cutoffs=XLs_cutoffs)
AT.set_analyze_Connectivity_restraint()
AT.set_analyze_Excluded_volume_restraint()
If you set up more than one XLs restraint, you should include the Multiple_XLs_restraints=True flag in the analysis and include a cutoff for all of them in the XLs_cutoffs dictionary. For example, if you divided the DSSO XLs into a intra-subunit and inter-subunit datasets, there should be two elements in the XLs_cutoffs dictionary (even if they have the same cutoff):
XLs_cutoffs = {'DSSO_Inter': 30.0, 'DSSO_Intra': 30.0}
AT.set_analyze_XLs_restraint(XLs_cutoffs=XLs_cutoffs,
Multiple_XLs_restraints=True)
Using the labels you used to setup the restraint and keys.
Similarly, if there is ambiguity in the XLs assignments (i.e. you have multiple copies of the same protein), you should use the ambiguous_XLs_restraint=True option:
AT.set_analyze_XLs_restraint(XLs_cutoffs=XLs_cutoffs,
ambiguous_XLs_restraint=True)
If you want to analyze a restraint that is not standard to IMP, you can do so by including the handle associated to that restraint in the stat file. E.g. if your restraint is called COMDistanceRestraint_data_Score
in the stat files produced after sampling (i.e. the restraint handle) and you'd like the restraint score to be referred to by a short name, say COM
throughout the various analysis output data-frames:
AT.set_analyze_score_only_restraint(handle='COMDistanceRestraint_data_Score',
short_name='COM',
do_sum=True)
The do_sum=True
option will add up all restraint scores that start with the restraint handle as specified (e.g. COMDistanceRestraint_data_Score_1
, COMDistanceRestraint_data_Score_2
, etc). In general it may be desirable to add scores of the same type prior to score based clustering, since that decreases the dimensionality of the score space and helps the clustering process converge better. E.g., restraints like connectivity and excluded volume have their scores added up by default. Note that this method will only analyze the scores for the restraint, not nuisances or other associated statistics.
AT.read_stat_files()
In this step we also automatically determine the equilibration time for each restraint and nuisance particles. Only models after equilibration are considered for further analysis (unless, as mentioned before the detect_equilibration
option is set to False. )
Reference for equilibration detection:
Utilities for automatically detecting equilibrated region of molecular simulations. DOI: 10.1021/acs.jctc.5b00784
John D. Chodera john.chodera@choderalab.org
Obtain the statistics of the XLs restraint Psi nuisance parameter use:
AT.get_psi_stats()
Do HDBSCAN clustering for selected scores and/or nuisance parameters:
AT.hdbscan_clustering(['EV_sum', 'XLs_sum'])
If you added a non-standard restraint, you can use it in clustering with the name 'COMDistanceRestraint' previously used (everything before the first _)
AT.summarize_XLs_info()
This will create a series of files and plots summarizing the XLs distances and satisfaction in all the clusters obtained in step 6. Files ```plot_run_models_cluster*.<plot_fmt>``` show the number of models from each run that are in each cluster. Files ```plot_scores_convegence_cluster*.<plot_fmt>``` show the scores distribution for each cluster.
Files ```all_info_*.csv``` contain the information about all models after equilibration. These files can be used to re-run the clustering step:
```python
AT.read_models_info()
AT.hdbscan_clustering(['CR_sum', 'EV_sum', 'XLs_sum'])
After clustering a series of files are written with the information of frames in each cluster.
all_info_*.csv
files:XLs_cutoffs = {'DSSO_Inter': 30.0, 'DSSO_Intra': 30.0}
AT = AnalysisTrajectories(out_dirs,
analysis_dir=analys_dir,
nproc=nproc)
AT.read_models_info(XLs_cutoffs)
AT.hdbscan_clustering(['EV_sum', 'XLs_sum'],
min_cluster_size=200,
min_samples=5,
skip=5)
AT.summarize_XLs_info()
Reference for HDBSCAN clustering:
http://hdbscan.readthedocs.io/en/latest/index.html
To extract the models from the rmf3 file, use the script run_extract_models.py
.
To test for convergence and do structural clustering, use the script run_clustering.sh
Reference for convergence and clustering anlysis:
Assessing exhaustiveness of stochastic sampling for integrative modeling of macromolecular structures. https://doi.org/10.1016/j.bpj.2017.10.005
get_accuracy_rmfs.py