greenelab / tybalt

Training and evaluating a variational autoencoder for pan-cancer gene expression data
BSD 3-Clause "New" or "Revised" License

Scripts to Loop over hidden layer (z) dimensionality #117

Closed gwaybio closed 6 years ago

gwaybio commented 6 years ago

Contingent upon the results of #116, but this is the logic that will submit several jobs to PMACS, each training various compression algorithms with a progressively less constrained bottleneck. Several intermediate results will be saved for post-hoc analyses.

~Note that several methods in train_models_given_z.py have not yet been implemented. These are the scripts that process and aggregate results across models.~

:point_up: Update - these are now implemented

gwaybio commented 6 years ago

This pull request adds the framework to loop over different latent space (z) dimensionalities.

The most important script in this pull request is train_models_given_z.py. Given configuration files and a latent dimensionality, this script fits PCA, ICA, NMF, ADAGE, and Tybalt models on pan-cancer RNA-seq data for a given number of iterations, and outputs the results of several evaluations along with data for later analyses.
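
As a rough illustration of the "fit several algorithms at one z across seeds" idea, here is a minimal sketch using scikit-learn. It is not the code in train_models_given_z.py: the function name `fit_models_given_z`, the result keys, and the random toy matrix are all assumptions for illustration, and ADAGE and Tybalt are left out because they need a neural network framework.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF


def fit_models_given_z(data, z_dim, seeds):
    """Fit PCA, ICA, and NMF at one latent dimensionality for several seeds.

    Hypothetical helper: returns one record per (algorithm, seed) holding the
    z matrix (samples x z_dim), the weight matrix (z_dim x features), and the
    mean squared reconstruction error.
    """
    results = []
    for seed in seeds:
        models = {
            "pca": PCA(n_components=z_dim, random_state=seed),
            "ica": FastICA(n_components=z_dim, random_state=seed, max_iter=1000),
            "nmf": NMF(n_components=z_dim, init="nndsvda",
                       random_state=seed, max_iter=500),
        }
        for name, model in models.items():
            z = model.fit_transform(data)          # compressed representation
            recon = model.inverse_transform(z)     # map back to input space
            results.append({
                "algorithm": name,
                "seed": seed,
                "z": z,
                "weights": model.components_,
                "reconstruction_error": float(np.mean((data - recon) ** 2)),
            })
    return results


# Toy stand-in for zero-one scaled expression data (non-negative, as NMF requires).
rng = np.random.default_rng(0)
toy_data = rng.random((60, 20))
results = fit_models_given_z(toy_data, z_dim=5, seeds=[0, 1])
```

Each record keeps the z and weight matrices, so they can be written to disk or compared across seeds afterwards.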

Note that the evaluations presented in this script are immediate, i.e., they depend on and measure fits repeated across seeds. These evaluations therefore measure the stability of solutions across iterations. They include:

  1. Reconstruction errors across all models for each seed iteration (this includes separating loss of Tybalt models into KL divergence and reconstruction)
  2. Within-algorithm correlation matrix determinants (for weight and z matrices). These measure the relative magnitude of correlation across iterations, where a determinant converging to 0 indicates perfect stability.
  3. Across-algorithm correlation matrix determinants (for the "best" models per algorithm specifically)
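
To make the first item concrete, the sketch below splits a VAE loss into its reconstruction and KL divergence parts with NumPy. It assumes the common formulation with a binary cross-entropy reconstruction term and a diagonal Gaussian encoder; the function name `vae_loss_terms` and the toy inputs are hypothetical, and any weighting applied to the KL term in the actual models is not shown here.

```python
import numpy as np


def vae_loss_terms(x, x_recon, z_mean, z_log_var, eps=1e-7):
    """Return (reconstruction, KL) loss terms, each averaged over samples.

    Assumes inputs scaled to [0, 1] and an encoder that outputs the mean and
    log variance of a diagonal Gaussian over z.
    """
    x_recon = np.clip(x_recon, eps, 1 - eps)  # avoid log(0)
    # Binary cross-entropy, summed over features for each sample
    bce = -(x * np.log(x_recon) + (1 - x) * np.log(1 - x_recon))
    reconstruction = bce.sum(axis=1)
    # KL divergence between N(z_mean, exp(z_log_var)) and N(0, I)
    kl = -0.5 * np.sum(1 + z_log_var - z_mean ** 2 - np.exp(z_log_var), axis=1)
    return reconstruction.mean(), kl.mean()


rng = np.random.default_rng(0)
x = rng.random((8, 10))
x_recon = rng.random((8, 10))

# An encoder that outputs exactly the prior has zero KL divergence.
recon_prior, kl_prior = vae_loss_terms(x, x_recon, np.zeros((8, 3)), np.zeros((8, 3)))
# Moving the encoder mean away from 0 raises only the KL term.
recon_shift, kl_shift = vae_loss_terms(x, x_recon, np.ones((8, 3)), np.zeros((8, 3)))
```

Tracking the two terms separately per seed shows whether differences in total loss come from reconstruction quality or from how far the encoder drifts from the prior.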

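The determinant of a correlation matrix indicates the density of correlation across either latent space components or weight matrix features. A determinant near 1 means the columns are nearly uncorrelated, while a determinant near 0 means they are highly correlated.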
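
The behavior of this determinant can be checked on toy data. The sketch below is one possible way to compute it, not necessarily how the script does: it stacks the columns from several runs side by side and takes the determinant of their correlation matrix. The helper name `correlation_determinant` and the simulated runs are assumptions.

```python
import numpy as np


def correlation_determinant(matrices):
    """Determinant of the correlation matrix of columns pooled across runs.

    Each matrix is (rows x components) from one training run; columns from
    all runs are stacked side by side before correlating.
    """
    stacked = np.hstack(matrices)
    corr = np.corrcoef(stacked, rowvar=False)  # columns are the variables
    return float(np.linalg.det(corr))


rng = np.random.default_rng(0)
run_a = rng.normal(size=(200, 3))

# "Stable" case: a second run recovers almost the same components.
run_b_stable = run_a + 0.01 * rng.normal(size=(200, 3))
# "Unstable" case: a second run finds unrelated components.
run_b_unstable = rng.normal(size=(200, 3))

det_stable = correlation_determinant([run_a, run_b_stable])
det_unstable = correlation_determinant([run_a, run_b_unstable])
```

Here `det_stable` is close to 0 because each component has a near duplicate in the other run, while `det_unstable` stays much closer to 1.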

The script also outputs data for additional post-hoc analyses, some of which @jaclyn-taroni and I have already discussed. The data include:

  1. Weight matrices for all algorithms
  2. Z matrices for all algorithms
  3. Training histories for both Tybalt and ADAGE across training epochs.

gwaybio commented 6 years ago

Sorry for the influx of PR review requests... They should slow down a bit after this one