ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu
GNU General Public License v3.0
317 stars 76 forks source link

Inconsistent mapping of volume PCA clusters to UMAP plot between `analyze_landscape` and `analyze_landscape_full` #414

Closed cbeck22 closed 1 month ago

cbeck22 commented 1 month ago

Hi!

I am currently using cryoDRGN v3.3.3 and have been working with the analyze_landscape command. From my understanding, this command performs PCA on the 1000 volumes from the kmeans1000 directory and plots each particle cluster onto the UMAP plot generated by the analyze command in the filtering notebook.

After filtering out one outlier, I reran the command on the remaining 999 volumes, which produced the following histograms and plots. As shown, the clusters from the volume PCA align well with the clusters in the UMAP plot: image

However, when I applied the same clustering analysis in the Jupyter notebook generated by the analyze_landscape_full command, I observed a completely different mapping of the same 999 volumes onto the UMAP plot. In this case, the clusters identified by the PCA appear to be randomly dispersed: image image image image

I am currently attempting to write my own code in the analyze_landscape_full notebook to plot the clusters onto the UMAP plot manually. Any insights into why the behavior of this notebook differs from the analyze_landscape command would be greatly appreciated!

I'm also not very clear on what exactly analyze_landscape_full does. The state_particle_counts.png histogram generated by the analyze_landscape command already seems to cluster the entire dataset of ~1 million particles. Why does the analyze_landcape_full command generate additional training volumes, and why does it use a neural net to assign all of the particles to clusters?

Best, cbeck

cbeck22 commented 1 month ago

Actually, I suspect I must have made a mistake somewhere...I'll update this when I find out what I did wrong.

Update: Ah, I think I understand now. When running analyze_landscape, I used the --vol-ind option to provide an index .pkl file to filter out an outlier I identified in a previous analyze_landscape run. The PCA was then performed on 999 volumes, not the original 1000.

However, analyze_landscape_full doesn't have the --vol-ind option to filter volumes, and if I manipulate cells 4, 6, and 7: centers_ind = np.loadtxt(f'{landscape_dir}/kmeans{K}/centers_ind.txt').astype(int) vol_pc = utils.load_pkl(f'{landscape_dir}/vol_pca_{K}.pkl') kmeans_labels = utils.load_pkl(f'{landscape_dir}/kmeans{K}/labels.pkl')

to refer to the 999 volumes I filtered using cryodrgn_utils and analyze_landscape, this results in a discrepancy that messes up the plotting, because some of the particles in the full dataset would have been assigned to the 1 volume that I filtered out.

This made me realize I had some misconceptions about what analyze_landscape and analyze_landscape_full both do.

  1. The clustering of the full dataset in analyze_landscape occurs when the kmeans algorithm identifies 1000 clusters. analyze_landscape selects a representative volume from each of the 1000 kmeans clusters and runs a PCA directly on the .mrc files. Am I right in thinking that the agglomerative clustering algorithm clusters the 1000 representative volumes directly on their .mrc files and NOT on the PCA coordinates?

  2. I assumed that analyze_landscape_full uses agglomerative clustering to cluster the full dataset of particles after doing the PCA analysis on the whole dataset. In reality, the Jupyter notebook just fetches the kmeans1000 clusters calculated by analyze_landscape and plots those.

So, a few questions:

  1. Is there a reason why the notebook doesn't run agglomerative clustering on the full dataset? Is it too computationally intensive?
  2. What is the purpose of the neural net for mapping the full dataset to the PCA landscape calculated by analyze_landscape for the 1000 kmeans representative volumes? Is it too computationally intensive to run PCA on the .mrc files of the full dataset, and does the neural net somehow interpolate what PCA values the rest of the dataset's particles should have?
  3. Would you be able to add some more information to the analyze_landscape_full documentation to provide some more information on what it does?
michal-g commented 1 month ago

Hey, I'm making this a Discussion as there isn't an outright issue being described here, although as I've mentioned in e.g. #413 we are working on some overhauls of the analysis_landscape commands as well as an updated manuscript that should address your point re: analyze_landscape_full documentation.

I will get back to soon about your other points, but in the meantime, it would be helpful if you took a look at some of our older material for the landscape analyses, the former describing the sketching used in analyze_landscape, and the latter describing the neural net mapping used in analyze_landscape_full: cryoDRGN_landscape_analysis_thesis_chapter.pdf Reshaping_the_Conformational_Space.pdf