Closed cbeck22 closed 1 month ago
Actually, I suspect I must have made a mistake somewhere...I'll update this when I find out what I did wrong.
Update:
Ah, I think I understand now. When running analyze_landscape
, I used the --vol-ind
option to provide an index .pkl file to filter out an outlier I identified in a previous analyze_landscape
run. The PCA was then performed on 999 volumes, not the original 1000.
However, analyze_landscape_full
doesn't have the --vol-ind
option to filter volumes, and if I manipulate cells 4, 6, and 7:
centers_ind = np.loadtxt(f'{landscape_dir}/kmeans{K}/centers_ind.txt').astype(int)
vol_pc = utils.load_pkl(f'{landscape_dir}/vol_pca_{K}.pkl')
kmeans_labels = utils.load_pkl(f'{landscape_dir}/kmeans{K}/labels.pkl')
to refer to the 999 volumes I filtered using cryodrgn_utils
and analyze_landscape
, this results in a discrepancy that messes up the plotting, because some of the particles in the full dataset would have been assigned to the 1 volume that I filtered out.
This made me realize I had some misconceptions about what analyze_landscape
and analyze_landscape_full
both do.
The clustering of the full dataset in analyze_landscape
occurs when the kmeans algorithm identifies 1000 clusters. analyze_landscape
selects a representative volume from each of the 1000 kmeans clusters and runs a PCA directly on the .mrc
files. Am I right in thinking that the agglomerative clustering algorithm clusters the 1000 representative volumes directly on their .mrc
files and NOT on the PCA coordinates?
I assumed that analyze_landscape_full
uses agglomerative clustering to cluster the full dataset of particles after doing the PCA analysis on the whole dataset. In reality, the Jupyter notebook just fetches the kmeans1000 clusters calculated by analyze_landscape
and plots those.
So, a few questions:
analyze_landscape
for the 1000 kmeans representative volumes? Is it too computationally intensive to run PCA on the .mrc
files of the full dataset, and does the neural net somehow interpolate what PCA values the rest of the dataset's particles should have?analyze_landscape_full
documentation to provide some more information on what it does?Hey, I'm making this a Discussion as there isn't an outright issue being described here, although as I've mentioned in e.g. #413 we are working on some overhauls of the analysis_landscape
commands as well as an updated manuscript that should address your point re: analyze_landscape_full
documentation.
I will get back to soon about your other points, but in the meantime, it would be helpful if you took a look at some of our older material for the landscape analyses, the former describing the sketching used in analyze_landscape
, and the latter describing the neural net mapping used in analyze_landscape_full
:
cryoDRGN_landscape_analysis_thesis_chapter.pdf
Reshaping_the_Conformational_Space.pdf
Hi!
I am currently using cryoDRGN v3.3.3 and have been working with the analyze_landscape command. From my understanding, this command performs PCA on the 1000 volumes from the kmeans1000 directory and plots each particle cluster onto the UMAP plot generated by the analyze command in the filtering notebook.
After filtering out one outlier, I reran the command on the remaining 999 volumes, which produced the following histograms and plots. As shown, the clusters from the volume PCA align well with the clusters in the UMAP plot:
However, when I applied the same clustering analysis in the Jupyter notebook generated by the analyze_landscape_full command, I observed a completely different mapping of the same 999 volumes onto the UMAP plot. In this case, the clusters identified by the PCA appear to be randomly dispersed:
I am currently attempting to write my own code in the analyze_landscape_full notebook to plot the clusters onto the UMAP plot manually. Any insights into why the behavior of this notebook differs from the analyze_landscape command would be greatly appreciated!
I'm also not very clear on what exactly
analyze_landscape_full
does. Thestate_particle_counts.png
histogram generated by theanalyze_landscape
command already seems to cluster the entire dataset of ~1 million particles. Why does theanalyze_landcape_full
command generate additional training volumes, and why does it use a neural net to assign all of the particles to clusters?Best, cbeck