ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu
GNU General Public License v3.0

extracting particles from latent space? #15

Closed heejongkim closed 3 years ago

heejongkim commented 4 years ago

Hi,

I think this is a pretty naive question, but it would be fantastic if it's possible. (I hope I didn't miss this in the README.)

From PCA or UMAP, would it be possible to extract particle information for specific class(es)? If so, I'd like to take a subset of specific "classes" back to other software to refine or classify further.

Thanks for the fantastic software again.

best, heejong

kimdn commented 4 years ago

Hi Ellen,

I have the same feature request. It would be helpful to extract particles (or particle info, so that I can extract them via RELION) from the broadly divided classes in the UMAP representation.

For now, I just use eval_vol to extract particles from a train_vae run with z=1.

Thank you, Doonam

zhonge commented 4 years ago

Hi Heejong and Doonam,

Great question, and a somewhat multifaceted answer depending on what you're looking to do.

The Jupyter notebook template from cryodrgn analyze has the functionality to interactively select particles. Look for the "Interactive selection" section. One catch: after making a selection, you must execute the next cell to store the selected particles in the ind_selected variable. The selected particle images can then be viewed later on in the notebook, or saved as indices into the original dataset.

You could also select specific classes from the default k-means clustering with something like:

ind_selected = np.where(kmeans_labels == 8)[0] # keep one cluster, 8 in this example

keep_clusters = (0,1,3,10) # keep multiple clusters
ind_selected = np.array([i for i,label in enumerate(kmeans_labels) if label in keep_clusters])
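
For larger datasets, the same multi-cluster selection can be written without a Python loop. The following is a minimal, self-contained sketch; the kmeans_labels array here is a made-up stand-in for the one defined in the analysis notebook, and np.isin is simply a vectorized alternative to the list comprehension, not something the notebook itself uses.

```python
import numpy as np

# Hypothetical stand-in for the notebook's kmeans_labels (one label per particle).
kmeans_labels = np.array([0, 8, 3, 8, 1, 10, 5, 0])

keep_clusters = (0, 1, 3, 10)  # clusters to keep

# np.isin builds a boolean mask over particles; np.where converts it
# into a sorted array of particle indices.
ind_selected = np.where(np.isin(kmeans_labels, keep_clusters))[0]
print(ind_selected)
```

The result is the same kind of integer index array as in the snippets above, so it can be used anywhere ind_selected is expected.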

Once you have a selection saved as a .pkl file, then you can provide that to cryodrgn with the --ind argument to train a new model on a subset of the images. With this approach, note that you have to be careful if you're doing multiple rounds of filtering in order to get the correct indices into the original dataset, i.e. if your results come from a training run that already uses an --ind subselection.
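
To illustrate the caveat about multiple rounds of filtering, here is a rough sketch of how a selection made on an already-filtered run could be mapped back to the original dataset. It assumes that indices picked during analysis are relative to the subset the model was trained on, and that the index file is a pickled NumPy array; the array values, the compose_indices helper, and the ind_round2.pkl filename are all hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np


def compose_indices(prev_ind, ind_selected):
    """Map a selection made within a subset back to original-dataset indices."""
    prev_ind = np.asarray(prev_ind)
    return prev_ind[np.asarray(ind_selected)]


# Round 1 (hypothetical): training used --ind with these original-dataset indices.
round1_ind = np.array([2, 5, 7, 11, 13])

# Round 2 (hypothetical): a selection made while analyzing that run,
# assumed to be relative to the 5-particle subset, not the full dataset.
ind_selected = np.array([0, 3, 4])

# Indices into the original dataset, suitable for a further --ind run.
orig_ind = compose_indices(round1_ind, ind_selected)

# Save as a pickled array (hypothetical filename, written to the temp directory).
out_path = os.path.join(tempfile.gettempdir(), "ind_round2.pkl")
with open(out_path, "wb") as f:
    pickle.dump(orig_ind, f)
```

If the earlier run was trained on the full dataset without --ind, no composition is needed and ind_selected can be saved directly.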

There are a few helper scripts in the utils subdirectory to filter .star, .pkl, and .mrcs files with the index array.

Lastly, I have some scripts to write out new .star files based on a (filtered) .mrcs particle stack and ctf parameters. I can clean these up and add them to the repo shortly.

Let me know if I can clarify anything!

Ellen

zhonge commented 4 years ago

Added a script (commit 6bb0e81bbce6ddb3d182ea467ce7d0f59693af61) to write out a star file from a .mrcs file, ctf.pkl, and optionally an index array .pkl. There are a lot of different file formats for particle stacks, so I kept the input type simple for now (expects just a single .mrcs file). Let me know if there are any issues. More features upon request 🙂

kimdn commented 3 years ago

Hi Ellen,

Thank you for your new update.

Unfortunately, I can't get the interactive selection to work.

I found "Interactive selection" and ran

widget, fig, ind_table = analysis.ipy_plot_interactive(df)
VBox((widget,fig,ind_table))

but the only output I see is VBox(children=(interactive(children=(Dropdown(description='xaxis', options=('UMAP1', 'UMAP2', 'PC1', 'PC2', 'P… even though the notebook says: "In the first cell, select points with the lasso tool. The table widget is dynamically updated with the most recent selection's indices."

I ran

ind_selected = ind_table.data[0].cells.values[0] # save table values
ind_selected = np.array(ind_selected)
ind_selected_not = np.array(sorted(set(np.arange(len(df))) - set(ind_selected)))

print('Selected indices:')
print(ind_selected)
print('Number of selected points:')
print(len(ind_selected))
print('Number of unselected points:')
print(len(ind_selected_not))

I see

Selected indices:
[     0      1      2 ... 104499 104500 104501]
Number of selected points:
104502
Number of unselected points:
0

This second cell looks fine, since its only purpose is to save the particle indices chosen with the lasso tool in the first cell.

Up to this interactive selection step, all cells ran fine, including

# Enable interactive widgets
!jupyter nbextension enable --py widgetsnbextension

that resulted in

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK

zhonge commented 3 years ago

I recently updated the write_starfile.py script to optionally save poses (in RELION Euler angle format) in the star file -- available in cryoDRGN v0.3.1.

Doonam, if you are still having issues with the Jupyter notebook widget not showing up, can you file a new issue? Thanks!