ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu
GNU General Public License v3.0

Run cryodrgn using subtracted particles #105

Open helgepat opened 2 years ago

helgepat commented 2 years ago

Hi,

I just finished my first cryodrgn run on a ribosome dataset, and I was wondering whether it is possible to restrict the region of interest by using subtracted particles (e.g. from Relion) instead of full ribosomes. I am asking because, when I look at the initial volumes from the kmeans20 folder, most of the variation is in the L1 stalk and not in my ligand. I know from subsorting in Relion that the ligand is quite heterogeneous, but cryodrgn does not seem to detect it.

Many regards, Helge

zhonge commented 2 years ago

Can you share the command that you ran? I'm curious about the architecture settings you used -- thanks!

helgepat commented 2 years ago

I extracted my ~460k particles from Relion in a 256 px box with a pixel size of 1.26 and then downsampled them to 128 px using cryodrgn.

The commands I ran for the initial model are:

cryodrgn preprocess job735_Bx256.mrcs -D 128 -o job735_Bx128.mrcs

cryodrgn train_vae data/job735_Bx128.ft.txt --preprocessed --ctf data/ctf.pkl --poses data/poses.pkl --zdim 8 -n 50 -o output/00_vae128 --multigpu --load output/00_vae128/weights.1.pkl > 2022-01-31_vae128c.log
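For reference, the sampling that results from this downsampling step follows from simple arithmetic. The sketch below assumes the pixel size is given in Å/px (the unit is not stated above) and that the box is downsampled without changing the field of view; `downsampled_apix` is a hypothetical helper name, not a cryodrgn command.

```shell
# Hypothetical helper: pixel size after downsampling a box from $2 px to $3 px.
# Assumes the field of view is unchanged, so pixel size scales by old_box / new_box.
downsampled_apix() {
    awk -v a="$1" -v b="$2" -v d="$3" 'BEGIN { printf "%.2f\n", a * b / d }'
}

# Values from the post above: 1.26 (assumed A/px), 256 px box downsampled to 128 px.
apix=$(downsampled_apix 1.26 256 128)

# The Nyquist limit is twice the pixel size.
nyquist=$(awk -v p="$apix" 'BEGIN { printf "%.2f\n", 2 * p }')

echo "pixel size: $apix, Nyquist limit: $nyquist"
# prints "pixel size: 2.52, Nyquist limit: 5.04"
```

Under these assumptions, features finer than about 5 Å cannot be represented in the 128 px training run.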

zhonge commented 2 years ago

I can only speculate, but the model usually tends to learn the larger features of variation first, as they contribute more to the loss function, especially if you are in the regime where you are limited by the representational capacity (number of parameters) of the neural network. So perhaps the L1 stalk motion is dominant over your ligand variability with the current architecture settings. You could try a larger network architecture next (--enc-dim 1024 --dec-dim 1024) and see what happens. The recommended protocol is described here: https://www.notion.so/cryoDRGN-EMPIAR-10076-tutorial-c8728dcc88e744c8827447c3ff094d19#d5707f1475d74ed482a1322506def749
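As a rough sketch of that suggestion, the earlier command could be rerun with the two larger-architecture flags added. The input paths and remaining options are copied from the command posted above; the output directory name is a hypothetical choice, and `--load` is left out on the assumption that this is a fresh run rather than a resumed one.

```shell
# Hypothetical output directory for the larger-architecture run.
outdir=output/01_vae128_large

# Same inputs as the original run, with --enc-dim/--dec-dim raised to 1024.
cmd="cryodrgn train_vae data/job735_Bx128.ft.txt --preprocessed \
--ctf data/ctf.pkl --poses data/poses.pkl --zdim 8 -n 50 \
--enc-dim 1024 --dec-dim 1024 --multigpu -o $outdir"

# Print the command for inspection; run it with: eval "$cmd"
echo "$cmd"
```

Writing to a new output directory keeps the results of the default-architecture run available for comparison.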

I usually recommend the default architecture only as an initial sanity check of the results, since it trains faster and seems to work for detecting outliers (e.g. junk particles). Outliers that are easily spotted in the latent representation can be removed from the dataset using the cryoDRGN_filtering.ipynb Jupyter notebook from cryodrgn analyze. In some datasets I’ve tried, these junk particles interfere with training, since the network tries to model the heterogeneity of the junk before the heterogeneity in the actual protein complex (e.g. the SARS-CoV-2 spike protein in my CCP-EM talk https://www.ccpem.ac.uk/downloads/symposium/2021/zhong_day2.pdf).
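A minimal sketch of retraining on the filtered particles, assuming the filtering notebook has saved the indices of the particles to keep as a pickle file. The file name `ind_keep.pkl` and the output directory are hypothetical; the other paths come from the command posted earlier in the thread. The `--ind` option of train_vae restricts training to the listed particle indices; whether it combines cleanly with `--preprocessed` inputs is an assumption worth checking against `cryodrgn train_vae -h`.

```shell
# Hypothetical path to the indices saved from cryoDRGN_filtering.ipynb.
ind_file=output/00_vae128/ind_keep.pkl
# Hypothetical output directory for the filtered run.
filt_outdir=output/02_vae128_filt

# Train only on the kept particles by passing the index file with --ind.
filt_cmd="cryodrgn train_vae data/job735_Bx128.ft.txt --preprocessed \
--ctf data/ctf.pkl --poses data/poses.pkl --zdim 8 -n 50 \
--ind $ind_file -o $filt_outdir"

# Print the command for inspection; run it with: eval "$filt_cmd"
echo "$filt_cmd"
```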

Is the nature of your ligand heterogeneity compositional (binding/unbinding) or conformational (moving around)?