CryoDRGN is a neural network-based algorithm for heterogeneous cryo-EM reconstruction. In particular, the method models a continuous distribution over 3D structures by using a neural network-based representation of the volume.
The latest documentation for cryoDRGN is available in our user guide, including an overview and walkthrough of cryoDRGN installation, training and analysis. A brief quick start is provided below.
For any feedback, questions, or bugs, please file a Github issue or start a Github discussion.
Highlights from recent releases:

- `cryodrgn plot_classes` for analysis visualizations colored by a given set of class labels
- `cryodrgn backproject_voxel` produces cryoSPARC-style FSC curve plots with phase-randomization correction of automatically generated tight masks
- `cryodrgn downsample` can create a new .star or .txt image stack from the corresponding stack format instead of always writing to an .mrcs stack; it now also always puts output files into a folder
- Improvements to the `cryodrgn filter` interface, such as less intrusive annotation text and `np.array` instead of `list` output format
- The official release of cryoDRGN-ET for heterogeneous subtomogram analysis
- `cryodrgn backproject_voxel` for voxel-based homogeneous reconstruction, which now produces half-maps and a half-map FSC
- `cryodrgn direct_traversal` to generate interpolations in the conformation latent space between two points
- `--datadir` added to `cryodrgn abinit_homo` for use with .star files
- Jupyter demonstration notebooks
- `cryodrgn_utils clean` for removing extraneous output files from completed experiments
- `cryodrgn_utils fsc`, `cryodrgn_utils fsc_plot`, and `cryodrgn_utils gen_mask`, adapted from existing scripts, for calculating FSCs, plotting them, and generating masks for volumes, respectively
- `filter_star` updated to accept tilt series as well
- Bug fixes in `write_star`, `invert_contrast`, and `train_vae` (see release notes)
- The `cryodrgn filter` interface for interactive filtering of particles, as an alternative to the cryoDRGN_filter Jupyter notebook
- Faster `cryodrgn abinit_homo` (now consistent with other reconstruction tools) (https://github.com/zhonge/cryodrgn/issues/258)
- Fixes to `--preprocessed` and `--ind` (https://github.com/zhonge/cryodrgn/pull/272)
- Updated `cryodrgn train_vae` with modified positional encoding (`--pe-type geom_lowf` to revert), larger model architecture (`--enc-dim 256` and `--dec-dim 256` to revert), and accelerated mixed-precision training turned on by default (`--no-amp` to revert to single-precision training)
- Standalone utility scripts, run as `cryodrgn_utils {command}` (see `cryodrgn_utils {command} -h`), including `cryodrgn_utils write_star` for converting cryoDRGN particle selections to .star files
- `cryodrgn analyze_landscape` and `cryodrgn analyze_landscape_full` for automatic assignment of classes and conformational landscape visualization; documentation: https://www.notion.so/cryodrgn-conformational-landscape-analysis-a5af129288d54d1aa95388bdac48235a
- A Gaussian positional encoding option (`--pe-type gaussian`)
- `cryodrgn preprocess` for large datasets (in beta testing - see this Notion doc for details)
- `cryodrgn view_config`
- Accelerated mixed-precision training with `--amp`, available for NVIDIA tensor core GPUs, and parallelized training with `--multigpu`
- Renamed encoder arguments `--qdim` and `--qlayers` to `--enc-dim` and `--enc-layers`, and decoder arguments `--pdim` and `--players` to `--dec-dim` and `--dec-layers`
- Flipped `--invert-data` and `--window` to True by default
- `cryodrgn analyze`, `cryodrgn graph_traversal`, and `cryodrgn pc_traversal`

cryodrgn may be installed via pip, and we recommend installing cryodrgn in a clean conda environment.
# Create and activate conda environment
(base) $ conda create --name cryodrgn python=3.9
(base) $ conda activate cryodrgn
# install cryodrgn
(cryodrgn) $ pip install cryodrgn
You can alternatively install a newer, less stable, development version of cryodrgn
using our beta release channel:
(cryodrgn) $ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ cryodrgn --pre
More installation instructions are found in the documentation.
First resize your particle images using the cryodrgn downsample
command:
$ cryodrgn downsample -h
We recommend first downsampling images to 128x128 since larger images can take much longer to train:
$ cryodrgn downsample [input particle stack] -D 128 -o particles.128.mrcs
The maximum recommended image size is D=256, so we also recommend downsampling your images to D=256 if your images are larger than 256x256:
$ cryodrgn downsample [input particle stack] -D 256 -o particles.256.mrcs
The input file format can be a single .mrcs
file, a .txt
file containing paths to multiple .mrcs
files, a RELION
.star
file, or a cryoSPARC .cs
file. For the latter two options, if the relative paths to the .mrcs
are broken,
the argument --datadir
can be used to supply the path to where the .mrcs
files are located.
If there are memory issues with downsampling large particle stacks, add the --chunk 10000
argument to
save images as separate .mrcs
files of 10k images.
CryoDRGN expects image poses to be stored in a binary pickle format (.pkl
). Use the parse_pose_star
or
parse_pose_csparc
command to extract the poses from a .star
file or a .cs
file, respectively.
Example usage to parse image poses from a RELION 3.1 .star file:
$ cryodrgn parse_pose_star particles.star -o pose.pkl
Example usage to parse image poses from a cryoSPARC homogeneous refinement particles.cs file:
$ cryodrgn parse_pose_csparc cryosparc_P27_J3_005_particles.cs -o pose.pkl -D 300
Note: The -D
argument should be the box size of the consensus refinement (and not the downsampled
images from step 1) so that the units for translation shifts are parsed correctly.
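The output pose file can be inspected directly with Python. In our reading of the format (worth verifying against your cryoDRGN version), it stores a pickled tuple of rotation matrices and 2D translations, one row per particle. A toy sketch that writes and then reads a pickle in that assumed layout:

```python
import pickle

import numpy as np

# Toy pose pickle in the assumed (rotations, translations) layout
n_imgs = 5
rots = np.tile(np.eye(3, dtype=np.float32), (n_imgs, 1, 1))  # (N, 3, 3) rotation matrices
trans = np.zeros((n_imgs, 2), dtype=np.float32)              # (N, 2) shifts, fraction of box size
with open("toy_pose.pkl", "wb") as f:
    pickle.dump((rots, trans), f)

# Inspect it the same way you would a real pose.pkl
with open("toy_pose.pkl", "rb") as f:
    rots_in, trans_in = pickle.load(f)
print(rots_in.shape, trans_in.shape)  # (5, 3, 3) (5, 2)
```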
CryoDRGN expects CTF parameters to be stored in a binary pickle format (.pkl
).
Use the parse_ctf_star
or parse_ctf_csparc
command to extract the relevant CTF parameters from a .star
file
or a .cs
file, respectively.
Example usage for a .star file:
$ cryodrgn parse_ctf_star particles.star -o ctf.pkl
If the box size and Angstrom/pixel values are not included in the .star file under fields _rlnImageSize
and
_rlnImagePixelSize
respectively, the -D
and --Apix
arguments to parse_ctf_star
should be used instead to
provide the original parameters of the input file (before any downsampling):
$ cryodrgn parse_ctf_star particles.star -D 300 --Apix 1.03 -o ctf.pkl
Example usage for a .cs file:
$ cryodrgn parse_ctf_csparc cryosparc_P27_J3_005_particles.cs -o ctf.pkl
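The CTF file is likewise a plain pickle holding one row of CTF parameters per particle. The nine-column layout in the sketch below (box size, pixel size, defocus U/V, astigmatism angle, voltage, spherical aberration, amplitude contrast, phase shift) is an assumption to check against your own ctf.pkl:

```python
import pickle

import numpy as np

# Assumed per-particle CTF columns:
# [D, Apix, dfU, dfV, dfAngle, voltage(kV), Cs(mm), amp_contrast, phase_shift(deg)]
n_imgs = 3
ctf = np.tile(
    np.array([300, 1.03, 21000.0, 20500.0, 45.0, 300.0, 2.7, 0.1, 0.0], dtype=np.float32),
    (n_imgs, 1),
)
with open("toy_ctf.pkl", "wb") as f:
    pickle.dump(ctf, f)

with open("toy_ctf.pkl", "rb") as f:
    ctf_in = pickle.load(f)
print(ctf_in.shape)  # (3, 9): one row of parameters per particle
```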
Next, test that pose and CTF parameters were parsed correctly using the voxel-based backprojection script. The goal is to quickly verify that there are no major problems with the extracted values and that the output structure resembles the structure from the consensus reconstruction before training.
Example usage:
$ cryodrgn backproject_voxel projections.128.mrcs \
--poses pose.pkl \
--ctf ctf.pkl \
-o backproject.128 \
--first 10000
The output structure backproject.128/backproject.mrc
will not match the consensus reconstruction exactly
as the backproject_voxel
command backprojects phase-flipped particles onto the voxel grid, and because here we
performed backprojection using only the first 10k images in the stack for quicker results.
If the structure is too noisy, try using more images with --first, or the
entire stack (by omitting --first).
Note: If the volume does not resemble your structure, you may need to use the flag --uninvert-data
.
This flips the data sign (e.g. light-on-dark or dark-on-light), which may be needed depending on the
convention used in upstream processing tools.
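As background on the FSC curves that backproject_voxel reports between its half-maps: a Fourier shell correlation is the normalized cross-correlation of two volumes' Fourier coefficients, computed shell by shell. A generic numpy sketch (not cryoDRGN's implementation, which adds masking and phase randomization):

```python
import numpy as np

def fsc(vol1, vol2):
    """Fourier shell correlation between two cubic volumes of equal size."""
    d = vol1.shape[0]
    f1 = np.fft.fftshift(np.fft.fftn(vol1))
    f2 = np.fft.fftshift(np.fft.fftn(vol2))
    # Distance of each voxel from the (shifted) zero-frequency origin
    coords = np.arange(d) - d // 2
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    r = np.sqrt(x**2 + y**2 + z**2).astype(int)
    curve = []
    for shell in range(d // 2):
        mask = r == shell
        num = np.sum(f1[mask] * np.conj(f2[mask]))
        denom = np.sqrt(np.sum(np.abs(f1[mask]) ** 2) * np.sum(np.abs(f2[mask]) ** 2))
        curve.append((num / denom).real if denom > 0 else 0.0)
    return np.array(curve)

# Identical volumes correlate perfectly at every resolution shell
rng = np.random.default_rng(0)
vol = rng.standard_normal((16, 16, 16))
print(fsc(vol, vol)[:3])
```

The resolution at which such a curve crosses a threshold (commonly 0.143 for half-maps) is what is typically reported as the map resolution.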
When the input images (.mrcs), poses (.pkl), and CTF parameters (.pkl) have been prepared, a cryoDRGN model can be trained with the following command:
$ cryodrgn train_vae -h
Many of the parameters of this script have sensible defaults. The required arguments are:

- an input image stack (.mrcs or other listed file types)
- --poses, image poses (.pkl) that correspond to the input images
- --ctf, CTF parameters (.pkl), unless phase-flipped images are used
- --zdim, the dimension of the latent variable
- -o, a clean output directory for saving results

Additional parameters which are typically set include:

- -n, number of epochs to train
- --uninvert-data, use if particles are dark on light (negative stain format)
- --enc-layers, --enc-dim, --dec-layers, --dec-dim to modify the network architecture
- --multigpu to enable parallelized training across multiple GPUs

1) It is highly recommended to first train on lower-resolution images (e.g. D=128) to sanity check results and perform any particle filtering.
Example command to train a cryoDRGN model for 25 epochs on an image dataset projections.128.mrcs
with poses pose.pkl
and ctf parameters ctf.pkl
:
# 8-D latent variable model, small images
$ cryodrgn train_vae projections.128.mrcs \
--poses pose.pkl \
--ctf ctf.pkl \
--zdim 8 -n 25 \
-o 00_cryodrgn128
2) After validation, pose optimization, and any necessary particle filtering, train on the full-resolution images (up to D=256):
Example command to train a cryoDRGN model for 25 epochs on an image dataset projections.256.mrcs
with poses pose.pkl
and ctf parameters ctf.pkl
:
# 8-D latent variable model, larger images
$ cryodrgn train_vae projections.256.mrcs \
--poses pose.pkl \
--ctf ctf.pkl \
--zdim 8 -n 25 \
-o 01_cryodrgn256
The number of epochs -n
refers to the number of full passes through the dataset for training, and should be modified
depending on the number of particles in the dataset. For a 100k particle dataset on 1 V100 GPU,
the above settings required ~12 min/epoch for D=128 images and ~47 min/epoch for D=256 images.
If you would like to train longer, a training job can be extended with the --load
argument.
For example to extend the training of the previous example to 50 epochs:
$ cryodrgn train_vae projections.256.mrcs \
--poses pose.pkl \
--ctf ctf.pkl \
--zdim 8 -n 50 \
-o 01_cryodrgn256 \
--load 01_cryodrgn256/weights.24.pkl # 0-based indexing
Use cryoDRGN's --multigpu
flag to enable parallelized training across all detected GPUs on the machine.
To select specific GPUs for cryoDRGN to run on, use the environmental variable CUDA_VISIBLE_DEVICES
, e.g.:
$ cryodrgn train_vae ... # Run on GPU 0
$ cryodrgn train_vae ... --multigpu # Run on all GPUs on the machine
$ CUDA_VISIBLE_DEVICES=0,3 cryodrgn train_vae ... --multigpu # Run on GPU 0,3
We recommend using --multigpu
for large images, e.g. D=256.
Note that GPU computation may not be the training bottleneck for smaller images (D=128).
In this case, --multigpu
may not speed up training (while taking up additional compute resources).
With --multigpu
, the batch size is multiplied by the number of available GPUs to better utilize GPU resources.
We note that GPU utilization may be further improved by increasing the batch size (e.g. -b 16
), however,
faster wall-clock time per epoch does not necessarily lead to faster model training since the training dynamics
are affected (fewer model updates per epoch with larger -b
),
and using --multigpu
may require increasing the total number of epochs.
Depending on the quality of the consensus reconstruction, image poses may contain errors.
Image poses may be locally refined using the --do-pose-sgd
flag, however, we recommend reaching out to the
developers for recommended training settings.
Once the model has finished training, the output directory will contain a configuration file config.yaml,
neural network weights weights.pkl, image poses pose.pkl (if performing pose SGD),
and the latent embeddings for each image z.pkl.
The latent embeddings are provided in the same order as the input particles.
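Concretely, z.pkl can be loaded with pickle as a single (N_particles x zdim) array whose rows follow the input particle order. The sketch below builds a toy embedding file, loads it, and saves a pickled index selection of the kind used for particle filtering (the exact file layouts are assumptions to verify against your own outputs):

```python
import pickle

import numpy as np

# Toy z.pkl: 100 particles embedded in an 8-D latent space
z = np.random.default_rng(0).standard_normal((100, 8)).astype(np.float32)
with open("toy_z.pkl", "wb") as f:
    pickle.dump(z, f)

with open("toy_z.pkl", "rb") as f:
    z_in = pickle.load(f)

# Example selection: keep particles whose latent norm is below twice the
# expected RMS norm, and save their indices as a pickled index array
keep = np.where(np.linalg.norm(z_in, axis=1) < 2 * np.sqrt(8))[0]
with open("ind_keep.pkl", "wb") as f:
    pickle.dump(keep, f)
print(z_in.shape, len(keep))
```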
To analyze these results, use the cryodrgn analyze
command to visualize the latent space and generate structures.
cryodrgn analyze
will also provide a template jupyter notebook for further interactive visualization and analysis.
$ cryodrgn analyze -h
This script runs a series of standard analyses:
Example usage to analyze results from the directory 01_cryodrgn256
containing results after 25 epochs of training:
$ cryodrgn analyze 01_cryodrgn256 24 --Apix 1.31 # 24 for 0-based indexing of epoch numbers
Notes:
[1] Volumes are generated after k-means clustering of the latent embeddings with k=20 by default.
Note that we use k-means clustering here not to identify clusters, but to segment the latent space and
generate structures from different regions of the latent space.
The number of structures that are generated may be increased with the option --ksample
.
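The segmentation idea can be illustrated with a small plain-numpy k-means pass: fit k centroids to toy embeddings, then snap each centroid to its nearest data point, so that every generated structure corresponds to an actual particle embedding. This illustrates the concept only and is not cryoDRGN's code:

```python
import numpy as np

def kmeans_segment(z, k, n_iter=20, seed=0):
    """Lloyd's k-means, then snap each centroid to the nearest data point."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center, then update the centers
        d = np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = z[labels == j].mean(axis=0)
    # "On-data" representatives: index of the particle closest to each center
    d = np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=0)

z = np.random.default_rng(1).standard_normal((500, 8))
reps = kmeans_segment(z, k=20)
print(reps.shape)  # 20 particle indices whose z values can be fed to eval_vol
```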
[2] The cryodrgn analyze
command chains together a series of calls to cryodrgn eval_vol
and other scripts
that can be run separately for more flexibility.
These scripts are located in the analysis_scripts
directory within the source code.
[3] In particular, you may find it useful to perform filtering of particles separately from other analyses. This can be
done using our interactive interface available from the command line: cryodrgn filter 01_cryodrgn256.
[4] --Apix
only needs to be given if it is not present (or not accurate) in the CTF file that was used in training.
A simple way of generating additional volumes is to increase the number of k-means samples in cryodrgn analyze
by using the flag --ksample 100
(for 100 structures).
For additional flexibility, cryodrgn eval_vol
may be called directly:
$ cryodrgn eval_vol -h
Example usage:
To generate a volume at a single value of the latent variable:
$ cryodrgn eval_vol [YOUR_WORKDIR]/weights.pkl --config [YOUR_WORKDIR]/config.yaml -z ZVALUE -o reconstruct.mrc
The number of inputs for -z
must match the dimension of your latent variable.
Or, to generate a trajectory of structures between defined start and end points,
use the --z-start and --z-end arguments:
$ cryodrgn eval_vol [YOUR_WORKDIR]/weights.pkl --config [YOUR_WORKDIR]/config.yaml -o [WORKDIR]/trajectory \
--z-start -3 --z-end 3 -n 20
This example generates 20 structures at evenly spaced values between z=[-3,3], assuming a 1-dimensional latent variable model.
Finally, a series of structures can be generated using values of z given in a file specified by the argument --zfile:
$ cryodrgn eval_vol [WORKDIR]/weights.pkl --config [WORKDIR]/config.yaml --zfile zvalues.txt -o [WORKDIR]/trajectory
The input to --zfile
is expected to be an array of dimension (N_volumes x zdim), loaded with np.loadtxt.
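For example (with arbitrary placeholder anchor points), a straight-line path through an 8-D latent space can be written in that format with numpy:

```python
import numpy as np

# Two anchor points in an 8-D latent space (placeholder values); in practice
# these might be the z values of two particles of interest taken from z.pkl
z_start = np.zeros(8)
z_end = np.ones(8)

# 20 evenly spaced z values along the line from z_start to z_end
traj = np.linspace(z_start, z_end, num=20)  # shape (20, 8)
np.savetxt("zvalues.txt", traj)             # N_volumes x zdim, readable by np.loadtxt
```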
Two additional commands can be used in conjunction with cryodrgn eval_vol
to generate trajectories:
$ cryodrgn pc_traversal -h
$ cryodrgn graph_traversal -h
These scripts produce a text file of z values that can be input to cryodrgn eval_vol
to generate a series of
structures that can be visualized as a trajectory in ChimeraX (https://www.cgl.ucsf.edu/chimerax).
Documentation: https://ez-lab.gitbook.io/cryodrgn/cryodrgn-graph-traversal-for-making-long-trajectories
NEW in version 1.0: There are two additional tools cryodrgn analyze_landscape
and cryodrgn analyze_landscape_full
for more comprehensive and automated analyses of cryodrgn results.
Documentation: https://ez-lab.gitbook.io/cryodrgn/cryodrgn-conformational-landscape-analysis
To perform ab initio heterogeneous reconstruction, use cryodrgn abinit_het
.
The arguments are similar to cryodrgn train_vae
, but the --poses
argument is not required.
For homogeneous reconstruction, use cryodrgn abinit_homo
.
Documentation: https://ez-lab.gitbook.io/cryodrgn/cryodrgn2-ab-initio-reconstruction
Available in beta release starting in version 3.x. Documentation for getting started can be found in the user guide. Please reach out if you have any questions!
For a complete description of the method, see:
An earlier version of this work appeared at ICLR 2020:
CryoDRGN2's ab initio reconstruction algorithms were published at ICCV:
A protocols paper that describes the analysis of the EMPIAR-10076 assembling ribosome dataset:
Heterogeneous subtomogram averaging:
Please submit any bug reports, feature requests, or general usage feedback as a github issue or discussion.