ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu
GNU General Public License v3.0

`analyze_landscape` fails after volume generation #254

Open olibclarke opened 1 year ago

olibclarke commented 1 year ago

Hi, using the subset mentioned here (https://github.com/zhonge/cryodrgn/issues/249), cryodrgn analyze runs fine, but cryodrgn analyze_landscape fails.

Command: cryodrgn analyze_landscape test/00_vae_80/ 21 -d 80 --Apix 3.32

Error:

(INFO) (eval_vol.py) (26-Mar-23 12:11:40) [ 0.12438965 -0.50390625  0.90771484 -0.54882812  0.20532227  2.69921875
  0.54736328  1.07421875]
(INFO) (eval_vol.py) (26-Mar-23 12:11:41) Finished in 0:09:59.743154
(INFO) (analyze_landscape.py) (26-Mar-23 12:11:41) Copying UMAP from /ntfs_mount/ubuntu/processing/cryosparc_projects/francesca/P40/J1649/test/00_vae_80/analyze.21/umap.pkl
Traceback (most recent call last):
  File "/home/user/software/miniconda3/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/__main__.py", line 72, in main
    args.func(args)
  File "/home/user/software/miniconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze_landscape.py", line 492, in main
    raise NotImplementedError
NotImplementedError

EDIT: Ah - maybe this is because I hadn't run cryodrgn analyze first for this particular iteration - I didn't realize this was necessary. Maybe it would be better to check for the existence of required outputs of analyze before starting volume generation?

zhonge commented 1 year ago

Thanks for the post. Yep, you should run cryodrgn analyze before cryodrgn analyze_landscape. I can add a note/check for that.
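A minimal sketch of the kind of pre-flight check discussed above, assuming the only required input is the `analyze.<epoch>/umap.pkl` file that appears in the log. The function names and the `required` tuple are hypothetical and are not part of cryodrgn; the real command may depend on other `analyze` outputs as well.

```python
from pathlib import Path


def missing_analyze_outputs(workdir, epoch, required=("umap.pkl",)):
    """Return the required `cryodrgn analyze` outputs that are absent.

    Assumption: outputs live in <workdir>/analyze.<epoch>/, as in the
    path printed in the log above. `required` is illustrative only.
    """
    analyze_dir = Path(workdir) / f"analyze.{epoch}"
    return [
        str(analyze_dir / name)
        for name in required
        if not (analyze_dir / name).is_file()
    ]


def require_analyze_outputs(workdir, epoch):
    """Fail fast, before any volume generation, if analyze was not run."""
    missing = missing_analyze_outputs(workdir, epoch)
    if missing:
        raise FileNotFoundError(
            f"Run `cryodrgn analyze {workdir} {epoch}` first; missing: "
            + ", ".join(missing)
        )
```

Calling `require_analyze_outputs("test/00_vae_80/", 21)` at the top of a wrapper script would surface the problem in seconds rather than after the roughly ten minutes of volume generation seen in the log.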

FYI I've been testing out the landscape analysis tool more and I'm thinking about updating the defaults to -n 500 for the volume sketch (I'm running into a lot of outliers using the default -n 1000 for some of the messier datasets I'm testing, requiring a lot of further filtering) and --linkage ward for the hierarchical clustering (also less sensitive to outliers).

The rationale for these settings depends on your use case (filtering vs. final analysis), but I would recommend -n 500 and trying out both --linkage ward and --linkage average. The clustering results will be saved in separate directories so you don't have to worry about overwriting and you can use --skip-vol and --skip-mask when rerunning cryodrgn analyze_landscape to make it extra fast.
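As a rough illustration of that workflow, the sketch below builds one command per linkage and adds `--skip-vol` and `--skip-mask` only to the reruns. It uses only the flags named in this thread; the helper name `landscape_commands` is hypothetical, and it assumes the first run has already produced the sketched volumes and mask for the same `-n` value.

```python
import shlex


def landscape_commands(workdir, epoch, downsample, apix,
                       n_sketch=500, linkages=("ward", "average")):
    """Build argv lists for one analyze_landscape run per linkage.

    The first run generates volumes and the mask; later runs reuse them
    via --skip-vol and --skip-mask (an assumption about rerun behavior
    based on the advice above).
    """
    base = ["cryodrgn", "analyze_landscape", str(workdir), str(epoch),
            "-d", str(downsample), "--Apix", str(apix),
            "-n", str(n_sketch)]
    commands = []
    for i, linkage in enumerate(linkages):
        cmd = base + ["--linkage", linkage]
        if i > 0:
            cmd += ["--skip-vol", "--skip-mask"]
        commands.append(cmd)
    return commands


# Print the commands for inspection; pass each list to subprocess.run
# to actually execute them.
for cmd in landscape_commands("test/00_vae_80/", 21, 80, 3.32):
    print(shlex.join(cmd))
```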

zhonge commented 1 year ago

Just another usage note for cryodrgn analyze_landscape -- you should inspect the automatically generated mask.mrc in the output directory and make sure it covers the particle, including the heterogeneous parts of interest. You can adjust the automated masking with --thresh and --dilate options or provide your own mask with --mask. Note any user-provided mask is binarized (converted to 0/1s) and is used to select which voxels to include in the downstream analysis.
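To make the binarization step concrete, here is a small NumPy sketch of how a soft mask could be converted to 0/1 and then used to pick voxels from each sketched volume. It assumes any voxel value greater than zero counts as inside the mask; the exact rule cryodrgn applies is not stated here, and the function names are hypothetical.

```python
import numpy as np


def binarize_mask(mask):
    """Convert a (possibly soft-edged) mask to 0/1.

    Assumption: any value > 0 is treated as inside the mask.
    """
    return (np.asarray(mask) > 0).astype(np.uint8)


def masked_voxels(volumes, mask):
    """Return an (n_volumes, n_selected_voxels) array of in-mask values."""
    selector = binarize_mask(mask).astype(bool)
    return np.stack([np.asarray(v)[selector] for v in volumes])


# Toy example: a 2x2x2 mask with one soft-edge voxel and one solid voxel.
mask = np.zeros((2, 2, 2))
mask[0, 0, 0] = 0.3
mask[1, 1, 1] = 1.0
volumes = [np.arange(8.0).reshape(2, 2, 2), np.ones((2, 2, 2))]
features = masked_voxels(volumes, mask)
print(features.shape)
```

Voxels outside the mask never reach the downstream analysis in this sketch, which is why a mask that clips the heterogeneous region would hide exactly the variability of interest.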

I usually run cryodrgn analyze_landscape once, and if I want to update the mask, open some of the sketched volumes, e.g. kmeans500/vol_*99.mrc, and pick a threshold in ChimeraX.

olibclarke commented 1 year ago

This is very handy advice, thanks Ellen! I wonder if a non-default mask is supplied, maybe the mask name should be added to the subdirectory name in the same way that the linkage type and number of clusters are? Currently it complains that the output files exist if I try a second mask, and there is no obvious way to specify a new output subdirectory (only the base analyze_landscape job directory).
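One way the suggestion could look, as a sketch only: the subdirectory name gains a suffix derived from the mask filename when a custom mask is given. The `clustering_<linkage>_<n>` pattern and the `_mask-` suffix are hypothetical illustrations and do not describe cryodrgn's actual directory layout.

```python
from pathlib import Path


def clustering_subdir(linkage, n_clusters, mask=None):
    """Build an output subdirectory name that is unique per mask.

    Hypothetical naming scheme: the mask file's stem is appended so that
    runs with different masks do not collide.
    """
    name = f"clustering_{linkage}_{n_clusters}"
    if mask is not None:
        name += f"_mask-{Path(mask).stem}"
    return name


print(clustering_subdir("ward", 10))
print(clustering_subdir("ward", 10, mask="masks/tight_mask.mrc"))
```

Two masks that share a filename in different folders would still collide under this scheme, so a real implementation might prefer an explicit output-directory option.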