`abinit_homo` training not converging for EMPIAR-10028?

ff98li commented 1 year ago

Dear all,

Thanks for sharing the awesome work! I have been exploring the latest version of CryoDRGN2 recently and tried to perform homogeneous reconstruction using abinit_homo for EMPIAR-10028. However, after plotting the loss, it appears that training did not converge after 60 epochs:

[image: loss]

The periodic pattern in the loss seems to coincide with the pose search update every 5 epochs reported in the paper, but I'm not 100% certain that the periodic loss is attributable to the pose search steps. The resulting final volume from homogeneous reconstruction is below:

[image: volume]

Steps to reproduce

1. Download EMPIAR-10028 image stacks and ctf.pkl from here. (script I used)

2. Preprocessing and image downsampling:


for dis in $DATADIR/MRC_*
do
ls -d -1 $dis/* >> $DATADIR/mrc_list.txt
done
cryodrgn parse_ctf_csparc $DATADIR/cryosparc_P11_J4_003_particles.cs \
-o $DATADIR/ctf.pkl

cryodrgn downsample $DATADIR/mrc_list.txt -D 256 \
--max-threads 64 \
-o $DATADIR/particles.256.mrcs \
--chunk 50000

cryodrgn downsample $DATADIR/particles.256.txt -D 128 \
--max-threads 64 \
-o $DATADIR/particles.128.mrcs



[Output log](https://github.com/ff98li/files_dump/blob/master/cryodrgn_results/abinit_homo_10028/cryodrgn_preprocess_10028.out)

3. `abinit_homo` training with default hyperparameters (except that I used batch size of 32 and 60 epochs)

I also calculated the mean/median of the Frobenius norm of the difference between the output poses and the published poses, which appear to differ from the error reported in supplementary Table S4. ([script used](https://github.com/ff98li/utility_scripts/blob/main/cryoem/pose_err.py))
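For readers who want to reproduce this kind of metric, here is a minimal sketch of a per-image Frobenius-norm pose error. It assumes both sets of rotations are available as 3x3 matrices in the same image order. The function names and toy matrices are hypothetical and are not taken from the linked script or from cryoDRGN. The sketch applies no global alignment between the two pose sets, which ab initio results generally need before the numbers can be compared with a reference.

```python
import math
import statistics


def frobenius_diff(a, b):
    """Frobenius norm of the difference between two 3x3 matrices (nested lists)."""
    return math.sqrt(
        sum((a[i][j] - b[i][j]) ** 2 for i in range(3) for j in range(3))
    )


def pose_error_summary(predicted, reference):
    """Return (mean, median) of per-image Frobenius errors.

    Assumes predicted[i] and reference[i] describe the same image.
    """
    if len(predicted) != len(reference):
        raise ValueError("pose lists must have the same length")
    errors = [frobenius_diff(p, r) for p, r in zip(predicted, reference)]
    return statistics.mean(errors), statistics.median(errors)


# Toy data (hypothetical): the identity and a 90-degree rotation about z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

predicted = [identity, rot_z_90, identity]
reference = [identity, identity, identity]
mean_err, median_err = pose_error_summary(predicted, reference)
```

On array data, `np.linalg.norm(R_pred - R_ref, axis=(1, 2))` computes the same per-image norm. If the two pose files index images in different orders, the summary is meaningless, so it is worth confirming the ordering first.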

Your insights into this will be very much appreciated!

zhonge commented 1 year ago

I suspect the order of your images in the mrc_list.txt input, produced by this line:

ls -d -1 $dis/* >> $DATADIR/mrc_list.txt

does not match the order of the CTF parameters parsed from cryosparc_P11_J4_003_particles.cs. The .cs file parameters are in the same order as the .star file from the EMPIAR-10028 entry.
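As an illustration of how a listing order can drift from a reference order, here is a minimal sketch. It shows one common cause (lexicographic versus numeric sorting of names), which is not necessarily what happened in this case. The directory names and both helper functions are hypothetical. In practice the reference order would come from the .star or .cs file.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs numerically, so 'MRC_10' follows 'MRC_2'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def first_mismatch(listed, reference):
    """Return the first index where two orderings disagree, or None if identical."""
    for i, (a, b) in enumerate(zip(listed, reference)):
        if a != b:
            return i
    if len(listed) != len(reference):
        return min(len(listed), len(reference))
    return None


# Hypothetical directory names, for illustration only.
dirs = ["MRC_2", "MRC_10", "MRC_0", "MRC_1"]

lexicographic = sorted(dirs)           # plain string order, as a sorted listing gives
natural = sorted(dirs, key=natural_key)

mismatch_at = first_mismatch(lexicographic, natural)
```

Running `first_mismatch` on the generated image list and on the file names in the metadata file, before downsampling, is a cheap way to catch this kind of misordering.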

ff98li commented 1 year ago

Thank you very much for your reply!

The issue indeed arose from mismatched CTF parameters and images. The reconstruction reaches a much better resolution after redoing the downsampling with the correct image-CTF order:

D = 128, 60 epochs: [image: d=128]

D = 360, 30 epochs: [image: d=360]

Training loss: [image: loss]

I notice that although the quality of the reconstructed volumes improves over the course of training, the loss does not appear to decrease. I wonder whether the training has in fact already converged at this point, and whether the persistently non-decreasing loss of homogeneous reconstruction is inherent to the heterogeneity of real cryo-EM data.

Following up on my initial mistake of mismatched CTF and image indices, I have two questions:

1. Do the predicted poses in the model's output pose.pkl have the same image order as the pkl parsed from the .cs or .star file, since the model sorts poses by image indices after each training batch?
2. I notice that to perform a homogeneous reconstruction with known poses (https://github.com/zhonge/cryodrgn/blob/f8ca250d877cf1fc61a8c0ce8e1d2b4f5e802be9/cryodrgn/commands/abinit_homo.py#L554-L570), the model loads the input poses only if I already have a pretrained model checkpoint. Is there a way to train the volume decoder with input poses without loading a checkpoint first? (This may be related to my first question, as I'm not sure whether the order of the parsed pose pkl is consistent with that of the model's output poses.)

Thank you again for your reply! Your insights are very much appreciated.

zhonge commented 1 year ago

1. Yes.
2. You can train a decoder model with cryodrgn train_nn using your desired architecture and input poses. Then you can load that model in cryodrgn abinit_homo to refine your poses. Make sure to use --domain hartley in cryodrgn train_nn (the default for cryodrgn train_nn is --domain fourier).

ml-struct-bio / cryodrgn

`abinit_homo` training not converging for EMPIAR-10028? #211
