ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu
GNU General Public License v3.0

RuntimeError: KLD is nan #302

Open JunhoeK2 opened 1 year ago

JunhoeK2 commented 1 year ago

Hello,

I am trying to run cryodrgn on my own dataset of 532,539 particles, which have been largely cleaned and globally aligned in CryoSPARC. The original box size was 256, downsampled to 128 for training, but the run fails with 'RuntimeError: KLD is nan', as shown at the end of the log below. I confirmed that the pose/CTF conversion was done correctly by checking the backprojected map. I also see multiple warning messages, but I am not sure whether they are directly related to the error.
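For context, the downsampling and the backprojection check correspond to the standard cryodrgn commands, roughly along these lines (particles_256.mrcs is a placeholder name for the original stack; exact flags may vary between cryodrgn versions):

cryodrgn downsample particles_256.mrcs -D 128 -o particles_128.mrcs
cryodrgn backproject_voxel particles_128.mrcs --poses pose.pkl --ctf ctf.pkl -o backproject.mrc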

The command I used for training is below: cryodrgn train_vae particles_128.mrcs --ctf ctf.pkl --poses pose.pkl --zdim 8 -n 25 --enc-dim 256 --enc-layers 3 --dec-dim 256 --dec-layers 3 --multigpu -o training03_128_vae

Does anyone have experience with this issue or know how to solve it?

(INFO) (dataset.py) (05-Sep-23 17:56:58) Loaded 532539 128x128 images
(INFO) (dataset.py) (05-Sep-23 17:56:58) Windowing images with radius 0.85
(INFO) (dataset.py) (05-Sep-23 17:57:00) Computing FFT
(INFO) (dataset.py) (05-Sep-23 17:57:00) Spawning 16 processes
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/dataset.py:191: RuntimeWarning: overflow encountered in cast
  particles = np.asarray(
(INFO) (dataset.py) (05-Sep-23 17:59:55) Symmetrizing image data
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:181: RuntimeWarning: overflow encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:181: RuntimeWarning: invalid value encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:215: RuntimeWarning: overflow encountered in reduce
  arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:215: RuntimeWarning: invalid value encountered in reduce
  arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
(INFO) (dataset.py) (05-Sep-23 18:00:44) Normalized HT by 0 +/- nan
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Loading ctf params from /goliath/processing/kimjunh/cryodrgn/ctf.pkl
(INFO) (ctf.py) (05-Sep-23 18:00:46) Image size (pix) : 128
(INFO) (ctf.py) (05-Sep-23 18:00:46) A/pix : 3.319999933242798
(INFO) (ctf.py) (05-Sep-23 18:00:46) DefocusU (A) : 10656.8232421875
(INFO) (ctf.py) (05-Sep-23 18:00:46) DefocusV (A) : 9715.7421875
(INFO) (ctf.py) (05-Sep-23 18:00:46) Dfang (deg) : 27.725799560546875
(INFO) (ctf.py) (05-Sep-23 18:00:46) voltage (kV) : 300.0
(INFO) (ctf.py) (05-Sep-23 18:00:46) cs (mm) : 2.700000047683716
(INFO) (ctf.py) (05-Sep-23 18:00:46) w : 0.10000000149011612
(INFO) (ctf.py) (05-Sep-23 18:00:46) Phase shift (deg) : 0.0
(INFO) (train_vae.py) (05-Sep-23 18:00:46) HetOnlyVAE(
  (encoder): ResidLinearMLP(
    (main): Sequential(
      (0): MyLinear(in_features=12852, out_features=256, bias=True)
      (1): ReLU()
      (2): ResidLinear(
        (linear): MyLinear(in_features=256, out_features=256, bias=True)
      )
      (3): ReLU()
      (4): ResidLinear(
        (linear): MyLinear(in_features=256, out_features=256, bias=True)
      )
      (5): ReLU()
      (6): ResidLinear(
        (linear): MyLinear(in_features=256, out_features=256, bias=True)
      )
      (7): ReLU()
      (8): MyLinear(in_features=256, out_features=16, bias=True)
    )
  )
  (decoder): FTPositionalDecoder(
    (decoder): ResidLinearMLP(
      (main): Sequential(
        (0): MyLinear(in_features=392, out_features=256, bias=True)
        (1): ReLU()
        (2): ResidLinear(
          (linear): MyLinear(in_features=256, out_features=256, bias=True)
        )
        (3): ReLU()
        (4): ResidLinear(
          (linear): MyLinear(in_features=256, out_features=256, bias=True)
        )
        (5): ReLU()
        (6): ResidLinear(
          (linear): MyLinear(in_features=256, out_features=256, bias=True)
        )
        (7): ReLU()
        (8): MyLinear(in_features=256, out_features=2, bias=True)
      )
    )
  )
)
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 3790354 parameters in model
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 3491856 parameters in encoder
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 298498 parameters in decoder
(WARNING) (train_vae.py) (05-Sep-23 18:00:46) Warning: Masked input image dimension is not a mutiple of 8 -- AMP training speedup is not optimized
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Using 4 GPUs!
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Increasing batch size to 32
(INFO) (train_vae.py) (05-Sep-23 18:00:56) tensor([nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16, grad_fn=)
(INFO) (train_vae.py) (05-Sep-23 18:00:56) tensor([nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16, grad_fn=)
Traceback (most recent call last):
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/main.py", line 72, in main
    args.func(args)
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 836, in main
    loss, gen_loss, kld = train_batch(
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 331, in train_batch
    loss, gen_loss, kld = loss_function(
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 425, in loss_function
    raise RuntimeError("KLD is nan")
RuntimeError: KLD is nan

MatthewFu2001 commented 1 month ago

Hello! Have you solved this issue? I ran into the same "KLD is nan" runtime error when trying heterogeneous reconstruction on the tutorial dataset EMPIAR-10049.

michal-g commented 4 weeks ago

Hi all, we are still trying to track down the root cause of this issue; see (potentially) related threads such as #136, #18, and #346. In the meantime, can you try running without the --multigpu flag, and/or with a smaller training batch size (-b 1 or -b 2), to make sure this is not an issue with running out of memory?
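For example, a sketch based on the command in the original post, with --multigpu dropped and -b 1 added (the output directory name is just a placeholder):

cryodrgn train_vae particles_128.mrcs --ctf ctf.pkl --poses pose.pkl --zdim 8 -n 25 --enc-dim 256 --enc-layers 3 --dec-dim 256 --dec-layers 3 -b 1 -o training03_128_vae_b1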

Also, can you try changing the number of latent dimensions (--zdim) to see if this is specific to this particular model, or potentially an issue with the input data?
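As a starting point for the data check, here is a minimal sketch (not part of cryodrgn itself) that scans the input stack for non-finite or extreme pixel values, which would be consistent with the "overflow encountered in cast" and "Normalized HT by 0 +/- nan" lines in the log above. It assumes the mrcfile package is installed and uses the particles_128.mrcs filename from the original command:

```python
# Sketch: scan an .mrcs particle stack for non-finite / extreme values.
# Assumes the `mrcfile` package; the filename matches the original command.
import numpy as np
import mrcfile

STACK = "particles_128.mrcs"
CHUNK = 10_000  # particles per chunk, to keep memory use modest

with mrcfile.mmap(STACK, mode="r", permissive=True) as mrc:
    data = mrc.data  # shape (n_particles, D, D), read lazily from disk
    print("dtype:", data.dtype, "shape:", data.shape)

    n_bad = 0
    vmin, vmax = np.inf, -np.inf
    for start in range(0, data.shape[0], CHUNK):
        chunk = np.asarray(data[start:start + CHUNK], dtype=np.float64)
        finite = np.isfinite(chunk)
        n_bad += int(chunk.size - np.count_nonzero(finite))
        if finite.any():
            vmin = min(vmin, float(chunk[finite].min()))
            vmax = max(vmax, float(chunk[finite].max()))

    print("non-finite pixels:", n_bad)
    print("finite min / max :", vmin, vmax)
```

If this reports non-finite pixels or a very large dynamic range, that would explain the overflow warnings seen during FFT/normalization, and re-exporting or re-downsampling the stack before training may be worth trying.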