Open JunhoeK2 opened 1 year ago
Hello! Have you solved this issue? I encountered the same runtime error, "KLD is nan", when trying heterogeneous reconstruction on the tutorial dataset EMPIAR-10049.
Hi all, we are still trying to track down the root cause of this issue; see (potentially) related threads such as #136, #18, and #346. In the meantime, can you try running without the --multigpu flag, and/or with a smaller training batch size (-b 1 or -b 2), to make sure this is not an issue with running out of memory?

Also, can you try changing the number of latent dimensions (--zdim) to see if this is specific to this particular model, or potentially an issue with the input data?
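For context on why the error fires at all: the VAE loss includes the standard Gaussian KL term, and any NaN in the encoder's predicted mean/log-variance propagates straight into it. A minimal NumPy sketch of the generic math (not cryodrgn's actual code; the function name `kld_gaussian` is illustrative):

```python
import numpy as np

def kld_gaussian(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), averaged over the batch:
    0.5 * sum(exp(logvar) + mu^2 - 1 - logvar) -- the standard VAE regularizer."""
    per_image = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    return float(np.mean(per_image))

# Healthy latent statistics give a finite KLD...
mu = np.zeros((4, 8))
logvar = np.zeros((4, 8))
print(kld_gaussian(mu, logvar))            # 0.0

# ...but a single non-finite image makes mu/logvar NaN, and the NaN
# propagates into the loss, which is exactly what "KLD is nan" guards.
mu[0, 0] = np.nan
print(np.isnan(kld_gaussian(mu, logvar)))  # True
```

So the error is usually a symptom: something upstream (the data, the normalization, or a mixed-precision overflow) has already produced NaNs by the time the loss is computed.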
Hello,
I am trying to run cryodrgn on my own dataset of 532,539 particles, which were cleaned and globally aligned in CryoSPARC. The original box size was 256, downsampled to 128 for training. Training fails with 'RuntimeError: KLD is nan', as shown at the end of the log below. I confirmed the conversion was done correctly by checking the backprojected map. I also see multiple warning messages, but I am not sure whether they are directly related to the error.
The command that I used is:

cryodrgn train_vae particles_128.mrcs --ctf ctf.pkl --poses pose.pkl --zdim 8 -n 25 --enc-dim 256 --enc-layers 3 --dec-dim 256 --dec-layers 3 --multigpu -o training03_128_vae
Does anyone have experience with this issue or know how to solve it?
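One thing worth checking before retraining: the overflow warnings during dataset loading (below) suggest extreme or non-finite pixel values in the stack itself. A hedged sanity check with plain NumPy (the `stack_stats` helper is illustrative, not part of cryodrgn, and the `mrcfile` read in the comment assumes that library is installed):

```python
import numpy as np

def stack_stats(arr):
    """Summarize a particle stack: extreme values or non-finite pixels
    here will overflow when cast to a smaller float dtype downstream."""
    a = np.asarray(arr, dtype=np.float64)
    finite = a[np.isfinite(a)]
    return {
        "min": float(a.min()),
        "max": float(a.max()),
        "n_nonfinite": int(np.count_nonzero(~np.isfinite(a))),
        # float16's largest finite value is 65504
        "fits_float16": bool(np.all(np.abs(finite) <= 65504.0)),
    }

# Load the stack however you normally would, e.g. (hypothetical path):
#   import mrcfile
#   arr = mrcfile.mmap("particles_128.mrcs").data
arr = np.random.default_rng(0).normal(size=(10, 128, 128))  # stand-in data
print(stack_stats(arr))
```

If `n_nonfinite` is nonzero or `fits_float16` is False, the NaNs are coming from the input data (or its dynamic range) rather than from the model.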
(INFO) (dataset.py) (05-Sep-23 17:56:58) Loaded 532539 128x128 images
(INFO) (dataset.py) (05-Sep-23 17:56:58) Windowing images with radius 0.85
(INFO) (dataset.py) (05-Sep-23 17:57:00) Computing FFT
(INFO) (dataset.py) (05-Sep-23 17:57:00) Spawning 16 processes
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/dataset.py:191: RuntimeWarning: overflow encountered in cast
  particles = pp.asarray(
(INFO) (dataset.py) (05-Sep-23 17:59:55) Symmetrizing image data
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:181: RuntimeWarning: overflow encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:181: RuntimeWarning: invalid value encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:215: RuntimeWarning: overflow encountered in reduce
  arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/numpy/core/_methods.py:215: RuntimeWarning: invalid value encountered in reduce
  arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
(INFO) (dataset.py) (05-Sep-23 18:00:44) Normalized HT by 0 +/- nan
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Loading ctf params from /goliath/processing/kimjunh/cryodrgn/ctf.pkl
(INFO) (ctf.py) (05-Sep-23 18:00:46) Image size (pix) : 128
(INFO) (ctf.py) (05-Sep-23 18:00:46) A/pix : 3.319999933242798
(INFO) (ctf.py) (05-Sep-23 18:00:46) DefocusU (A) : 10656.8232421875
(INFO) (ctf.py) (05-Sep-23 18:00:46) DefocusV (A) : 9715.7421875
(INFO) (ctf.py) (05-Sep-23 18:00:46) Dfang (deg) : 27.725799560546875
(INFO) (ctf.py) (05-Sep-23 18:00:46) voltage (kV) : 300.0
(INFO) (ctf.py) (05-Sep-23 18:00:46) cs (mm) : 2.700000047683716
(INFO) (ctf.py) (05-Sep-23 18:00:46) w : 0.10000000149011612
(INFO) (ctf.py) (05-Sep-23 18:00:46) Phase shift (deg) : 0.0
(INFO) (train_vae.py) (05-Sep-23 18:00:46) HetOnlyVAE(
  (encoder): ResidLinearMLP(
    (main): Sequential(
      (0): MyLinear(in_features=12852, out_features=256, bias=True)
      (1): ReLU()
      (2): ResidLinear(
        (linear): MyLinear(in_features=256, out_features=256, bias=True)
      )
      (3): ReLU()
      (4): ResidLinear(
        (linear): MyLinear(in_features=256, out_features=256, bias=True)
      )
      (5): ReLU()
      (6): ResidLinear(
        (linear): MyLinear(in_features=256, out_features=256, bias=True)
      )
      (7): ReLU()
      (8): MyLinear(in_features=256, out_features=16, bias=True)
    )
  )
  (decoder): FTPositionalDecoder(
    (decoder): ResidLinearMLP(
      (main): Sequential(
        (0): MyLinear(in_features=392, out_features=256, bias=True)
        (1): ReLU()
        (2): ResidLinear(
          (linear): MyLinear(in_features=256, out_features=256, bias=True)
        )
        (3): ReLU()
        (4): ResidLinear(
          (linear): MyLinear(in_features=256, out_features=256, bias=True)
        )
        (5): ReLU()
        (6): ResidLinear(
          (linear): MyLinear(in_features=256, out_features=256, bias=True)
        )
        (7): ReLU()
        (8): MyLinear(in_features=256, out_features=2, bias=True)
      )
    )
  )
)
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 3790354 parameters in model
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 3491856 parameters in encoder
(INFO) (train_vae.py) (05-Sep-23 18:00:46) 298498 parameters in decoder
(WARNING) (train_vae.py) (05-Sep-23 18:00:46) Warning: Masked input image dimension is not a mutiple of 8 -- AMP training speedup is not optimized
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Using 4 GPUs!
(INFO) (train_vae.py) (05-Sep-23 18:00:46) Increasing batch size to 32
(INFO) (train_vae.py) (05-Sep-23 18:00:56) tensor([nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0',
       dtype=torch.float16, grad_fn=)
(INFO) (train_vae.py) (05-Sep-23 18:00:56) tensor([nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0',
       dtype=torch.float16, grad_fn=)
Traceback (most recent call last):
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/bin/cryodrgn", line 8, in
    sys.exit(main())
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/main.py", line 72, in main
    args.func(args)
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 836, in main
    loss, gen_loss, kld = train_batch(
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 331, in train_batch
    loss, gen_loss, kld = loss_function(
  File "/goliath/sw/conda4.10/envs/cryodrgn-2.3/lib/python3.9/site-packages/cryodrgn/commands/train_vae.py", line 425, in loss_function
    raise RuntimeError("KLD is nan")
RuntimeError: KLD is nan
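A note on reading this log: the failure surfaces in the KLD check, but the damage happens earlier. The tensors printed just before the traceback are float16 (mixed-precision/AMP training), "overflow encountered in cast" fires while loading the particles, and "Normalized HT by 0 +/- nan" means the normalization statistics themselves were already NaN. A small demonstration of the mechanism (illustrative, not cryodrgn code):

```python
import numpy as np

# float16 tops out at 65504; casting anything larger overflows to inf...
x = np.float32(1e5).astype(np.float16)
print(x)  # inf

# ...and inf then poisons reductions: inf - inf = nan, which matches the
# "invalid value encountered in reduce" warnings and the nan normalization.
with np.errstate(invalid="ignore"):
    y = np.float16(np.inf) - np.float16(np.inf)
print(y)  # nan
```

If the input stack contains values outside the float16 range, possible mitigations are rescaling/renormalizing the particle stack, or running at full float32 precision if your cryodrgn version exposes an option to disable AMP (check cryodrgn train_vae --help).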