Open siayouyang opened 10 months ago
Just wanted to add, same issue here. 0.73A images of box size 600, downsampled to 256. Trained on 100 epochs with 1024 x 3 architecture with no error messages. Ran: cryodrgn analyze J319_vae256_1 99 --Apix 1.7109375
Error seemed to occur when trying to plot the UMAPs, and correspondingly, there are no UMAP outputs in the analyze directory, but the kmeans volumes and PC1/2 directories are present.
Error output:
(INFO) (eval_vol.py) (04-Feb-24 20:35:09) Finished in 0:02:37.054760
(INFO) (analyze.py) (04-Feb-24 20:35:09) Running UMAP...
(INFO) (analyze.py) (04-Feb-24 20:36:03) Generating plots...
Traceback (most recent call last):
File "/home/xyz/miniconda3/envs/cryodrgn/bin/cryodrgn", line 8, in
Thanks!
Is there any update here?
I trained successfully with cryodrgn3.1 and got the same error when UMAP is calculated during the analysis step.
.....(INFO) (eval_vol.py) (22-Apr-24 15:23:59) [-1.54320431 -0.55205655 -0.76349568 -1.33955109 -0.30333209 -0.23093295 -0.46337044 0.84459388]
(INFO) (eval_vol.py) (22-Apr-24 15:24:02) Finished in 0:00:27.171158
(INFO) (analyze.py) (22-Apr-24 15:24:02) Running UMAP...
/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/numba/np/ufunc/parallel.py:371: NumbaWarning: The TBB threading layer requires TBB version 2021 update 6 or later i.e., TBB_INTERFACE_VERSION >= 12060. Found TBB_INTERFACE_VERSION = 12050. The TBB threading layer is disabled.
warnings.warn(problem)
(WARNING) (utils.py) (22-Apr-24 15:29:17) Warning: analyze_00_vae192_k10/umap.pkl already exists. Overwriting.
(INFO) (analyze.py) (22-Apr-24 15:29:17) Generating plots...
Traceback (most recent call last):
File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/bin/cryodrgn", line 8, in
sys.exit(main_commands())
File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/command_line.py", line 65, in main_commands
_get_commands(
File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/command_line.py", line 60, in _get_commands
args.func(args)
File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/commands/analyze.py", line 443, in main
analyze_zN(
File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/commands/analyze.py", line 140, in analyze_zN
loss = analysis.parse_loss(f"{workdir}/run.log")
File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/analysis.py", line 29, in parse_loss
assert m is not None
AssertionError
Hi folks, sorry for the delay on this, I will look into debugging analysis.parse_loss for an upcoming release!
We have been doing test runs of analyze in the meantime without being able to replicate this error. If you still have the run.log files handy from within your output directories, would you mind sharing a representative sample of what is inside?
Using tonyl4's case as an example, the following command prints the log file lines containing the loss values that the function is searching for when it errors out:
grep -E 'total\sloss' J319_vae256_1/run.log
In the meantime, I have created a hotfix release that avoids adding missing total loss values to the list of plotted values, which should at least bypass this particular error:
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ 'cryodrgn>3.3.1' --pre
Thank you very much. Just wanted to let you know this hotfix works. Also in my experience, this error is more likely to show up if I run more than one training in a directory or if I continue a run with --load.
If the analyze command is completing without any other errors once this hotfix is applied, that strongly suggests that something is getting corrupted in the run.log file.
in my experience, this error is more likely to show up if I run more than one training in a directory or if I continue a run with --load
This strongly suggests that there is a problem with the run.log being overwritten. Perhaps this occurs when more than one running job is trying to write to the same directory, and thus the same log file? I will try some more things on my end to try to isolate the issue, and find a way for cryoDRGN to avoid this situation in the first place.
Hi! I'm having the same issue.
I noticed in my run.log file that epochs 1-7 have proper values for average gen loss and total loss:
# [Train Epoch: 7/25] [944000/945882 particles] gen loss=0.661992, kld=20.723625, beta=0.125000, loss=0.662193
# =====> Epoch: 7 Average gen loss = 0.63794, KLD = 19.547597, total loss = 0.638130; Finished in 2:25:06.440614
However, for some reason, epochs 8-25 have nan values for those two:
# [Train Epoch: 8/25] [944000/945882 particles] gen loss=0.640565, kld=19.525343, beta=0.125000, loss=0.640754
# =====> Epoch: 8 Average gen loss = nan, KLD = inf, total loss = nan; Finished in 2:23:58.590436
I'm not sure if that's related.
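One way to check whether a given run.log is affected is to scan the epoch summary lines for non-finite values. Below is a minimal Python sketch, assuming the summary lines follow the format quoted above. The regular expression and the find_bad_epochs name are illustrative and are not part of cryoDRGN.

```python
import math
import re

# Assumption: epoch summary lines look like the ones quoted in this thread.
# This pattern is written for illustration, not copied from cryoDRGN.
SUMMARY = re.compile(
    r"Epoch:\s*(\d+)\s+Average gen loss = (\S+), KLD = (\S+), total loss = (\S+);"
)


def find_bad_epochs(log_text):
    """Return the epoch numbers whose summary line holds a NaN or infinite value."""
    bad = []
    for line in log_text.splitlines():
        m = SUMMARY.search(line)
        if m is None:
            continue  # not an epoch summary line
        try:
            values = [float(v) for v in m.groups()[1:]]
        except ValueError:
            values = [math.nan]  # treat unparseable numbers as bad
        if not all(math.isfinite(v) for v in values):
            bad.append(int(m.group(1)))
    return bad


if __name__ == "__main__":
    import sys

    # Usage: python find_bad_epochs.py path/to/run.log
    with open(sys.argv[1]) as fh:
        print(find_bad_epochs(fh.read()))
```

For the two summary lines quoted above (epochs 7 and 8), this reports only epoch 8.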
Just following up. I fixed the problem by fixing run.log. I wrote a Python script that uses the batch info listed in the log file to calculate the average gen loss and total loss for the epochs where that info was missing. I tried to compute these the same way as in cryodrgn's train_vae.py. Then, I just replaced the epoch summary lines with new ones where those values are filled in. Happy to share this script if others run into this problem in the future. Anyway, with the fixed run.log in place, cryodrgn analyze works properly.
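For anyone wanting to try the same kind of repair, here is a rough Python sketch of the idea. It is not the script mentioned above. It assumes the batch and summary line formats quoted earlier in this thread, and it averages only the batch lines that appear in the log, so the filled-in values are an approximation of what training would have reported. The repair_log name and both patterns are hypothetical.

```python
import math
import re
from collections import defaultdict

# Assumption: batch and summary lines match the formats quoted in this thread.
BATCH = re.compile(
    r"\[Train Epoch:\s*(\d+)/\d+\].*?gen loss=([^,\s]+), kld=[^,\s]+, "
    r"beta=[^,\s]+, loss=([^,\s]+)"
)
SUMMARY = re.compile(
    r"(Epoch:\s*(\d+)\s+Average gen loss = )(\S+)(, KLD = \S+, total loss = )(\S+)(;)"
)


def _to_float(text):
    try:
        return float(text)
    except ValueError:
        return math.nan


def repair_log(log_text):
    """Fill non-finite epoch averages from the logged batch lines (approximate)."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # epoch -> [gen sum, loss sum, count]
    for m in BATCH.finditer(log_text):
        gen, loss = _to_float(m.group(2)), _to_float(m.group(3))
        if math.isfinite(gen) and math.isfinite(loss):
            entry = totals[int(m.group(1))]
            entry[0] += gen
            entry[1] += loss
            entry[2] += 1

    def fix(m):
        epoch = int(m.group(2))
        if math.isfinite(_to_float(m.group(5))) or epoch not in totals:
            return m.group(0)  # valid line, or nothing to rebuild it from
        gen_sum, loss_sum, n = totals[epoch]
        return (
            f"{m.group(1)}{gen_sum / n:.6f}{m.group(4)}{loss_sum / n:.6f}{m.group(6)}"
        )

    return SUMMARY.sub(fix, log_text)
```

Writing the result to a copy of run.log and keeping the original untouched seems the safer choice, since the rebuilt values are estimates. The KLD field is left as logged because the traceback only involves the total loss.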
We should put a try/except clause around parsing the run.log file for the loss, so the rest of cryodrgn analyze still completes successfully. (We can think about more robust ways to keep track of logging the training loss in the longer term.)
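A minimal sketch of what such a guard could look like, in Python. The strict_parse_loss function below is a hypothetical stand-in written for this example, not cryoDRGN's analysis.parse_loss, and loss_or_none is likewise an illustrative name. The only detail taken from this thread is that the real parser fails with an AssertionError when a summary line has no numeric total loss.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumption: a numeric total loss looks like "total loss = 0.638130".
TOTAL_LOSS = re.compile(r"total\sloss\s=\s(\d+\.\d+)")


def strict_parse_loss(log_text):
    """Hypothetical strict parser: fails if any epoch summary lacks a numeric loss."""
    losses = []
    for line in log_text.splitlines():
        if "====" not in line:
            continue  # only epoch summary lines carry the total loss
        m = TOTAL_LOSS.search(line)
        if m is None:
            raise ValueError(f"no numeric total loss in line: {line!r}")
        losses.append(float(m.group(1)))
    return losses


def loss_or_none(log_text):
    """Return the parsed losses, or None so the caller can skip the loss plot."""
    try:
        return strict_parse_loss(log_text)
    except (AssertionError, ValueError) as err:
        logger.warning("Could not parse training loss, skipping loss plot: %s", err)
        return None
```

The caller would then draw the loss curve only when the result is not None and carry on with the UMAP and PCA plots either way.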
after training for 50 epochs successfully
(INFO) (train_vae.py) (03-Feb-24 12:24:59) # =====> Epoch: 50 Average gen loss = nan, KLD = 4174575710066.985352, total loss = nan; Finished in 1:11:20.645890
(INFO) (train_vae.py) (03-Feb-24 12:24:59) Evaluating z
(INFO) (train_vae.py) (03-Feb-24 12:26:15) Training complete
(INFO) (train_vae.py) (03-Feb-24 12:26:15) Evaluating z
(INFO) (train_vae.py) (03-Feb-24 12:27:31) Finished in 2 days, 12:38:56.559360 (1:12:46.731187 per epoch)
run: nohup cryodrgn analyze 00_J602_vae256 49 --Apix 1.0825 &
error messages:
(INFO) (eval_vol.py) (03-Feb-24 15:32:03) Finished in 0:12:23.313976
(INFO) (analyze.py) (03-Feb-24 15:32:03) Running UMAP...
(INFO) (analyze.py) (03-Feb-24 15:35:14) Generating plots...
Traceback (most recent call last):
File "/home/xyy/anaconda3/envs/cryodrgn/bin/cryodrgn", line 8, in
sys.exit(main())
File "/home/xyy/anaconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/main.py", line 74, in main
args.func(args)
File "/home/xyy/anaconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 443, in main
analyze_zN(
File "/home/xyy/anaconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 140, in analyze_zN
loss = analysis.parse_loss(f"{workdir}/run.log")
File "/home/xyy/anaconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/analysis.py", line 29, in parse_loss
assert m is not None
AssertionError