ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu
GNU General Public License v3.0

analyze error: assert m is not None #346

Open siayouyang opened 7 months ago

siayouyang commented 7 months ago

After training for 50 epochs successfully:

(INFO) (train_vae.py) (03-Feb-24 12:24:59) # =====> Epoch: 50 Average gen loss = nan, KLD = 4174575710066.985352, total loss = nan; Finished in 1:11:20.645890
(INFO) (train_vae.py) (03-Feb-24 12:24:59) Evaluating z
(INFO) (train_vae.py) (03-Feb-24 12:26:15) Training complete
(INFO) (train_vae.py) (03-Feb-24 12:26:15) Evaluating z
(INFO) (train_vae.py) (03-Feb-24 12:27:31) Finished in 2 days, 12:38:56.559360 (1:12:46.731187 per epoch)

run: nohup cryodrgn analyze 00_J602_vae256 49 --Apix 1.0825 &

Error messages:

(INFO) (eval_vol.py) (03-Feb-24 15:32:03) Finished in 0:12:23.313976
(INFO) (analyze.py) (03-Feb-24 15:32:03) Running UMAP...
(INFO) (analyze.py) (03-Feb-24 15:35:14) Generating plots...
Traceback (most recent call last):
  File "/home/xyy/anaconda3/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/home/xyy/anaconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/main.py", line 74, in main
    args.func(args)
  File "/home/xyy/anaconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 443, in main
    analyze_zN(
  File "/home/xyy/anaconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/commands/analyze.py", line 140, in analyze_zN
    loss = analysis.parse_loss(f"{workdir}/run.log")
  File "/home/xyy/anaconda3/envs/cryodrgn/lib/python3.9/site-packages/cryodrgn/analysis.py", line 29, in parse_loss
    assert m is not None
AssertionError

tonyl4 commented 7 months ago

Just wanted to add that I have the same issue. My images are 0.73 Å/pixel with a box size of 600, downsampled to 256. I trained for 100 epochs with a 1024 x 3 architecture and saw no error messages. Ran: cryodrgn analyze J319_vae256_1 99 --Apix 1.7109375

Error seemed to occur when trying to plot the UMAPs, and correspondingly, there are no UMAP outputs in the analyze directory, but the kmeans volumes and PC1/2 directories are present.

Error output:

(INFO) (eval_vol.py) (04-Feb-24 20:35:09) Finished in 0:02:37.054760
(INFO) (analyze.py) (04-Feb-24 20:35:09) Running UMAP...
(INFO) (analyze.py) (04-Feb-24 20:36:03) Generating plots...
Traceback (most recent call last):
  File "/home/xyz/miniconda3/envs/cryodrgn/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/home/xyz/miniconda3/envs/cryodrgn/lib/python3.10/site-packages/cryodrgn/main.py", line 74, in main
    args.func(args)
  File "/home/xyz/miniconda3/envs/cryodrgn/lib/python3.10/site-packages/cryodrgn/commands/analyze.py", line 443, in main
    analyze_zN(
  File "/home/xyz/miniconda3/envs/cryodrgn/lib/python3.10/site-packages/cryodrgn/commands/analyze.py", line 140, in analyze_zN
    loss = analysis.parse_loss(f"{workdir}/run.log")
  File "/home/xyz/miniconda3/envs/cryodrgn/lib/python3.10/site-packages/cryodrgn/analysis.py", line 29, in parse_loss
    assert m is not None
AssertionError

Thanks!

MunozHernandez commented 4 months ago

Is there any update here?

I trained successfully with cryodrgn 3.1 and got the same error when UMAP is calculated during the analysis step.

.....
(INFO) (eval_vol.py) (22-Apr-24 15:23:59) [-1.54320431 -0.55205655 -0.76349568 -1.33955109 -0.30333209 -0.23093295 -0.46337044 0.84459388]
(INFO) (eval_vol.py) (22-Apr-24 15:24:02) Finished in 0:00:27.171158
(INFO) (analyze.py) (22-Apr-24 15:24:02) Running UMAP...
/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/numba/np/ufunc/parallel.py:371: NumbaWarning: The TBB threading layer requires TBB version 2021 update 6 or later i.e., TBB_INTERFACE_VERSION >= 12060. Found TBB_INTERFACE_VERSION = 12050. The TBB threading layer is disabled.
  warnings.warn(problem)
(WARNING) (utils.py) (22-Apr-24 15:29:17) Warning: analyze_00_vae192_k10/umap.pkl already exists. Overwriting.
(INFO) (analyze.py) (22-Apr-24 15:29:17) Generating plots...
Traceback (most recent call last):
  File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/bin/cryodrgn", line 8, in <module>
    sys.exit(main_commands())
  File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/command_line.py", line 65, in main_commands
    _get_commands(
  File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/command_line.py", line 60, in _get_commands
    args.func(args)
  File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/commands/analyze.py", line 443, in main
    analyze_zN(
  File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/commands/analyze.py", line 140, in analyze_zN
    loss = analysis.parse_loss(f"{workdir}/run.log")
  File "/usr/struct_bio/anaconda3/envs/cryodrgn-3.2.0-beta/lib/python3.10/site-packages/cryodrgn/analysis.py", line 29, in parse_loss
    assert m is not None
AssertionError

michal-g commented 4 months ago

Hi folks, sorry for the delay on this, I will look into debugging analysis.parse_loss for an upcoming release!

We have been running tests of analyze in the meantime without being able to replicate this error. If you still have the run.log files handy from within your output directories, would you mind sharing a representative sample of what is inside?

Using tonyl4's case as an example, the following command prints the log file lines containing the loss values that the function is searching for when it errors out:

grep -E 'total\sloss' J319_vae256_1/run.log
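For anyone without grep at hand, the same search can be sketched in Python. This is only an illustration: `grep_total_loss` and `sample` are hypothetical helper names made up here, not part of cryoDRGN, and the pattern is simply the one from the grep command above.

```python
import re

# Same pattern as the grep command above.
TOTAL_LOSS = re.compile(r"total\sloss")


def grep_total_loss(log_path):
    """Return every line of the log file that mentions 'total loss'."""
    with open(log_path) as log_file:
        return [line.rstrip("\n") for line in log_file if TOTAL_LOSS.search(line)]


def sample(lines, n=3):
    """Return the first and last n items, or everything if the list is short."""
    if len(lines) <= 2 * n:
        return list(lines)
    return list(lines[:n]) + list(lines[-n:])
```

Calling `sample(grep_total_loss("J319_vae256_1/run.log"))` would give a short excerpt from the start and end of training that is easy to paste into a reply.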

In the meantime, I have created a hotfix release that avoids adding missing total loss values to the list of plotted values, which should at least bypass this particular error:

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ 'cryodrgn>3.3.1' --pre
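The behaviour described for the hotfix (leaving missing total loss values out of the list of plotted values) might look roughly like the sketch below. It is not the code from the hotfix release: the function name `parse_loss_tolerant`, the regular expression, and the `=====>` filter are all assumptions based on the epoch summary lines quoted in this thread.

```python
import re

# Assumed pattern: a total loss written as a plain decimal number.
NUMERIC_TOTAL_LOSS = re.compile(r"total\sloss\s=\s(\d+\.\d+)")


def parse_loss_tolerant(log_text):
    """Collect total loss values from epoch summary lines, skipping unparsable ones.

    Returns (losses, skipped), where skipped counts the summary lines
    that had no numeric total loss.
    """
    losses = []
    skipped = 0
    for line in log_text.splitlines():
        if "=====>" not in line:
            continue  # only epoch summary lines are considered
        match = NUMERIC_TOTAL_LOSS.search(line)
        if match is None:
            skipped += 1  # skip the line instead of asserting on it
            continue
        losses.append(float(match.group(1)))
    return losses, skipped
```

Returning the number of skipped lines alongside the values keeps the problem visible, so a plot with fewer points than epochs does not go unnoticed.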

hamid13r commented 4 months ago

Thank you very much. Just wanted to let you know this hotfix works. Also, in my experience, this error is more likely to show up if I run more than one training job in a directory or if I continue a run with --load.

michal-g commented 4 months ago

If the analyze command is completing without any other errors once this hotfix is applied, that strongly suggests that something is getting corrupted in the run.log file.

> in my experience, this error is more likely to show up if I run more than one training in a directory or if I continue a run with --load

This strongly suggests that there is a problem with the run.log being overwritten. Perhaps this occurs when more than one running job is trying to write to the same directory, and thus the same log file? I will try some more things on my end to isolate the issue and find a way for cryoDRGN to avoid this situation in the first place.

justmwest commented 1 month ago

Hi! I'm having the same issue.

I noticed in my run.log file that epochs 1-7 have proper values for average gen loss and total loss:

# [Train Epoch: 7/25] [944000/945882 particles] gen loss=0.661992, kld=20.723625, beta=0.125000, loss=0.662193
# =====> Epoch: 7 Average gen loss = 0.63794, KLD = 19.547597, total loss = 0.638130; Finished in 2:25:06.440614

However, for some reason, epochs 8-25 have nan values for those two:

# [Train Epoch: 8/25] [944000/945882 particles] gen loss=0.640565, kld=19.525343, beta=0.125000, loss=0.640754
# =====> Epoch: 8 Average gen loss = nan, KLD = inf, total loss = nan; Finished in 2:23:58.590436

I'm not sure if that's related.

justmwest commented 1 month ago

Just following up. I fixed the problem by fixing run.log. I wrote a Python script that uses the batch info listed in the log file to calculate the average gen loss and total loss for the epochs where that info was missing. I tried to compute these the same way as in cryodrgn's train_vae.py. Then, I just replaced the epoch summary lines with new ones where those values are filled in. Happy to share this script if others run into this problem in the future. Anyway, with the fixed run.log in place, cryodrgn analyze works properly.
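For readers who want to try the same kind of repair, here is a minimal sketch of the idea. It is not the script mentioned above. It assumes the batch and summary line layouts quoted earlier in this thread, uses a plain unweighted mean over the batch lines that happen to be logged (so it only approximates whatever train_vae.py accumulates), and writes both values with six decimals. `repair_log` and the other names are hypothetical.

```python
import math
import re

# Assumed layouts, copied from the log excerpts quoted in this thread.
BATCH = re.compile(
    r"\[Train Epoch: (\d+)/\d+\].*gen loss=(\S+), kld=\S+, beta=\S+, loss=(\S+)"
)
SUMMARY = re.compile(
    r"(=====> Epoch: (\d+) Average gen loss = )(\S+)(, KLD = \S+, total loss = )([^;\s]+)"
)


def _is_finite(token):
    try:
        return math.isfinite(float(token))
    except ValueError:
        return False


def repair_log(log_text):
    """Fill non-finite epoch averages with the mean of that epoch's logged batches."""
    lines = log_text.splitlines()

    # Pass 1: accumulate [gen loss sum, total loss sum, count] per epoch.
    sums = {}
    for line in lines:
        match = BATCH.search(line)
        if match and _is_finite(match.group(2)) and _is_finite(match.group(3)):
            acc = sums.setdefault(int(match.group(1)), [0.0, 0.0, 0])
            acc[0] += float(match.group(2))
            acc[1] += float(match.group(3))
            acc[2] += 1

    # Pass 2: rewrite only the summary values that are not finite.
    repaired = []
    for line in lines:
        match = SUMMARY.search(line)
        if match and int(match.group(2)) in sums:
            gen_sum, total_sum, count = sums[int(match.group(2))]
            gen, total = match.group(3), match.group(5)
            if not _is_finite(gen):
                gen = f"{gen_sum / count:.6f}"
            if not _is_finite(total):
                total = f"{total_sum / count:.6f}"
            line = (line[:match.start()] + match.group(1) + gen
                    + match.group(4) + total + line[match.end():])
        repaired.append(line)
    return "\n".join(repaired)
```

The KLD field is left untouched, so an `inf` there stays as it is. It would be safest to write the result to a copy and keep the original run.log as a backup.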