ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu
GNU General Public License v3.0

landscape_analysis_full: expected a 'cuda' device type for generator but found 'cpu' #300

Open DerLorenz opened 1 year ago

DerLorenz commented 1 year ago

Hi,

I am using cryodrgn v23 on an HPC cluster. So far, I have run the complete 'standard' cryodrgn pipeline successfully, as many times before. With my new dataset I was eager to try the landscape analysis. Everything works nicely until I try to run analyze_landscape_full. After volume generation I get the following error:

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

I checked my pytorch and everything seems fine:

python
>>> import torch
>>> torch.__version__
'2.0.1+cu117'
>>> print(torch.cuda.is_available())
True

Also, the previous cryodrgn jobs I ran used CUDA/GPUs without any issue. I am not sure what the problem could be here, and any help would be very much appreciated.

Best, Lorenz

michal-g commented 12 months ago

Hey Lorenz, can you try this again with the latest version? Also, can you share the analyze_landscape_full command you were using when you got this error?

DerLorenz commented 12 months ago

Hey Michal,

Thanks for the response. I will try it with the newest version by tomorrow and post an update here. I submitted using the following parameters (according to the .out file):

(INFO) (analyze_landscape_full.py) (29-Aug-23 12:21:33) Loaded configuration: {'cmd': ['/path/to/Anaconda3/envs/cryodrgn_v23/bin/cryodrgn', 'train_vae', 'particles.128.ft.txt', '--preprocessed', '--poses', 'pose.pkl', '--ctf', 'ctf.pkl', '--zdim', '8', '-n', '50', '-o', 'vae_128_8', '--multigpu'],

I submitted to our HPC cluster using slurm. Here I allocated a single core and a single GPU with more than enough memory. Maybe the --multigpu caused the issue here? Though I did not specify it, and I do not even see this option when checking cryodrgn analyze_landscape --help

Thanks for looking into this!

DerLorenz commented 9 months ago

Hi Michal,

Sorry for being unresponsive for so long. I redid the analysis using cryoDRGN 3.x. I submitted the following command using SLURM to our HPC:

cryodrgn analyze_landscape_full vae_128_8 49 --landscape-dir landscape_masked.49 -o landscape_masked.49/landscape_full

The following batch parameters were set:

sbatch -p g --mem=35G --gres=gpu:1 --time=08:00:00

After generating the volume embeddings, the job failed with the following error:

Traceback (most recent call last):
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/bin/cryodrgn", line 8, in <module>
    sys.exit(main())
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/cryodrgn/__main__.py", line 72, in main
    args.func(args)
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/cryodrgn/commands/analyze_landscape_full.py", line 333, in main
    embeddings_all = train_model(z, embeddings, outdir, zfile, args)
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/cryodrgn/commands/analyze_landscape_full.py", line 268, in train_model
    train(args, model, device, train_loader, optimizer, epoch)
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/cryodrgn/commands/analyze_landscape_full.py", line 114, in train
    for batch_idx, (data, target) in enumerate(train_loader):
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 386, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1084, in __init__
    self._reset(loader, first_iter=True)
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1117, in _reset
    self._try_put_index()
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1351, in _try_put_index
    index = self._next_index()
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 620, in _next_index
    return next(self._sampler_iter) # may raise StopIteration
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/torch/utils/data/sampler.py", line 282, in __iter__
    for idx in self.sampler:
  File "/groups/haselbach/software/Anaconda3/envs/cryodrgn_V3x/lib/python3.9/site-packages/torch/utils/data/sampler.py", line 164, in __iter__
    yield from torch.randperm(n, generator=generator).tolist()
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
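The last frame of the traceback is torch.randperm(n, generator=generator) inside the DataLoader's sampler. A commonly reported trigger for this message in PyTorch is a process whose default tensor type or device has been switched to CUDA while the sampler still draws from a CPU generator. Whether that is what happens inside analyze_landscape_full is an assumption and is not confirmed anywhere in this thread. Under that assumption, a minimal sketch of the usual workaround is to hand the DataLoader a generator that lives on the default device. make_shuffled_loader is a hypothetical helper for illustration, not part of cryodrgn:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_shuffled_loader(data, target, batch_size=4):
    """Hypothetical helper: shuffling DataLoader with an RNG on the default device."""
    # torch.empty(0).device shows where new tensors are created by default.
    # It reports 'cuda' if the process changed the default tensor type or device.
    default_device = torch.empty(0).device
    # An explicit generator stops the sampler from falling back to a CPU one.
    generator = torch.Generator(device=default_device)
    dataset = TensorDataset(data, target)
    return DataLoader(
        dataset, batch_size=batch_size, shuffle=True, generator=generator
    )


# Toy data standing in for the volume embeddings (assumed shapes).
data = torch.arange(10, dtype=torch.float32).unsqueeze(1)
target = torch.arange(10)
loader = make_shuffled_loader(data, target)
```

On a process that never changed its default device, this behaves like an ordinary shuffling DataLoader, so the sketch should be harmless to try on either setup.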

DerLorenz commented 9 months ago

Again pytorch seems to be working fine.
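Note that torch.cuda.is_available() returning True only shows that a GPU is visible. It does not show where new tensors and random generators are created by default, which is what the error message is about. A small diagnostic sketch that could be run inside the same Python process to look for such a mismatch is given here. describe_rng_devices is a hypothetical helper, not a cryodrgn or PyTorch function:

```python
import torch


def describe_rng_devices():
    """Hypothetical diagnostic: compare default tensor and generator devices."""
    # Device on which a freshly created tensor lands.
    default_tensor_device = torch.empty(0).device.type
    # Device of a generator created without arguments.
    default_generator_device = torch.Generator().device.type
    return {
        "cuda_available": torch.cuda.is_available(),
        "default_tensor_device": default_tensor_device,
        "default_generator_device": default_generator_device,
        "mismatch": default_tensor_device != default_generator_device,
    }


info = describe_rng_devices()
print(info)
```

If "mismatch" is reported as True, the default tensor device and the default generator device differ, which would be consistent with the RuntimeError above, though it would not prove that this is the cause.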

michal-g commented 9 months ago

thanks for the update, let me try to replicate this error and I'll get back to you!

DerLorenz commented 2 months ago

Just asking if there is any update here. I would still like to try this tool.

michal-g commented 2 months ago

I haven't been able to replicate this error yet, but we are presently working on a new refactored version of this tool, which will hopefully help us resolve issues such as this. We should have a further update by the end of the month!

DerLorenz commented 5 days ago

Any updates?

michal-g commented 5 days ago

Still haven't seen this error on our side — have you tried again with the latest version (v3.4.2)?