Closed gmontoya-dk closed 4 years ago
This error can sometimes occur when there is not enough free RAM available on your GPU. Topaz does not require a lot of GPU RAM, but if you had other processes using the same GPU that could be the problem.
I recommend checking the GPU usage with "nvidia-smi" to see if another program is using up the GPU RAM. If there isn't one or the problem persists, then please send me your CUDA version, GPU info (output of nvidia-smi), and pytorch version so I can help debug further.
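To gather all of the requested debug info in one go, a small stdlib-only sketch (the `environment_report` helper name is ours; it assumes `nvidia-smi` may be on PATH and PyTorch may be installed, and degrades gracefully when either is missing):

```python
# Sketch: collect the CUDA driver version, GPU info (nvidia-smi output),
# and the PyTorch version/CUDA build in one report.
import shutil
import subprocess

def environment_report() -> str:
    """Return nvidia-smi output plus the PyTorch/CUDA build versions."""
    lines = []
    if shutil.which("nvidia-smi"):
        smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        lines.append(smi.stdout.strip())
    else:
        lines.append("nvidia-smi not found on PATH")
    try:
        import torch
        # torch.version.cuda is the CUDA version PyTorch was *built* against,
        # which can differ from the driver's CUDA version shown by nvidia-smi.
        lines.append(f"PyTorch {torch.__version__}, built for CUDA {torch.version.cuda}")
    except ImportError:
        lines.append("PyTorch not installed in this environment")
    return "\n".join(lines)

if __name__ == "__main__":
    print(environment_report())
```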
Thanks. I am running cryoSPARC too. It is CUDA 10; this is the output. Can the process be run using the 2 GPUs?
I tried -d 0 1 and -d 0 -d 1, but just one GPU card was used.
Mon Mar 23 18:43:51 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:17:00.0 Off |                  N/A |
| 34%   54C    P2    85W / 250W |    234MiB / 10989MiB |     59%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:65:00.0  On |                  N/A |
| 31%   54C    P2    89W / 250W |    659MiB / 10989MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     73088      C   python                                       223MiB |
|    1     29307      G   /usr/bin/gnome-shell                         114MiB |
|    1     30287      G   gnome-control-center                           6MiB |
|    1     51526      G   /usr/bin/X                                   141MiB |
|    1     80520      C   python                                       385MiB |
+-----------------------------------------------------------------------------+
Looks like you have plenty of RAM. Topaz only uses one GPU when you run train (in my experience it wouldn't benefit from more GPUs anyway, because training is mostly CPU- and memory-transfer-bottlenecked). You can switch between them using either -d 0 or -d 1. If you rerun the command, does the error recur?
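One way to pin a Python process to a single GPU, analogous to switching between -d 0 and -d 1 (a sketch using NVIDIA's `CUDA_VISIBLE_DEVICES` environment variable, not Topaz's internals; the `pin_gpu` helper name is ours):

```python
# Sketch: restrict the process to one GPU by index. This must be set
# before CUDA is initialized (i.e. before the first CUDA call in torch),
# after which the chosen GPU appears to the process as device 0.
import os

def pin_gpu(index: int) -> None:
    """Expose only the GPU with the given index to this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)

pin_gpu(1)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```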
Yes, the error shows up again. It seems to work when I do -d -1, i.e. CPU only.
What cudatoolkit version do you have installed with pytorch? You can see this if you run "conda list" in your topaz environment (assuming you installed with conda).
cudatoolkit 9.0 h13b8566_0
Could that be the problem? nvidia-smi says it is using CUDA 10.
Sounds like that is the likely culprit. I recommend upgrading cudatoolkit to version 10 and seeing if that resolves the problem.
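The mismatch in this thread is between the conda cudatoolkit (9.0) and the CUDA version the driver reports (10.0). A tiny illustrative helper for spotting that kind of major-version disagreement (purely a sketch; the function names are ours, and the inputs are the version strings from `conda list` and the `nvidia-smi` header):

```python
# Sketch: flag a major-version disagreement between the cudatoolkit
# installed alongside PyTorch and the CUDA version the driver reports.
# In this thread, 9.0 vs. 10.0 coincided with the cuDNN failure, and
# aligning them resolved it.
def cuda_major(version: str) -> int:
    """Return the major component of a CUDA version string like '10.0'."""
    return int(version.split(".")[0])

def toolkit_matches_driver(toolkit: str, driver: str) -> bool:
    """True when toolkit and driver agree on the CUDA major version."""
    return cuda_major(toolkit) == cuda_major(driver)

# The versions from this thread:
print(toolkit_matches_driver("9.0", "10.0"))   # cudatoolkit 9.0, driver CUDA 10.0
print(toolkit_matches_driver("10.0", "10.0"))  # after upgrading cudatoolkit
```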
I have installed it with CUDA 10 and now the problem is solved.
THX
best
G.
Great, glad to hear that worked.
Dear developers, I am installing Topaz and testing the software and found a problem. The initial steps of the tutorial (through convert) were fine, but it seems to complain about CUDA. Any ideas?
/home/guillermo/miniconda3/bin/topaz train -n 400 --num-workers=8 --train-images data/EMPIAR-10025/processed/micrographs/ --train-targets data/EMPIAR-10025/processed/particles.txt --save-prefix=saved_models/EMPIAR-10025/model -o saved_models/EMPIAR-10025/model_training.txt
Loading model: resnet8
Model parameters: units=32, dropout=0.0, bn=on
Loading pretrained model: resnet8_u32
Receptive field: 71
Using device=0 with cuda=True
Loaded 30 training micrographs with 1500 labeled particles
source split p_observed num_positive_regions total_regions
0 train 0.00163 43500 26669790
Specified expected number of particle per micrograph = 400.0
With radius = 3
Setting pi = 0.0130484716977524
minibatch_size=256, epoch_size=1000, num_epochs=10
Traceback (most recent call last):
  File "/home/guillermo/miniconda3/bin/topaz", line 11, in <module>
    load_entry_point('topaz-em==0.2.3', 'console_scripts', 'topaz')()
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/main.py", line 146, in main
    args.func(args)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/commands/train.py", line 685, in main
    , save_prefix=save_prefix, use_cuda=use_cuda, output=output)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/commands/train.py", line 572, in fit_epochs
    , use_cuda=use_cuda, output=output)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/commands/train.py", line 552, in fit_epoch
    metrics = step_method.step(X, Y)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/methods.py", line 103, in step
    score = self.model(X).view(-1)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/model/classifier.py", line 28, in forward
    z = self.features(x)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/model/features/resnet.py", line 54, in forward
    z = self.features(x)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/model/features/resnet.py", line 335, in forward
    h = self.conv0(x)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
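A way to check whether cuDNN works at all in an environment, independent of Topaz (a sketch; the `cudnn_smoke_test` name is ours, and it assumes only that PyTorch may be installed):

```python
# Minimal cuDNN smoke test: run one small convolution on the GPU, which
# exercises the same cuDNN code path that fails in the traceback above.
# If this raises CUDNN_STATUS_EXECUTION_FAILED, the problem lies in the
# environment (e.g. a cudatoolkit/driver mismatch), not in Topaz itself.
def cudnn_smoke_test() -> bool:
    """True iff a small GPU convolution succeeds; False when no torch/GPU."""
    try:
        import torch
        import torch.nn as nn
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    conv = nn.Conv2d(1, 8, kernel_size=3).cuda()
    x = torch.randn(1, 1, 32, 32, device="cuda")
    # 32x32 input with a 3x3 kernel and no padding -> 30x30 output.
    return conv(x).shape == (1, 8, 30, 30)

if __name__ == "__main__":
    print(cudnn_smoke_test())
```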