Closed gmontoya-dk closed 4 years ago
This error can sometimes occur when there is not enough free RAM available on your GPU. Topaz does not require a lot of GPU RAM, but if you had other processes using the same GPU that could be the problem.
I recommend checking the GPU usage with "nvidia-smi" to see if another program is using up the GPU RAM. If there isn't one or the problem persists, then please send me your CUDA version, GPU info (output of nvidia-smi), and pytorch version so I can help debug further.
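To gather all of the requested debug info in one go, a small stdlib-only sketch (the `environment_report` helper name is ours; it assumes `nvidia-smi` may be on PATH and PyTorch may be installed, and degrades gracefully when either is missing):

```python
# Sketch: collect the CUDA driver version, GPU info (nvidia-smi output),
# and the PyTorch version/CUDA build in one report.
import shutil
import subprocess

def environment_report() -> str:
    """Return nvidia-smi output plus the PyTorch/CUDA build versions."""
    lines = []
    if shutil.which("nvidia-smi"):
        smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        lines.append(smi.stdout.strip())
    else:
        lines.append("nvidia-smi not found on PATH")
    try:
        import torch
        # torch.version.cuda is the CUDA version PyTorch was *built* against,
        # which can differ from the driver's CUDA version shown by nvidia-smi.
        lines.append(f"PyTorch {torch.__version__}, built for CUDA {torch.version.cuda}")
    except ImportError:
        lines.append("PyTorch not installed in this environment")
    return "\n".join(lines)

if __name__ == "__main__":
    print(environment_report())
```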
Thanks. I am running cryoSPARC too. It is CUDA 10; this is the output. Can the process be run using the 2 GPUs?
I tried -d 0 1 and -d 0 -d 1, but just one GPU card was used.
Mon Mar 23 18:43:51 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:17:00.0 Off |                  N/A |
| 34%   54C    P2    85W / 250W |    234MiB / 10989MiB |     59%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:65:00.0  On |                  N/A |
| 31%   54C    P2    89W / 250W |    659MiB / 10989MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     73088      C   python                                       223MiB |
|    1     29307      G   /usr/bin/gnome-shell                         114MiB |
|    1     30287      G   gnome-control-center                           6MiB |
|    1     51526      G   /usr/bin/X                                   141MiB |
|    1     80520      C   python                                       385MiB |
+-----------------------------------------------------------------------------+
Looks like you have plenty of RAM. Topaz only uses one GPU when you run train (in my experience it wouldn't benefit from more GPUs anyway, because training is mostly CPU- and memory-transfer-bottlenecked). You can switch between them using either -d 0 or -d 1. If you rerun the command, does the error recur?
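One way to pin a Python process to a single GPU, analogous to switching between -d 0 and -d 1 (a sketch using NVIDIA's `CUDA_VISIBLE_DEVICES` environment variable, not Topaz's internals; the `pin_gpu` helper name is ours):

```python
# Sketch: restrict the process to one GPU by index. This must be set
# before CUDA is initialized (i.e. before the first CUDA call in torch),
# after which the chosen GPU appears to the process as device 0.
import os

def pin_gpu(index: int) -> None:
    """Expose only the GPU with the given index to this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)

pin_gpu(1)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```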
Yes, the error shows up again. It seems to work when I do -d -1, i.e. CPU only.
What cudatoolkit version do you have installed with pytorch? You can see this if you run "conda list" in your topaz environment (assuming you installed with conda).
cudatoolkit 9.0 h13b8566_0
Could that be the problem? nvidia-smi says it is using CUDA 10.
Sounds like that is the likely culprit. I recommend upgrading cudatoolkit to version 10 and seeing if that resolves the problem.
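The mismatch in this thread is between the conda cudatoolkit (9.0) and the CUDA version the driver reports (10.0). A tiny illustrative helper for spotting that kind of major-version disagreement (purely a sketch; the function names are ours, and the inputs are the version strings from `conda list` and the `nvidia-smi` header):

```python
# Sketch: flag a major-version disagreement between the cudatoolkit
# installed alongside PyTorch and the CUDA version the driver reports.
# In this thread, 9.0 vs. 10.0 coincided with the cuDNN failure, and
# aligning them resolved it.
def cuda_major(version: str) -> int:
    """Return the major component of a CUDA version string like '10.0'."""
    return int(version.split(".")[0])

def toolkit_matches_driver(toolkit: str, driver: str) -> bool:
    """True when toolkit and driver agree on the CUDA major version."""
    return cuda_major(toolkit) == cuda_major(driver)

# The versions from this thread:
print(toolkit_matches_driver("9.0", "10.0"))   # cudatoolkit 9.0, driver CUDA 10.0
print(toolkit_matches_driver("10.0", "10.0"))  # after upgrading cudatoolkit
```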
I have installed it with CUDA 10 and now the problem is solved.
THX
best
G.
Great, glad to hear that worked.
Dear developers, I am installing Topaz and testing the software and found a problem. The initial steps of the tutorial (through convert) were fine, but it seems to complain about CUDA. Any ideas?
/home/guillermo/miniconda3/bin/topaz train -n 400 --num-workers=8 --train-images data/EMPIAR-10025/processed/micrographs/ --train-targets data/EMPIAR-10025/processed/particles.txt --save-prefix=saved_models/EMPIAR-10025/model -o saved_models/EMPIAR-10025/model_training.txt
Loading model: resnet8
Model parameters: units=32, dropout=0.0, bn=on
Loading pretrained model: resnet8_u32
Receptive field: 71
Using device=0 with cuda=True
Loaded 30 training micrographs with 1500 labeled particles
source split p_observed num_positive_regions total_regions
0 train 0.00163 43500 26669790
Specified expected number of particle per micrograph = 400.0
With radius = 3
Setting pi = 0.0130484716977524
minibatch_size=256, epoch_size=1000, num_epochs=10
Traceback (most recent call last):
  File "/home/guillermo/miniconda3/bin/topaz", line 11, in <module>
    load_entry_point('topaz-em==0.2.3', 'console_scripts', 'topaz')()
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/main.py", line 146, in main
    args.func(args)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/commands/train.py", line 685, in main
    , save_prefix=save_prefix, use_cuda=use_cuda, output=output)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/commands/train.py", line 572, in fit_epochs
    , use_cuda=use_cuda, output=output)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/commands/train.py", line 552, in fit_epoch
    metrics = step_method.step(X, Y)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/methods.py", line 103, in step
    score = self.model(X).view(-1)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/model/classifier.py", line 28, in forward
    z = self.features(x)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/model/features/resnet.py", line 54, in forward
    z = self.features(x)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/topaz/model/features/resnet.py", line 335, in forward
    h = self.conv0(x)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guillermo/miniconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
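A way to check whether cuDNN works at all in an environment, independent of Topaz (a sketch; the `cudnn_smoke_test` name is ours, and it assumes only that PyTorch may be installed):

```python
# Minimal cuDNN smoke test: run one small convolution on the GPU, which
# exercises the same cuDNN code path that fails in the traceback above.
# If this raises CUDNN_STATUS_EXECUTION_FAILED, the problem lies in the
# environment (e.g. a cudatoolkit/driver mismatch), not in Topaz itself.
def cudnn_smoke_test() -> bool:
    """True iff a small GPU convolution succeeds; False when no torch/GPU."""
    try:
        import torch
        import torch.nn as nn
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    conv = nn.Conv2d(1, 8, kernel_size=3).cuda()
    x = torch.randn(1, 1, 32, 32, device="cuda")
    # 32x32 input with a 3x3 kernel and no padding -> 30x30 output.
    return conv(x).shape == (1, 8, 30, 30)

if __name__ == "__main__":
    print(cudnn_smoke_test())
```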