ultralytics / .github

Ultralytics GitHub default .github repository.
https://ultralytics.com
GNU Affero General Public License v3.0

'RuntimeError: GET was unable to find an engine to execute this computation' #43

Open VikasAmaraneni opened 6 months ago

VikasAmaraneni commented 6 months ago

Hello everyone, I'm using PyTorch 2.2.1 with CUDA 12.1 and Python 3.12.2, and I'm getting the following error:

'RuntimeError: RuntimeError Traceback (most recent call last)
Cell In[16], line 47
     45 num_epochs = 10
     46 for epoch in range(num_epochs):
---> 47     train_loss, train_time = train(model, train_loader, criterion, optimizer)
     48     val_loss, val_accuracy, val_time = validate(model, val_loader, criterion)
     49     print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Train Time: {train_time:.2f}s, '
     50           f'Val Loss: {val_loss:.4f}, Val Accuracy: {val_accuracy:.4f}, Val Time: {val_time:.2f}s')

Cell In[16], line 13, in train(model, train_loader, criterion, optimizer)
     11 outputs = model(inputs)
     12 loss = criterion(outputs, labels)  # Calculate loss between model outputs and ground truth
---> 13 loss.backward()
     14 optimizer.step()
     15 running_loss += loss.item() * inputs.size(0)  # Update running loss

File ~/.conda/envs/torchTest1/lib/python3.12/site-packages/torch/_tensor.py:522, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    512 if has_torch_function_unary(self):
    513     return handle_torch_function(
    514         Tensor.backward,
    515         (self,),
   (...)
    520         inputs=inputs,
    521     )
--> 522 torch.autograd.backward(
    523     self, gradient, retain_graph, create_graph, inputs=inputs
    524 )

File ~/.conda/envs/torchTest1/lib/python3.12/site-packages/torch/autograd/__init__.py:266, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    261     retain_graph = create_graph
    263 # The reason we repeat the same comment below is that
    264 # some Python versions print out the first line of a multi-line function
    265 # calls in the traceback and some print out the last line
--> 266 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    267     tensors,
    268     grad_tensors_,
    269     retain_graph,
    270     create_graph,
    271     inputs,
    272     allow_unreachable=True,
    273     accumulate_grad=True,
    274 )

RuntimeError: GET was unable to find an engine to execute this computation'

Originally posted by @VikasAmaraneni in https://github.com/ultralytics/ultralytics/issues/4060#issuecomment-2052003987

shuyueW1991 commented 5 months ago

Hi there. I fixed a similar problem by matching the versions of torch, torchvision, and torchaudio to the combinations listed on the official PyTorch release page. One combination that worked for me is: torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
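To see whether an environment already matches that set, something like the following could be used. It is only a minimal sketch: the names `EXPECTED`, `installed_versions`, and `mismatches` are made up for illustration, and the expected versions are just the ones quoted in this comment, not the only valid combination. It reads package metadata, so it does not need to import torch.

```python
from importlib.metadata import PackageNotFoundError, version

# Version set quoted in the comment above (an assumption for this sketch;
# other matched sets exist for other PyTorch releases).
EXPECTED = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}


def installed_versions(names):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def mismatches(installed, expected):
    """List (package, installed, expected) for every package that differs.

    Local build suffixes such as "+cu121" are ignored in the comparison.
    """
    bad = []
    for name, want in expected.items():
        have = installed.get(name)
        base = have.split("+")[0] if have else None
        if base != want:
            bad.append((name, have, want))
    return bad


if __name__ == "__main__":
    current = installed_versions(EXPECTED)
    for name, have, want in mismatches(current, EXPECTED):
        print(f"{name}: installed {have}, expected {want}")
```

If nothing is printed, the three packages match the expected set.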

VikasAmaraneni commented 5 months ago

Thank you so much, it worked.

shuyueW1991 commented 4 months ago

I ran into the problem again. I now think the real fix is not matching the versions of torch, torchvision, and torchaudio. The solution should be:

  1. Run echo $LD_LIBRARY_PATH.
  2. Go to the directory it points to.
  3. Rename the problematic libcudnn_cnn_train.so.8 (or whichever library the error message mentions), keeping it as a backup copy.
  4. Now the system no longer picks up CUDA/cuDNN libraries through this env var. The underlying reason is that torch brings its own CUDA/cuDNN, and we need those to be the ones that get loaded.
  5. Done.
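The first two steps (inspecting LD_LIBRARY_PATH and finding which cuDNN libraries sit in those directories) can be sketched in Python as below. This is only an illustration: `find_cudnn_libs` is a hypothetical helper, and the `libcudnn*.so*` pattern is an assumption based on the file name mentioned in step 3. It only lists candidates and does not rename anything.

```python
import glob
import os


def find_cudnn_libs(ld_library_path):
    """Return cuDNN shared libraries found in an LD_LIBRARY_PATH-style string."""
    hits = []
    for directory in ld_library_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries
        # Assumed naming pattern, e.g. libcudnn_cnn_train.so.8
        hits.extend(sorted(glob.glob(os.path.join(directory, "libcudnn*.so*"))))
    return hits


if __name__ == "__main__":
    for path in find_cudnn_libs(os.environ.get("LD_LIBRARY_PATH", "")):
        print(path)
```

Any path printed here is a system-level cuDNN library that could be loaded instead of the one bundled with torch, which makes it a candidate for the rename in step 3.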