Open WJ44 opened 4 months ago
This also happens when running outside a notebook.
/opt/conda/conda-bld/pytorch_1708025845868/work/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "/home/wesley/projects/ARES/reproduce.py", line 16, in <module>
results = ares.train_classifier()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wesley/projects/ARES/ares/ares.py", line 139, in train_classifier
binary_classifer_config(**self.classifier_model_config)
File "/home/wesley/projects/ARES/ares/binary_classifier.py", line 164, in binary_classifer_config
model, avg_train_losses, avg_valid_losses, eval_dataloader, inference_times = train_and_evaluate_model(train_and_eval_settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wesley/projects/ARES/ares/LLM_as_a_Judge_Adaptation/General_Binary_Classifier.py", line 800, in train_and_evaluate_model
loss.backward()
File "/home/wesley/miniconda3/envs/ares/lib/python3.11/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/home/wesley/miniconda3/envs/ares/lib/python3.11/site-packages/torch/autograd/__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
I have tried it on different hardware and OSes and run into the same problem everywhere.
This happens even in a clean install in a clean VM when trying the example code for training a classifier.
Solved by #71
When attempting to train an LLM judge I get the following error.
I am using the xsmall model to make testing quicker and made the necessary change to the embedding size in the CustomBERTModel class. The same error happens when using the (default) large model. I am also using a shortened synthetic queries file to make testing quicker, but the same happens with the example file provided.
I am somewhat at a loss, since I am sure it was working earlier.
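For anyone hitting the same log: the assertion `t >= 0 && t < n_classes` in `nll_loss_forward_reduce_cuda_kernel_2d` means that at least one target label passed to the loss was outside the range `[0, n_classes)`. Because CUDA kernels run asynchronously, the `CUBLAS_STATUS_EXECUTION_FAILED` raised in `loss.backward()` is probably a follow-on effect of that earlier assertion rather than a separate problem; setting the environment variable `CUDA_LAUNCH_BLOCKING=1` or running on CPU usually makes the failing line clearer. Below is a minimal sketch of a label check that could be run before training. The helper name `find_out_of_range_labels` and the sample labels are hypothetical and are not part of ARES.

```python
def find_out_of_range_labels(labels, n_classes):
    """Return (index, label) pairs for labels outside [0, n_classes).

    Hypothetical helper for illustration; not part of ARES or PyTorch.
    """
    return [
        (i, label)
        for i, label in enumerate(labels)
        if not (0 <= label < n_classes)
    ]


# Made-up labels for a binary classifier (n_classes = 2).
labels = [0, 1, 1, -1, 0, 2]
bad = find_out_of_range_labels(labels, n_classes=2)
print(bad)  # [(3, -1), (5, 2)]
```

If the check reports any entries, the label column of the training file (or the mapping from raw labels to class indices) is the first place to look.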