Error while training LLM judge

WJ44 commented 4 months ago

When attempting to train an LLM judge, I get the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 16
      3 classifier_config = {
      4     "training_dataset": ["nq_synthetic_queries.tsv"],
      5     "validation_set": ["datasets/example_files/nq_labeled_output.tsv"],
   (...)
     12     "model_choice": "microsoft/deberta-v3-xsmall",
     13 }
     15 ares = ARES(classifier_model=classifier_config)
---> 16 results = ares.train_classifier()
     17 print(results)

File ~/projects/ARES/ares/ares.py:134, in ARES.train_classifier(self)
    132     print("Skipping binary classifier configuration due to missing parameters.")
    133 else:
--> 134     binary_classifer_config(**self.classifier_model_config)

File ~/projects/ARES/ares/binary_classifier.py:164, in binary_classifer_config(training_dataset, validation_set, label_column, num_epochs, patience_value, learning_rate, training_dataset_path, validation_dataset_path, model_choice, validation_set_scoring, assigned_batch_size, gradient_accumulation_multiplier, number_of_runs, num_warmup_steps, training_row_limit, validation_row_limit)
    147 tokenized_datasets = initalize_dataset_for_tokenization(tokenizer, training_dataset_arrow, validation_dataset_arrow, test_dataset_arrow)
    149 train_and_eval_settings = {
    150     "number_of_runs": number_of_runs,
    151     "tokenized_datasets": tokenized_datasets,
   (...)
    161     "gradient_accumulation_multiplier": gradient_accumulation_multiplier
    162 }
--> 164 model, avg_train_losses, avg_valid_losses, eval_dataloader, inference_times = train_and_evaluate_model(train_and_eval_settings)
    166 total_predictions, total_references, metric = evaluate_model(model, model_choice, checkpoint_path, device, eval_dataloader, inference_times)
    168 print_and_save_model(total_predictions, total_references, checkpoint_path, metric)

File ~/projects/ARES/ares/LLM_as_a_Judge_Adaptation/General_Binary_Classifier.py:740, in train_and_evaluate_model(params)
    738 outputs = model(**new_batch)
    739 loss = criterion(outputs, batch['labels'].to(device))
--> 740 loss.backward()
    742 # Gradient accumulation
    743 gradient_accumulation_count += 1

File ~/miniconda3/envs/ares/lib/python3.11/site-packages/torch/_tensor.py:522, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    512 if has_torch_function_unary(self):
    513     return handle_torch_function(
    514         Tensor.backward,
    515         (self,),
   (...)
    520         inputs=inputs,
    521     )
--> 522 torch.autograd.backward(
    523     self, gradient, retain_graph, create_graph, inputs=inputs
    524 )

File ~/miniconda3/envs/ares/lib/python3.11/site-packages/torch/autograd/__init__.py:266, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    261     retain_graph = create_graph
    263 # The reason we repeat the same comment below is that
    264 # some Python versions print out the first line of a multi-line function
    265 # calls in the traceback and some print out the last line
--> 266 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    267     tensors,
    268     grad_tensors_,
    269     retain_graph,
    270     create_graph,
    271     inputs,
    272     allow_unreachable=True,
    273     accumulate_grad=True,
    274 )

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I am using the xsmall model to make testing quicker and made the necessary change to the embedding size in the CustomBERTModel class. The same error happens with the (default) large model. I am also using a shortened synthetic queries file to speed up testing, but the same thing happens with the provided example file.

I am somewhat at a loss, since I am sure it was working earlier.
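
A note on reading the traceback above: CUDA kernels are launched asynchronously, so the Python frame that raises (here `loss.backward()`) is not necessarily the call that actually failed. A minimal sketch of forcing synchronous launches through the `CUDA_LAUNCH_BLOCKING` environment variable, so the error surfaces closer to its source. The placement at the top of the reproduction script is an assumption; the ARES calls are left as comments because they are the ones already shown in the traceback.

```python
import os

# Force synchronous CUDA kernel launches so that errors are reported at the
# call that triggered them. This has to be set before CUDA is initialised,
# so the safest place is before torch (or anything importing it) is imported.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# The rest of the reproduction script would follow unchanged, for example:
#   ares = ARES(classifier_model=classifier_config)
#   results = ares.train_classifier()
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Running the same step on the CPU is another common way to get a more readable error message, at the cost of speed.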

WJ44 commented 4 months ago

This also happens outside a notebook.

/opt/conda/conda-bld/pytorch_1708025845868/work/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "/home/wesley/projects/ARES/reproduce.py", line 16, in <module>
    results = ares.train_classifier()
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wesley/projects/ARES/ares/ares.py", line 139, in train_classifier
    binary_classifer_config(**self.classifier_model_config)
  File "/home/wesley/projects/ARES/ares/binary_classifier.py", line 164, in binary_classifer_config
    model, avg_train_losses, avg_valid_losses, eval_dataloader, inference_times = train_and_evaluate_model(train_and_eval_settings)
                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wesley/projects/ARES/ares/LLM_as_a_Judge_Adaptation/General_Binary_Classifier.py", line 800, in train_and_evaluate_model
    loss.backward()
  File "/home/wesley/miniconda3/envs/ares/lib/python3.11/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/wesley/miniconda3/envs/ares/lib/python3.11/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
  0%|
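
The assertion `t >= 0 && t < n_classes` in the first line of this output comes from the NLL loss kernel and fires when a target class index falls outside the range `[0, n_classes)`. One thing that can trigger it is a label column containing values such as `-1`, `2` (for a two-class judge), or missing entries. The thread does not confirm that this was the cause here, so the following is only a sketch of a pre-training sanity check. The function name `find_invalid_labels` and the sample values are hypothetical and not part of ARES.

```python
import math


def find_invalid_labels(labels, n_classes):
    """Return (row_index, value) pairs for labels outside [0, n_classes).

    Hypothetical helper, not part of ARES. Values that are missing,
    non-numeric, non-finite, or non-integer are also reported.
    """
    invalid = []
    for index, value in enumerate(labels):
        try:
            as_float = float(value)
        except (TypeError, ValueError):
            # None or a non-numeric string cannot be a class index.
            invalid.append((index, value))
            continue
        if (
            not math.isfinite(as_float)          # NaN or infinity
            or as_float != int(as_float)         # fractional value
            or not 0 <= int(as_float) < n_classes
        ):
            invalid.append((index, value))
    return invalid


# Made-up label values for illustration, assuming a binary classifier.
sample_labels = [0, 1, "1", -1, 2, None, float("nan")]
bad_rows = find_invalid_labels(sample_labels, n_classes=2)
print([index for index, _ in bad_rows])
```

In practice the labels would be read from the label column of the training and validation TSV files before they are handed to the classifier, and any reported rows would be fixed or dropped.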

WJ44 commented 3 months ago

I have tried it on different hardware and operating systems and run into the same problem everywhere.

WJ44 commented 3 months ago

This happens even with a clean install on a clean VM when running the example code for training a classifier.

WJ44 commented 2 months ago

Solved by #71

stanford-futuredata / ARES

Error while training LLM judge #66