huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

RuntimeError: CUDA error: device-side assert triggered #478

Closed aaronbriel closed 10 months ago

aaronbriel commented 10 months ago

I ran into the following error, which presumably occurred at the end of training, since the logs seem to indicate that training completed:

...
{'embedding_loss': 0.0038, 'learning_rate': 1.584967320261438e-05, 'epoch': 0.08}
{'embedding_loss': 0.0143, 'learning_rate': 1.601307189542484e-05, 'epoch': 0.08}
{'embedding_loss': 0.0036, 'learning_rate': 1.6176470588235296e-05, 'epoch': 0.08}
{'embedding_loss': 0.0193, 'learning_rate': 1.6339869281045753e-05, 'epoch': 0.08}
{'eval_embedding_loss': 0.066, 'learning_rate': 1.6339869281045753e-05, 'epoch': 0.08}
{'embedding_loss': 0.0031, 'learning_rate': 1.650326797385621e-05, 'epoch': 0.08}
{'embedding_loss': 0.004, 'learning_rate': 1.6666666666666667e-05, 'epoch': 0.08}
{'embedding_loss': 0.0043, 'learning_rate': 1.6830065359477125e-05, 'epoch': 0.08}
{'embedding_loss': 0.0152, 'learning_rate': 1.6993464052287582e-05, 'epoch': 0.09}
{'embedding_loss': 0.0046, 'learning_rate': 1.715686274509804e-05, 'epoch': 0.09}
{'eval_embedding_loss': 0.065, 'learning_rate': 1.715686274509804e-05, 'epoch': 0.09}
{'train_runtime': 9878.0429, 'train_samples_per_second': 79.271, 'train_steps_per_second': 1.239, 'epoch': 0.09}

Here is the stack trace:

  File "/home/ubuntu/repos/intent-classifier/src/intent_classifier.py", line 321, in train
    trainer.train()
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/setfit/trainer.py", line 411, in train
    self.train_classifier(*train_parameters, args=args)
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/setfit/trainer.py", line 788, in train_classifier
    self.model.fit(
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/setfit/modeling.py", line 289, in fit
    for batch in tqdm(dataloader, desc="Iteration", disable=not show_progress_bar, leave=False):
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 676, in _next_data
    data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 70, in pin_memory
    return [pin_memory(sample, device) for sample in data]  # Backwards compatibility.
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 70, in <listcomp>
    return [pin_memory(sample, device) for sample in data]  # Backwards compatibility.
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 63, in pin_memory
    return type(data)({k: pin_memory(sample, device) for k, sample in data.items()})  # type: ignore[call-arg]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 63, in <dictcomp>
    return type(data)({k: pin_memory(sample, device) for k, sample in data.items()})  # type: ignore[call-arg]
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/venv/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
    return data.pin_memory(device)
           ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [19,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [23,0,0] Assertion `t >= 0 && t < n_classes` failed.
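
As the error message above notes, the stack trace may point at the wrong call site because CUDA kernel errors are reported asynchronously. A minimal way to get a synchronous, accurate trace (not part of my original script) is to set the suggested CUDA_LAUNCH_BLOCKING variable before torch initializes CUDA:

import os

# Must be set before the first CUDA call, i.e. before any tensor touches the GPU
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported after setting the env var on purpose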

I have a clean dataset consisting of a CSV with 30 classes and 50 samples per class. Here is how I prepared it to be compatible with SetFit training:

from datetime import datetime

from datasets import concatenate_datasets, load_dataset
import pandas as pd
from setfit import SetFitModel, Trainer, TrainingArguments
from transformers import (
    AutoTokenizer,
    EvalPrediction,
    EarlyStoppingCallback
)

try:
    df = pd.read_csv(self.data_file, encoding='utf-8')
except FileNotFoundError as exc:
    raise FileNotFoundError(f"Data file not found: {self.data_file}") from exc

labels = df['label'].unique().tolist()

# Load the full dataset
full_dataset = load_dataset('csv', data_files=self.data_file)['train']
train_subsets = []
eval_subsets = []
test_subsets = []

# Applying stratification to ensure balanced train and test subsets
unique_labels = full_dataset.unique('label_text')
for unique_label in unique_labels:
    # Filter the dataset for the current label
    label_dataset = full_dataset.filter(lambda example: example['label_text'] == unique_label)

    # Shuffle the dataset for this label
    label_dataset = label_dataset.shuffle(seed=42)

    # Split this label's 50 examples into train, eval, and test subsets
    train_subset = label_dataset.select(range(30))
    eval_subset = label_dataset.select(range(30, 40))
    test_subset = label_dataset.select(range(40, 50))

    # Add the subsets to the respective lists
    train_subsets.append(train_subset)
    eval_subsets.append(eval_subset)
    test_subsets.append(test_subset)

# Concatenate the subsets to form the final train, eval, and test datasets
train_dataset = concatenate_datasets(train_subsets)
eval_dataset = concatenate_datasets(eval_subsets)
test_dataset = concatenate_datasets(test_subsets)

# Shuffle the final datasets
train_dataset = train_dataset.shuffle(seed=42)
eval_dataset = eval_dataset.shuffle(seed=42)
test_dataset = test_dataset.shuffle(seed=42)
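
Note that the select(range(...)) calls above assume every label has exactly 50 rows. A quick per-label count check (hypothetical, not in my original script) would catch any label that breaks that assumption:

from collections import Counter

# The 30/10/10 split above is only valid if each label has exactly 50 rows;
# Dataset.select fails on out-of-range indices otherwise.
counts = Counter(full_dataset['label_text'])
bad = {label: n for label, n in counts.items() if n != 50}
assert not bad, f"Labels without exactly 50 samples: {bad}"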

Here is how I set up training, based on suggestions given by @tomaarsen here:

model = SetFitModel.from_pretrained(
    'sentence-transformers/paraphrase-mpnet-base-v2',
    use_differentiable_head=True,
    head_params={"out_features": len(intents)},
)

model.labels = df['label'].unique().tolist()

timestamp = datetime.now().strftime("%Y%m%d%H%M%S")

args = TrainingArguments(
    logging_steps=10,
    run_name=timestamp,
    eval_steps=50,
    save_steps=50,
    batch_size=64,
    load_best_model_at_end=True
)
# add_early_stopping is a flag defined elsewhere in the class
callbacks = [EarlyStoppingCallback(early_stopping_patience=5)] \
    if add_early_stopping else []

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=callbacks
)

trainer.train()
tomaarsen commented 10 months ago

Hello!

Thanks for raising this - I haven't seen this issue before. SetFit trains in 2 phases (docs), and the first phase (embedding finetuning) indeed succeeded as shown by your logs. The crash seems to occur when starting the second phase (classifier training).

The error is a bit surprising to me. My personal recommendation is to use the Logistic Regression classification head instead of the differentiable classification head; I've generally had better performance, training speed, and reliability with that head.
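
That said, the assertion text itself (t >= 0 && t < n_classes) is the one CUDA raises when a target index passed to the loss falls outside the range of the head's outputs. A minimal standalone illustration of the failure mode (not your code, just the mechanism):

import torch
import torch.nn.functional as F

# A head with 30 outputs accepts targets in [0, 30)...
logits = torch.randn(8, 30, device='cuda')
# ...but one target here is 30, which is out of range and triggers the
# same device-side assert in nll_loss_forward_reduce_cuda_kernel_2d
targets = torch.tensor([0, 5, 12, 29, 3, 7, 30, 1], device='cuda')
loss = F.cross_entropy(logits, targets)

So it may also be worth double-checking that the out_features you pass (len(intents) in your snippet) matches the number of distinct labels, and that the label column contains 0-indexed integers.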

The only change you would have to make is to update the model initialisation to remove the use_differentiable_head and head_params parameters:

model = SetFitModel.from_pretrained('sentence-transformers/paraphrase-mpnet-base-v2')

Feel free to run this on a heavily truncated dataset (train_dataset=train_dataset.select(range(200))) to get a relatively quick idea of whether this works as intended.
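
For example, a sketch of that quick check, reusing the args, train_dataset, and eval_dataset variables from your snippet:

model = SetFitModel.from_pretrained('sentence-transformers/paraphrase-mpnet-base-v2')

quick_trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset.select(range(200)),  # small slice for a fast smoke test
    eval_dataset=eval_dataset,
)
quick_trainer.train()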

aaronbriel commented 10 months ago

I retrained the model after removing the use_differentiable_head and head_params parameters, and there were no errors. Thank you so much!