CUDA Error When Running Batch Inference with OpenLLama Model

I'm attempting to evaluate an OpenLlama model on a test dataset. When I use single element inference, it's considerably slow, so I'm trying to utilize batching for efficiency. However, during batch inference, I'm encountering a CUDA error. Error Message

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [277,0,0], thread: [125,0,0] Assertion 'srcIndex < srcSelectDimSize' failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [277,0,0], thread: [126,0,0] Assertion 'srcIndex < srcSelectDimSize' failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [277,0,0], thread: [127,0,0] Assertion 'srcIndex < srcSelectDimSize' failed.
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with 'TORCH_USE_CUDA_DSA' to enable device-side assertions.

Code for Batch Inference

from tqdm import tqdm

def make_batch_inference(dataset, batch_size=8):
all_out = []

progress_bar = tqdm(range(0, len(dataset), batch_size), desc="Inferencing")

for start_idx in progress_bar:
    end_idx = start_idx + batch_size
    batch_questions = dataset['question'][start_idx:end_idx]

    batch = tokenizer(batch_questions, return_tensors='pt', padding=True, truncation=True, max_length=512)

    with torch.cuda.amp.autocast():
        output_tokens = model.generate(
            input_ids=batch["input_ids"].to("cuda:0"), max_new_tokens=2048
        )

    batch_out = [extract_first_sparql(tokenizer.decode(tokens, skip_special_tokens=True)) for tokens in output_tokens]

    all_out.extend(batch_out)

return all_out

Loading data and dataset

test_data = load_data_from_file("data/kqapro_lcquad_test.json")
test_dataset = Dataset.from_dict(test_data)

results = make_batch_inference(test_dataset)

Additional Information

It's original a "openlm-research/open_llama_7b_v2" but I finetune it using peft. So I load the model using :

from peft import AutoPeftModelForCausalLM
device_map = {"": 0}
model = AutoPeftModelForCausalLM.from_pretrained(os.path.join(
    output_dir, 'saved_model'), device_map=device_map, torch_dtype=torch.bfloat16)
 tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

Any assistance on this issue would be greatly appreciated. Thank you in advance!

openlm-research / open_llama

CUDA Error When Running Batch Inference with OpenLLama Model #93