philschmid / deep-learning-pytorch-huggingface


Error when finetuning Flan-T5-XXL on custom dataset #10

Open ngun7 opened 1 year ago

ngun7 commented 1 year ago

I'm trying to fine-tune flan-t5-xxl on a custom QA task; thanks for the detailed PEFT article. However, I'm running into this error:

```
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
RuntimeError: Caught RuntimeError in replica 0 on device 0.
RuntimeError: mat1 and mat2 shapes cannot be multiplied (56x44 and 56x4096)
```

The mat1/mat2 shapes in the error are different each time I run trainer.train().

To rule out an issue with my custom dataset, I ran your notebook without changing any code, and it still failed with the matmul error above. Sometimes I get:

```
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
```

Apologies, I searched online for a fix but had no luck.

Versions: transformers==4.27.1, datasets==2.9.0, accelerate==0.17.1, evaluate==0.4.0, bitsandbytes==0.37.1. SageMaker notebook instance: ml.g5.24xlarge
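
A minimal sketch of one common workaround, assuming the errors come from the Trainer silently wrapping the model in torch.nn.DataParallel across the four A10G GPUs of the g5.24xlarge; the model id and LoRA settings below follow the blog post but are otherwise placeholders:

```python
import os

# Assumption: the shape-mismatch / cuda:1 errors stem from implicit DataParallel
# across the four GPUs. Restricting the process to a single GPU (before importing
# torch) is one common workaround.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

model_id = "google/flan-t5-xxl"  # placeholder; use the checkpoint from the article

# Load the model in 8-bit and let accelerate place the weights on the visible GPU.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare for int8 training and wrap with LoRA, as in the blog post.
model = prepare_model_for_int8_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q", "v"],
    lora_dropout=0.05, bias="none", task_type=TaskType.SEQ_2_SEQ_LM,
)
model = get_peft_model(model, lora_config)
```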

ngun7 commented 1 year ago

NVM, I restarted the notebook and it works now 😅 Quick question though: it currently takes about 25 hours to train 3 epochs on my custom 21k QA pairs on a g5.24xlarge. Is there any way to speed this up?

Again, thanks for all your articles on flan-t5 @philschmid 🙌
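
On the training-time question, a hedged sketch of the usual knobs, assuming the LoRA/int8 setup from the article: raise the per-device batch size until the 24 GB A10G is nearly full, shorten the max source/target lengths if the data allows it, and keep the effective batch size up with gradient accumulation. The values below are illustrative, not tuned:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative values only; tune batch size and sequence lengths to your data and GPU memory.
training_args = Seq2SeqTrainingArguments(
    output_dir="lora-flan-t5-xxl",        # placeholder output path
    per_device_train_batch_size=8,        # raise until the 24 GB A10G is close to full
    gradient_accumulation_steps=2,        # keeps the effective batch size large
    learning_rate=1e-3,
    num_train_epochs=3,
    logging_steps=100,
    save_strategy="no",
)
```

Spreading the run over all four GPUs with `torchrun --nproc_per_node 4` is another option, though combining DDP with `load_in_8bit` and `device_map="auto"` typically requires placing the model on each process's local rank rather than letting accelerate shard it.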