Dispatch Error When Using Quantisation

Description

I am seeing a dispatch error when using a 4-bit quantised model. It occurs when instantiating a LanguageModel from an already-loaded transformers model in 4-bit. Note also that the 4-bit weights should only live on the GPU and cannot be moved to the CPU.
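Since the 4-bit weights are expected to stay on the GPU, it may help to confirm where the parameters actually sit before wrapping the model. The helper below is a minimal sketch and is not part of nnsight or transformers: `summarize_devices` is a hypothetical name, and it only assumes that the object exposes `named_parameters()` the way a `torch.nn.Module` does. Whether device placement is the actual cause of the error is not established in this report.

```python
from collections import Counter


def summarize_devices(model):
    """Count parameters per device for any object exposing named_parameters().

    Hypothetical diagnostic helper; works with torch.nn.Module instances
    because each parameter carries a .device attribute.
    """
    counts = Counter()
    for _name, param in model.named_parameters():
        # str() turns a torch.device such as device('cuda:0') into 'cuda:0'.
        counts[str(param.device)] += 1
    return dict(counts)
```

Calling `summarize_devices(model)` on the `model` object from the failing example, before passing it to `LanguageModel`, would show whether any parameter ended up on a device other than the GPU.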

Working Example

from nnsight import LanguageModel

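# Let nnsight load the 4-bit model itself from the model name.
nnsight_model = LanguageModel("gpt2", device_map="auto", load_in_4bit=True)
with nnsight_model.trace('The Eiffel Tower is in the city of') as tracer:
    # Save a copy of the first block's MLP activation output.
    hidden_states = nnsight_model.transformer.h[0].mlp.act.output[0].clone().save()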

Failing Example

from nnsight import LanguageModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 4-bit model with transformers first.
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto", load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# Wrap the already-loaded model in nnsight; the trace below raises the dispatch error.
nnsight_model = LanguageModel(model, tokenizer=tokenizer)
with nnsight_model.trace('The Eiffel Tower is in the city of') as tracer:
    hidden_states = nnsight_model.transformer.h[0].mlp.act.output[0].clone().save()

Info

nnsight 0.2.11
torch 2.2.1+cu121
transformers 4.38.2
accelerate 0.29.1
bitsandbytes 0.43.0
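
For anyone trying to reproduce this, the version list above can be regenerated with the standard library. This is a small sketch: `collect_versions` and `PACKAGES` are illustrative names, and the package list simply mirrors the one in this report.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the Info list in this report.
PACKAGES = ["nnsight", "torch", "transformers", "accelerate", "bitsandbytes"]


def collect_versions(packages):
    """Return {package: installed version}, or 'not installed' if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Print one "name version" pair per line, matching the format above.
    for name, ver in collect_versions(PACKAGES).items():
        print(f"{name} {ver}")
```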

The Error

The error can be found in this illustrative notebook: https://colab.research.google.com/drive/1n9A7MF8JE2lf26e9gOXRi2HaDjl4DjgX?usp=sharing

ndif-team / nnsight

Dispatch Error When Using Quantisation #106