DanielProkhorov opened 5 months ago
When I load the model in 4-bit:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "cckevinn/SeeClick",
    device_map="auto",
    trust_remote_code=True,
    load_in_4bit=True,
    do_sample=True,
    temperature=1e-3,
).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
```
I get the following error during inference:
```
RuntimeError: Input type (torch.cuda.ByteTensor) and weight type (torch.cuda.HalfTensor) should be the same
```
Our fine-tuning was done in bf16, so it may be worth checking the Qwen-VL repository for how quantization is handled there.
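If it helps, here is a minimal sketch (untested with SeeClick) of one way to try 4-bit loading with an explicit `BitsAndBytesConfig`, so that the quantized layers compute in bf16 (matching the bf16 fine-tuning) and the vision tower is left unquantized. The module name passed to `llm_int8_skip_modules` is an assumption based on Qwen-VL's layout and may need adjusting.

```python
# Sketch only: quantize to 4-bit, compute in bf16, and skip quantizing the vision tower.
# "transformer.visual" is the assumed name of Qwen-VL's visual encoder (unverified).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers.generation import GenerationConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,         # match the bf16 fine-tuning
    llm_int8_skip_modules=["transformer.visual"],  # assumed module name; keep the vision tower unquantized
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "cckevinn/SeeClick",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
```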