RuntimeError with Tensors on Different Devices When Using Outlines

Describe the issue as clearly as possible:

I've encountered an issue when attempting to use the outlines library with a model downloaded from the Hugging Face Hub (turboderp/Llama-3-8B-Instruct-exl2) and specifying the CUDA device to use. Despite setting the model to run on device=1, it seems that some operations are still trying to access tensors on device=0, leading to a runtime error.

Steps/code to reproduce the bug:

import outlines
from huggingface_hub import snapshot_download

model_name = "turboderp/Llama-3-8B-Instruct-exl2"
revision = "3.0bpw"
model_directory = snapshot_download(repo_id=model_name, revision=revision, local_dir="llama3")

model = outlines.models.exl2(model_directory, device=1)

react_prompt = """
Question: How do you cook a sunny side egg?
FORMAT:
Strictly use the following format:
Thought: [insert thought]
Action: [Steps to follow]"""
generator = outlines.generate.text(model)
output = generator(react_prompt, stop_at="Action: ")
print(output)

Expected result:

Generation stops at "Action: " without raising a RuntimeError.

Error message:

RuntimeError                              Traceback (most recent call last)
Cell In[6], line 17
     10 react_prompt = """
     11 Question: How do you cook a sunny side egg?
     12 FORMAT:
     13 Strictly use the following format:
     14 Thought: [insert thought]
     15 Action: [Steps to follow]"""
     16 generator = outlines.generate.text(model)
---> 17 output = generator(react_prompt, stop_at="Action: ")
     18 print(output)

File ~/anaconda3/envs/ve-m/lib/python3.10/site-packages/outlines/generate/api.py:207, in SequenceGenerator.__call__(self, prompts, max_tokens, stop_at, rng)
    205 while True:
    206     try:
--> 207         last_state = next(states)
    208         if max_tokens or stop_sequences:
    209             token_ids = last_state.token_ids

File ~/anaconda3/envs/ve-m/lib/python3.10/site-packages/outlines/generate/generator.py:82, in sequence_generator(model, sampler, fsms, token_ids, sequence_weights, attention_masks, fsm_states, rng)
     80 allowed_tokens = get_allowed_tokens(fsms, fsm_states)
     81 biased_logits = bias_logits(logits, allowed_tokens)
---> 82 next_token_ids, ancestors, sequence_weights = sampler(
     83     biased_logits, sequence_weights, rng
     84 )
     86 token_ids = update_token_ids(token_ids, next_token_ids, ancestors)
     87 attention_masks = update_attention_masks(attention_masks, ancestors)

File ~/anaconda3/envs/ve-m/lib/python3.10/site-packages/outlines/samplers.py:160, in MultinomialSampler.__call__(self, next_token_logits, sequence_weights, rng)
    156 logprobs = torch.nn.functional.log_softmax(altered_next_token_logits, dim=-1)
    157 ancestors = torch.arange(
    158     altered_next_token_logits.shape[0], device=next_token_logits.device
    159 )
--> 160 weights = sequence_weights + torch.gather(logprobs, 1, next_token_ids).squeeze()
    162 return next_token_ids, ancestors, weights

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
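The traceback points at the addition in MultinomialSampler.__call__, where sequence_weights and the gathered logprobs end up on different devices. I have not confirmed which of the two tensors is on cuda:0. The sketch below is only an illustration of the mismatch and of one way to avoid it, not the outlines implementation: add_on_same_device is a hypothetical helper, and the tensors are made-up stand-ins. It falls back to CPU when fewer than two GPUs are present, so it runs anywhere PyTorch is installed.

```python
import torch


def add_on_same_device(sequence_weights, logprobs, next_token_ids):
    """Hypothetical helper: add gathered log-probabilities on logprobs' device."""
    # Move the running weights to the device that holds the log-probabilities,
    # so the addition never mixes devices (e.g. cuda:0 and cuda:1).
    sequence_weights = sequence_weights.to(logprobs.device)
    # squeeze(-1) drops only the gathered column, keeping the batch dimension.
    gathered = torch.gather(logprobs, 1, next_token_ids).squeeze(-1)
    return sequence_weights + gathered


# Use the second GPU when there is one; otherwise run the sketch on CPU.
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

# Stand-in tensors: 2 sequences, vocabulary of 5 tokens.
logprobs = torch.log_softmax(torch.randn(2, 5, device=device), dim=-1)
next_token_ids = torch.tensor([[3], [0]], device=device)
sequence_weights = torch.zeros(2)  # created without a device argument, as an assumption

weights = add_on_same_device(sequence_weights, logprobs, next_token_ids)
print(weights.device, weights.shape)
```

On a multi-GPU machine, replacing the helper call with a plain `sequence_weights + torch.gather(...)` should raise a device-mismatch error like the one above, because sequence_weights was never moved to the selected device.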

Outlines/Python version information:


python -c "from outlines import _version; print(_version.version)"
0.0.43.dev11+g78852b0

python -c "import sys; print('Python', sys.version)"
Python 3.10.13 (main, Sep 11 2023, 13:21:10) [GCC 11.2.0]

Context for the issue:

I want to be able to specify which GPU the model runs on, so that I can use the other GPUs in the server instead of always using device 0.
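A possible stopgap, which I have not verified with the exl2 backend, is to restrict the process to one physical GPU through the CUDA_VISIBLE_DEVICES environment variable. The CUDA runtime then exposes that GPU as device 0, so every tensor lands on the same card regardless of which default is used. pin_visible_gpu below is a hypothetical helper, not part of outlines, and it has to run before anything initializes CUDA.

```python
import os


def pin_visible_gpu(physical_index: int) -> None:
    """Hypothetical helper: expose only one physical GPU to this process."""
    # Must be set before CUDA is initialized (i.e. before torch touches the GPU),
    # otherwise the setting has no effect on the already-created context.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(physical_index)


# Expose physical GPU 1 only; inside this process it is then addressed as device 0.
pin_visible_gpu(1)

# Import outlines and load the model only after this point, using device 0
# (assumption: the device argument is otherwise used as in the report above).
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

This works around the symptom only; passing device=1 directly should still be supported.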
