[BUG] The specified pointer resides on host memory and is not registered with any CUDA device.

Describe the bug For some reason the following code gives me the the error: RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device. The first call (query encoding) goes well, but the second one (first iteration of dataloader, encoded = model_ds(**batch_input)) fails.

To Reproduce

import torch

from transformers import AutoModel, BertTokenizerFast, DataCollatorWithPadding
from datasets import Dataset
import datasets
datasets.disable_progress_bars()
from copy import deepcopy
import deepspeed

query = 'Is Creole a pidgin of French?'

texts = ['Krio is an English-based creole from which descend Nigerian Pidgin English and Cameroonian Pidgin English and Pichinglis. It is also similar to English-based creole languages spoken in the Americas, especially the Gullah language, Jamaican Patois (Jamaican Creole), and Bajan Creole but it has its own distinctive character. It also shares some linguistic similarities with non-English creoles, such as the French-based creole languages in the Caribbean.',
 'Mauritian Creole, which is spoken by an estimated 90% of the population, is considered to be the native language of the country and is used most often in informal settings. It was developed in the 18th century by slaves who used a pidgin language to communicate with each other as well as with their French masters, who did not understand the various African languages. The pidgin evolved with later generations to become a casual language. Mauritian Creole is a French-based creole due to its close ties with French pronunciation and vocabulary.',
 'Louisiana Creole is a contact language that arose in the 18th century from interactions between speakers of the lexifier language of Standard French and several substrate or adstrate languages from Africa. Prior to its establishment as a Creole, the precursor was considered a pidgin language. The social situation that gave rise to the Louisiana Creole language was unique, in that the lexifier language was the language found at the contact site. More often the lexifier is the language that arrives at the contact site belonging to the substrate/adstrate languages. Neither the French, the French-Canadians, nor the African slaves were native to the area; this fact categorizes Louisiana Creole as a contact language that arose between exogenous ethnicities. Once the pidgin tongue was transmitted to the next generation as a "lingua franca" (who were then considered the first native speakers of the new grammar), it could effectively be classified as a creole language.',
 'The first French colonists arrived in Mauritius in 1721 and decided to settle for strategic reasons. Indeed, they used the island as a military and commercial cornerstone because, at this time of the history, the French had in mind to conquer India and expand the spice trade. Mauritius was therefore useful as a port of call since it gave the French the opportunity to obtain fresh supplies and rest from their long journey. Gradually, the French will call in slaves mainly coming from Africa, Madagascar, Mozambique, and India in order to build a large harbour in Mauritius. This interaction between French and African people will then lead to the creation of a pidgin, that is to say, a language that is only used by populations of distinct origins as a means of communication. A "pidgin", by definition, isn’t the mother tongue of any community but rather emerges from the interaction of each language spoken by each distinct community. When a pidgin becomes further used by new generations as a mother tongue, it loses its name of pidgin and begins to be called "Creole". As the consequence of the French colonization, the Creole still spoken nowadays in Mauritius has French roots. Moreover, from this time on, the country will remain well influenced by the French culture, its language and its religion. A great number of Indians will incidentally convert to Christianity. The Indo-Mauritians who adopted a French culture at this point of the history will be considered as Creoles, as opposed to the Indo-Mauritians who will settle in the island much later.',
 'A French creole, or French-based creole language, is a creole language (contact language with native speakers) for which French is the "lexifier". Most often this lexifier is not modern French but rather a 17th-century koiné of French from Paris, the French Atlantic harbors, and the nascent French colonies. French-based creole languages are spoken natively by millions of people worldwide, primarily in the Americas and on archipelagos throughout the Indian Ocean. This article also contains information on French pidgin languages, contact languages that lack native speakers.']

model_path = 'BAAI/bge-base-en-v1.5'
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path,
                                  trust_remote_code=True)
model.eval().to('cuda')

ds_engine = deepspeed.init_inference(deepcopy(model),
                                 tensor_parallel={"tp_size": 1},
                                 dtype=torch.half,
                                 checkpoint=None,
                                 replace_with_kernel_inject=True)
model_ds = ds_engine.module.to('cuda').half()

candidates_dataset = Dataset.from_dict({'text': texts})
candidates_dataset = candidates_dataset.map(
        lambda x: tokenizer(x['text'],
                            padding=False,
                            max_length=512,
                            truncation=True),
        batched=True, remove_columns=['text']
    )
candidates_dataset.set_format(type='torch', columns=['input_ids',
                                                      'attention_mask',
                                                      'token_type_ids'])

tokenized_query = tokenizer(query,
                            padding=False,
                            max_length=512,
                            truncation=True,
                            return_tensors='pt')

dataloader = torch.utils.data.DataLoader(candidates_dataset,
                                          batch_size=8,
                                          collate_fn=DataCollatorWithPadding(tokenizer, return_tensors='pt', padding=True),
                                        pin_memory=True
                                          )

with torch.no_grad():
  query_input = {k: v.to('cuda') for k, v in tokenized_query.items() if k in ["input_ids", "token_type_ids", "attention_mask"]}
  query_output = model_ds(**query_input)

  for batch_texts in dataloader:
      batch_input = {k: v.to('cuda') for k, v in batch_texts.items() if k in ["input_ids", "token_type_ids", "attention_mask"]}
      encoded = model_ds(**batch_input)

Here is the full traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-6-39636e7ddcb2>](https://localhost:8080/#) in <cell line: 23>()
     27   for batch_texts in dataloader:
     28       batch_input = {k: v.to('cuda') for k, v in batch_texts.items() if k in ["input_ids", "token_type_ids", "attention_mask"]}
---> 29       encoded = model_ds(**batch_input)

15 frames
[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
   1533 
   1534     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1539                 or _global_backward_pre_hooks or _global_backward_hooks
   1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
   1542 
   1543         try:

[/usr/local/lib/python3.10/dist-packages/transformers/models/bert/modeling_bert.py](https://localhost:8080/#) in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1135         head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
   1136 
-> 1137         encoder_outputs = self.encoder(
   1138             embedding_output,
   1139             attention_mask=extended_attention_mask,

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
   1533 
   1534     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1539                 or _global_backward_pre_hooks or _global_backward_hooks
   1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
   1542 
   1543         try:

[/usr/local/lib/python3.10/dist-packages/transformers/models/bert/modeling_bert.py](https://localhost:8080/#) in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    688                 )
    689             else:
--> 690                 layer_outputs = layer_module(
    691                     hidden_states,
    692                     attention_mask,

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
   1533 
   1534     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1539                 or _global_backward_pre_hooks or _global_backward_hooks
   1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
   1542 
   1543         try:

[/usr/local/lib/python3.10/dist-packages/deepspeed/model_implementations/transformers/ds_transformer.py](https://localhost:8080/#) in forward(self, input, input_mask, attention_mask, attn_mask, head_mask, layer_past, get_key_value, get_present, encoder_output, enc_dec_attn_mask, x, encoder_hidden_states, encoder_attention_mask, use_cache, alibi, output_attentions, layer_head_mask, past_key_value, **kwargs)
    169         with torch.no_grad():
    170             attention_output, key, value, context_outputtn_ctx, inp_norm = \
--> 171                                      self.attention(input,
    172                                               input_mask,
    173                                               head_mask,

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
   1533 
   1534     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1539                 or _global_backward_pre_hooks or _global_backward_hooks
   1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
   1542 
   1543         try:

[/usr/local/lib/python3.10/dist-packages/deepspeed/ops/transformer/inference/ds_attention.py](https://localhost:8080/#) in forward(self, input, input_mask, head_mask, layer_past, get_present, encoder_hidden_states, encoder_attention_mask, output_attentions, norm_w, norm_b, alibi)
    158                                     beta=norm_b)
    159 
--> 160         context_layer, key_layer, value_layer = self.compute_attention(qkv_out=qkv_out,
    161                                                                        input_mask=input_mask,
    162                                                                        layer_past=layer_past,

[/usr/local/lib/python3.10/dist-packages/deepspeed/ops/transformer/inference/ds_attention.py](https://localhost:8080/#) in compute_attention(self, qkv_out, input_mask, layer_past, alibi)
     99             input_mask = torch.empty(1)
    100 
--> 101         attn_key_value = self.score_context_func(
    102             query_key_value=qkv_out,
    103             attn_mask=((1 - input_mask).to(qkv_out.dtype) *

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531         else:
-> 1532             return self._call_impl(*args, **kwargs)
   1533 
   1534     def _call_impl(self, *args, **kwargs):

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1539                 or _global_backward_pre_hooks or _global_backward_hooks
   1540                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541             return forward_call(*args, **kwargs)
   1542 
   1543         try:

[/usr/local/lib/python3.10/dist-packages/deepspeed/ops/transformer/inference/op_binding/softmax_context.py](https://localhost:8080/#) in forward(self, query_key_value, attn_mask, heads, num_kv, norm_factor, no_masking, layer_id, num_layers, alibi)
     39             alibi = torch.empty(1)
     40 
---> 41         output = self.softmax_context_func(query_key_value, attn_mask, self.config.rotary_dim, self.config.rotate_half,
     42                                            self.config.rotate_every_two, heads, num_kv, norm_factor,
     43                                            self.config.triangular_masking, self.config.local_attention,

RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.

Expected behavior If I am not doing anything wrong with iterators over data, this code should run without errors.

ds_report output

[2024-05-22 21:58:52,731] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.3.0+cu121
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.14.2, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.3, cuda 12.1
shared memory (/dev/shm) size .... 5.68 GB

Screenshots

System info (please complete the following information):

OS: Ubuntu 22.04
GPU count and types: 1 T4 GPU
Torch: 2.3.0+cu121, Transformers: 4.41.0
Python version: 3.10.12
Any other relevant info about your setup: Reproducible in Google Colab

Docker context

Additional context See notebook here: https://colab.research.google.com/drive/1eHj7V8dwhIPJrhvk4qN7oholevQ6n0XP?usp=sharing

Might be similar to

3795
3402
3178

microsoft / DeepSpeed

[BUG] The specified pointer resides on host memory and is not registered with any CUDA device. #5561

Screenshots

Docker context

3795

3402

3178