Open fweckesser opened 1 year ago
This may not be a bug. The following code works for me:
import pytest
import torch
import deepspeed
from deepspeed.accelerator import get_accelerator
from deepspeed.ops.op_builder import InferenceBuilder
from torch.nn import functional as F

if not deepspeed.ops.__compatible_ops__[InferenceBuilder.NAME]:
    pytest.skip("Inference ops are not available on this system", allow_module_level=True)

# Build/load the custom inference kernels and allocate an fp32 workspace.
inference_module = None
if inference_module is None:
    inference_module = InferenceBuilder().load()
inference_module.allocate_workspace_fp32(1, 1, 1, 2, 30, 30, False, 1, 1, 1)

with torch.no_grad():
    input_m = torch.randn(1, 1, 2).cuda()
    weight = torch.Tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]).cuda().transpose(1, 0)
    bias = torch.Tensor([1.0, -1.0, 1.0]).cuda()
    linear_input = torch.Tensor([[[1.0, 2.0]]]).cuda()
    # Reference result from torch.nn.functional.linear ...
    demo_output = F.linear(linear_input, weight, bias)
    # ... versus the DeepSpeed fused linear kernel.
    fused_output = inference_module.linear_layer_fp32(
        linear_input.transpose(2, 1).reshape([1, 1, 2]),
        weight.transpose(1, 0), bias, True, False, 1, False)
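As a quick sanity check (not part of the snippet above), the fused result can be compared against the PyTorch reference, assuming both come back with the same (1, 1, 3) shape:

    # Optional check: the fused kernel output should match torch.nn.functional.linear.
    print(demo_output)
    print(fused_output)
    print(torch.allclose(demo_output, fused_output, atol=1e-5))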
Thanks. I will try it under load. Single ad hoc requests do not exhibit the behavior.
Another person here facing the same issue, while running deepspeed-inference on multiple GPUs.
@fweckesser I just ran into an issue that was similar to yours and I was able to fix it in #4384. I'm not 100% certain this is the exact same problem you are facing, but it looks like you are loading 2 models in a single process with DS-Inference:
I currently run 4 workers and each worker loads two models to the GPU
I was also attempting to do this and encountered:
RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.
If this does not fix the problem for you, please share a reproducer that I can use to debug. The same goes for @jianyu-cs - if you have a script to reproduce the error, I will debug it. Thanks
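For anyone else landing here, a minimal sketch of the two-models-in-one-process pattern being described (model choice and variable names are illustrative, not taken from the original report):

    import torch
    import deepspeed
    from transformers import AutoModelForSequenceClassification

    # Two independent models injected with DS-Inference kernels in the same process.
    model_a = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
    model_b = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

    engine_a = deepspeed.init_inference(model=model_a, mp_size=1, dtype=torch.half,
                                        replace_with_kernel_inject=True)
    engine_b = deepspeed.init_inference(model=model_b, mp_size=1, dtype=torch.half,
                                        replace_with_kernel_inject=True)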
I am trying to deploy a Roberta Base model using Deepspeed on an AWS p3.2xlarge. I am getting great performance from it when it works, but there does seem to be an intermittent issue causing inference failures.
It is a Roberta Base inference model running on AWS p3.2xlarges (single V100 GPU). This is a Docker-container-based service using the newest torchserve, built to run on Ubuntu 20.04.4. I currently run 4 workers and each worker loads two models onto the GPU. The GPU has 16 GB of memory, of which the service uses a steady 8 GB across all workers and models.
The drivers for both the host and container are NVIDIA Driver Version 470.57.02, with CUDA Version 11.4 on the host and 11.7 in the container. The container is using Python 3.8.10, deepspeed==0.8.3, torch==1.13.1, transformers==4.15.0. Everything went well through DEV and QA load testing but failed when I took it to prod on a much larger set of instances. I spun up 10 p3.2xlarge instances to diagnose the issue I had in the prod release. Five of them performed as expected, but the other five all had the intermittent problem I saw in production. The following shows the error I am seeing from the DeepSpeed-generated exception.
RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.
Here is the entire stack trace.
Traceback (most recent call last):
  File "/opt/enrichment/ffnsa/ffnsa/service.py", line 277, in call
    results = self.run_mention(spans, bad_spans, topics, segments, category)
  File "/opt/enrichment/ffnsa/ffnsa/service.py", line 253, in run_mention
    self.predict_mention(related_df, related_ds)
  File "/opt/enrichment/ffnsa/ffnsa/service.py", line 391, in predict_mention
    outputs = self.mention_model(batch)
  File "/home/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 562, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/home/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/venv/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 1203, in forward
    outputs = self.roberta(
  File "/home/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/venv/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 851, in forward
    encoder_outputs = self.encoder(
  File "/home/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/venv/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 526, in forward
    layer_outputs = layer_module(
  File "/home/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/venv/lib/python3.8/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 157, in forward
    self.attention(input,
  File "/home/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/venv/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 110, in forward
    qkv_out = self.linear_func(input=input,
  File "/home/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/venv/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/op_binding/linear.py", line 25, in forward
    qkv_out = self.linear_func(input,
RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.
Two things of note. One is that, as mentioned, only half of the p3s, all from the same AMI, exhibited the problem. The second is that, after monitoring the logs for a while, I noticed that when the exception was generated, it was generated at very precise 5-second intervals, almost as if some kind of scheduled task (garbage collection?) was occurring. Below are my log entries for the exceptions. I tried varying the load from the maximum 36 inferences per second down to just one request per second, and the 5-second-interval exceptions were unchanged.
2023-04-02T01:35:55,063 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:54358 "POST /predictions/financialTopics HTTP/1.1" 500 6825
2023-04-02T01:36:00,106 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:56166 "POST /predictions/financialTopics HTTP/1.1" 500 6234
2023-04-02T01:36:05,378 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:42296 "POST /predictions/financialTopics HTTP/1.1" 500 5659
2023-04-02T01:36:10,426 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:38832 "POST /predictions/financialTopics HTTP/1.1" 500 5500
2023-04-02T01:36:15,466 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:40316 "POST /predictions/financialTopics HTTP/1.1" 500 5405
2023-04-02T01:36:20,412 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:42098 "POST /predictions/financialTopics HTTP/1.1" 500 5869
2023-04-02T01:36:25,496 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:45188 "POST /predictions/financialTopics HTTP/1.1" 500 6246
2023-04-02T01:36:30,477 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:46608 "POST /predictions/financialTopics HTTP/1.1" 500 6996
2023-04-02T01:36:35,329 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:59924 "POST /predictions/financialTopics HTTP/1.1" 500 5470
2023-04-02T01:36:41,501 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:51444 "POST /predictions/financialTopics HTTP/1.1" 500 6672
2023-04-02T01:36:46,311 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:53468 "POST /predictions/financialTopics HTTP/1.1" 500 5997
2023-04-02T01:36:51,114 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:54446 "POST /predictions/financialTopics HTTP/1.1" 500 6565
2023-04-02T01:36:56,091 [INFO ] W-9000-financialTopics_2.0 ACCESS_LOG - /172.25.3.151:53922 "POST /predictions/financialTopics HTTP/1.1" 500 5861
To Reproduce
Steps to reproduce the behavior: Using the configuration described above, run a load of one or more inferences per second. The input is just 2-3 paragraphs of text per inference. The model is https://huggingface.co/roberta-base. We fine-tune it as an AutoModelForSequenceClassification.
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=path, num_labels=3)
I load both models using the following DeepSpeed init:
model = deepspeed.init_inference(model=model, mp_size=1, dtype=torch.half, replace_with_kernel_inject=True)
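At inference time the handler simply calls the wrapped model on a batch (see self.mention_model(batch) in the stack trace). A rough sketch of that call, where the tokenizer usage is illustrative rather than copied from my handler:

    import torch
    from transformers import AutoTokenizer

    # Illustrative only: tokenize a request and run it through the DS-Inference engine.
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    batch = tokenizer(["two to three paragraphs of text per inference ..."],
                      return_tensors="pt", truncation=True, padding=True).to("cuda")
    with torch.no_grad():
        outputs = model(**batch)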
As mentioned above, the error does not show up every time, but I have found that setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 greatly increases the likelihood it will happen.
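For reference, that allocator setting has to be in the environment before the first CUDA allocation; one way to apply it from Python (a sketch, not my exact service code):

    import os

    # Must be set before the first CUDA allocation so the caching allocator picks it up.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

    import torch  # imported after the environment variable is set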
Expected behavior
I do not expect to see RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/venv/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/venv/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.4
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
I am unable to share the Docker image.
Let me know if you need anything else.
Thanks