triton-inference-server / dali_backend

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
MIT License

DALI backend not releasing device memory #165

Closed - appearancefnp closed this issue 7 months ago

appearancefnp commented 1 year ago

Hello!

I was excited about the 16-bit TIFF decoding, but there is a bug: the DALI backend does not release device memory when a model is unloaded. Even when you unload the DALI model and load it again, it consumes more memory than before. The usage eventually converges to a fixed number, but it is very high: about 7 GB for batch size 3.

[image: GPU memory usage over repeated DALI model load/unload cycles]

In the image you can see that when a DALI model is loaded, the memory usage increases, which is fine. The dip is where the model is unloaded; the memory is not released back to the initial minimum. When the model is loaded again, the memory increases further. I've repeated this model "reloading" multiple times, and after some time the memory growth stops.

DALI version: 1.22.0dev, but I think the problem exists in older versions too.

from nvidia.dali import pipeline_def
import nvidia.dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types

def set_normalization_value(pixel_value):
    # Pick the normalization denominator per sample: 65535 if the image
    # contains values above 255 (16-bit data), otherwise 255 (8-bit data).
    condition = pixel_value > 255.0
    neg_condition = condition ^ True
    return condition * 65535.0 + neg_condition * 255.0

@pipeline_def(
    batch_size=3,
    num_threads=1,
    device_id=0,
    output_dtype=[types.FLOAT],
    output_ndim=[4],  # Dimensions of image, not including batch dimension
)
def decode_pipeline():
    images = fn.external_source(device="cpu", name="input_0", dtype=types.UINT8, ndim=1)
    images_2 = fn.external_source(
        device="cpu", name="input_1", dtype=types.UINT8, ndim=1
    )
    images_3 = fn.external_source(
        device="cpu", name="input_2", dtype=types.UINT8, ndim=1
    )
    images = fn.experimental.decoders.image(
        images,
        device="mixed",
        dtype=types.UINT16,
    )
    images_2 = fn.experimental.decoders.image(
        images_2, device="mixed", dtype=types.UINT16
    )
    images_3 = fn.experimental.decoders.image(
        images_3, device="mixed", dtype=types.UINT16
    )

    images = fn.transpose(images, perm=[2, 0, 1])
    images_2 = fn.transpose(images_2, perm=[2, 0, 1])
    images_3 = fn.transpose(images_3, perm=[2, 0, 1])
    images = fn.cast([images], dtype=types.FLOAT)
    images_2 = fn.cast([images_2], dtype=types.FLOAT)
    images_3 = fn.cast([images_3], dtype=types.FLOAT)
    image_max_value = fn.reductions.max(images)
    normalization_value = set_normalization_value(image_max_value)
    images /= normalization_value
    images_2 /= normalization_value
    images_3 /= normalization_value
    images = fn.stack(images, images_2, images_3, axis=0)
    return images

pipe = decode_pipeline()
pipe.serialize(filename="model.dali")

This is how the pipeline was generated. I don't think the pipeline itself is the problem; rather, the DALI backend is not cleaning up the memory.
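For what it's worth, the pipeline can also be sanity-checked outside of Triton by feeding the named external_source inputs directly. This is only a sketch; "sample.tiff" is a placeholder file, and the feeding pattern mirrors what the backend does per request.

import numpy as np

pipe = decode_pipeline()
pipe.build()

# A batch of 3 encoded buffers (1-D uint8 arrays) for each named input.
encoded_batch = [np.fromfile("sample.tiff", dtype=np.uint8)] * 3  # placeholder file
for name in ("input_0", "input_1", "input_2"):
    pipe.feed_input(name, encoded_batch)

(output,) = pipe.run()
print(output.as_cpu().as_array().shape)  # (batch, 3, C, H, W) after fn.stack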

To reproduce this problem:

  1. Launch Triton Inference Server with the DALI nightly backend
  2. Load the DALI model explicitly
  3. Make an inference request to the said DALI model
  4. Unload the model
  5. Repeat steps 2.-4.

Is there a way to limit the memory growth (because 7 GB over the baseline is too much) or fix this issue? I want to decode 3x5000x10000x3x2 (size of uint16) images - that should be around 900 MB of raw data. A minimal client sketch of the load/infer/unload loop is included below.
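Here is a minimal client-side sketch of that loop. It assumes Triton was started with --model-control-mode=explicit and that the DALI model is deployed under the hypothetical name "decode_pipeline"; the input names match the external_source names in the pipeline above.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One encoded TIFF per input, sent as a 1-D UINT8 buffer with a batch dimension of 1.
encoded = np.fromfile("sample.tiff", dtype=np.uint8)[None, :]  # hypothetical file

for _ in range(10):
    client.load_model("decode_pipeline")           # step 2: explicit load
    inputs = []
    for name in ("input_0", "input_1", "input_2"):
        inp = httpclient.InferInput(name, list(encoded.shape), "UINT8")
        inp.set_data_from_numpy(encoded)
        inputs.append(inp)
    client.infer("decode_pipeline", inputs)        # step 3: inference request
    client.unload_model("decode_pipeline")         # step 4: unload the model
    # GPU memory can be inspected here (e.g. with nvidia-smi) between iterations.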

szalpal commented 1 year ago

Hi @appearancefnp ,

thanks for reaching out. I'm not sure I understand correctly what you'd like to do, but are images, images_2 and images_3 supposed to form a single batch?

In DALI, batches are implicit. That means that a DALI pipeline like this:

@pipeline_def(
    batch_size=3,
    num_threads=1,
    device_id=0,
    output_dtype=[types.FLOAT],
    output_ndim=[4],  # Dimensions of image, not including batch dimension
)
def decode_pipeline():
    images = fn.external_source(device="cpu", name="input_0", dtype=types.UINT8, ndim=1)
    images = fn.experimental.decoders.image(
        images,
        device="mixed",
        dtype=types.UINT16,
    )

    images = fn.transpose(images, perm=[2, 0, 1])
    images = fn.cast([images], dtype=types.FLOAT)
    image_max_value = fn.reductions.max(images)
    normalization_value = set_normalization_value(image_max_value)
    images /= normalization_value
    return images

already works on a batch of 3 images. Having images_2 and images_3 very likely increases the memory consumption and is not necessary for batch processing.

Also, please correct me if I'm wrong, but the TIFFs you're working with - 3x5000x10000x3x2 (size of uint16) - sum up to about 1.8GB of data per batch (I assumed that the 3 at the beginning of the shape is the batch dimension). If so, after adding some extra memory for fn.transpose, the amount of memory looks legitimate. If you remove images_2 and images_3 from the pipeline, the memory usage should drop to about 2.3GB.
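Spelling out the arithmetic behind these figures (roughly; the exact accounting depends on which intermediate buffers are counted):

# Rough accounting for a single decoded input of shape 3x5000x10000x3:
batch, h, w, c = 3, 5000, 10000, 3
uint16_bytes = batch * h * w * c * 2   # decoded uint16 batch: ~0.9 GB
float_bytes = batch * h * w * c * 4    # the same batch after fn.cast to FLOAT: ~1.8 GB
print(uint16_bytes / 1e9, float_bytes / 1e9)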

Lastly, about the loading/unloading and memory consumption: DALI uses a lazy-allocation model. When the DALI pipeline is fed with data, DALI tries to handle the input with the memory it has already allocated; if that is not enough, it allocates more. Naturally, this process grows asymptotically and plateaus at the amount of memory required to handle the biggest possible batch. For example, if my dataset contains images of various sizes but the largest one is 1920x1080x3 (uint8), then for batch_size=7 a simple DALI decoding pipeline will plateau at about 43MB.
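The toy example works out as follows:

# Plateau estimate for the example above:
# largest sample 1920x1080x3 (uint8, 1 byte per element), batch_size = 7
plateau_bytes = 1920 * 1080 * 3 * 1 * 7
print(plateau_bytes / 1e6)  # ~43.5, i.e. about 43 MB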

Unloading DALI correctly frees the allocated memory, but as an optimization, a given DALI pipeline will, when loaded again, allocate the same amount of memory it freed before. Since allocations are among the most expensive operations, this avoids repeating the warmup phase after unloading and reloading a DALI pipeline. Is that OK for you, or would your use case require starting the warmup from scratch?

Hopefully my explanation here was clear. If you have any other questions, don't hesitate to ask :)

appearancefnp commented 1 year ago

Thanks for the reply!

I know the pipeline looks weird, but my model input consists of three RGB images; it is a strange way to do it, but that is how it is currently designed. That is not the problem I want to address right now, though.

The problem is that after unloading the models, I want the memory to be freed.

[image: GPU memory usage staying high after the DALI models are unloaded]

After unloading the DALI models, Triton Inference Server keeps the memory and does not release it. If I could unload it completely and load it again, that would be great. Model warmup is not a problem :) The GPU memory is the problem! For my use case, I want to free the GPU memory from the inference server and use it elsewhere. So yes, I want the warmup from scratch :)

szalpal commented 1 year ago

I see. This actually might be a bug; I'd need to check it out. Thanks for reporting it and for the repro. I'll be posting status updates here.

Cheers!

szalpal commented 1 year ago

I did some more research on this topic. Generally, it's not a bug, it's a feature.

Our intention in DALI was to virtually never deallocate GPU memory (it is freed only when the process exits). The reason is that we keep a pool of GPU memory shared by all DALI Pipelines in a given process, and creating subsequent Pipeline objects is much cheaper when the memory is already allocated. The peaks marked by arrows in the image above are the unwanted part of this behaviour - they come from external libraries whose memory allocations we cannot control.

That being said, I believe the use case presented above is a valid one and a legitimate reason to introduce the possibility to actually free the GPU memory. We'll introduce such a possibility in DALI and DALI Backend. I'll be posting status updates here.

appearancefnp commented 1 year ago

@szalpal Thanks for the updates! I know reallocating GPU memory is expensive in terms of time, but if it's an optional configuration setting, that would be great!

Cheers!

nrgsy commented 7 months ago

That being said, I believe the use case presented above is a valid one and a legitimate reason to introduce the possibility to actually free the GPU memory. We'll introduce such a possibility in DALI and DALI Backend. I'll be posting status updates here.

@szalpal, curious whether this feature ever got added? My team ran into this issue recently and thought it was a bug. We were creating and destroying CUDA shared memory regions many times sequentially in the same process and saw GPU memory usage increase until we ran out of memory. This did not happen before our switch to dali_backend (we use dali_backend for image preprocessing, which was previously done before writing the image to shared GPU memory). Our proposed fix is to avoid creating and destroying shared memory many times in the same process, but it would be good to know whether there is a way to avoid the growing memory usage and instead warm up from scratch.
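For context, the create/destroy cycle we were running looks roughly like this (a sketch with hypothetical region names and sizes, using the tritonclient CUDA shared-memory utilities):

import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")
byte_size = 5000 * 10000 * 3 * 2  # hypothetical size of one uint16 image

for i in range(100):
    # Create a CUDA shared memory region and register it with the server ...
    handle = cudashm.create_shared_memory_region("input_region", byte_size, 0)
    client.register_cuda_shared_memory(
        "input_region", cudashm.get_raw_handle(handle), 0, byte_size
    )

    # ... run inference against the DALI model here ...

    # ... then unregister and destroy the region again. Repeating this cycle is
    # where we saw GPU memory usage keep growing.
    client.unregister_cuda_shared_memory("input_region")
    cudashm.destroy_shared_memory_region(handle)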

szalpal commented 7 months ago

@nrgsy ,

We did not add it to DALI Backend; however, I believe the required functionality already exists in DALI, so I'll create a PR adding it. Thank you for bringing attention to this.
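If I remember the name correctly, the relevant hook on the DALI side is the memory-pool release function in nvidia.dali.backend (treat the exact entry point as an assumption and check the docs for the DALI version you run):

import nvidia.dali.backend as dali_backend

# Assumed API: ask DALI to return memory that its pools hold but are not
# currently using. Verify the function name against your DALI version.
dali_backend.ReleaseUnusedMemory()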

szalpal commented 7 months ago

@nrgsy , @appearancefnp ,

The PR is merged. You can expect the feature in the next Triton release.