Closed appearancefnp closed 7 months ago
Hi @appearancefnp ,
thanks for reaching out. I'm not sure I understand correctly what you'd like to do, but are images, images_2 and images_3
supposed to create a batch?
In DALI, batches are implicit. That means that a DALI pipeline like this:
```python
@pipeline_def(
    batch_size=3,
    num_threads=1,
    device_id=0,
    output_dtype=[types.FLOAT],
    output_ndim=[4],  # dimensions of the image, not including the batch dimension
)
def decode_pipeline():
    images = fn.external_source(device="cpu", name="input_0", dtype=types.UINT8, ndim=1)
    images = fn.experimental.decoders.image(
        images,
        device="mixed",
        dtype=types.UINT16,
    )
    images = fn.transpose(images, perm=[2, 0, 1])
    images = fn.cast(images, dtype=types.FLOAT)
    image_max_value = fn.reductions.max(images)
    normalization_value = set_normalization_value(image_max_value)
    images /= normalization_value
    return images
```
already works on a batch of 3 images. Having `images_2` and `images_3` very likely increases memory consumption and is not necessary for batch processing.
Also, please correct me if I'm wrong, but the TIFFs you're working with, `3x5000x10000x3x2 (size of uint16)`, sum up to about `1.8GB` of data per batch (I assumed that the `3` at the beginning of the shape is the batch dimension). If so, after adding some additional memory for the `fn.transpose`, the amount of memory looks legit. If you remove `images_2` and `images_3` from the pipeline, memory consumption should drop to about `2.3GB`.
Lastly, about the loading/unloading/memory consumption. DALI uses a lazy-allocation model: when the DALI pipeline is fed with data, DALI first tries to handle the input with memory it has already allocated, and only allocates more if that is not enough. Naturally, memory usage grows asymptotically and plateaus at the amount required to handle the biggest batch possible. For example, if my dataset contains images of various sizes, but the largest one is `1920x1080x3 (uint8)`, then for `batch_size=7` a simple DALI decoding pipeline will plateau at about `43MB`.
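That estimate can be reproduced with simple arithmetic (a sketch that only counts the raw decoded pixels, ignoring any allocator or metadata overhead):

```python
# Plateau estimate: batch_size=7, largest sample 1920x1080x3, uint8 (1 byte/value).
batch_size = 7
h, w, c = 1080, 1920, 3
plateau_mb = batch_size * h * w * c / 1e6
print(f"~{plateau_mb:.1f} MB")  # prints: ~43.5 MB
```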
Unloading DALI correctly frees the allocated memory, but as an optimization, a DALI pipeline that is loaded again will allocate the same amount of memory it freed before. Since allocations are among the most expensive operations, this helps avoid the warmup phase after unloading/loading a DALI pipeline. Is that OK with you, or would your use case require starting the warmup from scratch?
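The grow-then-plateau behaviour described above can be illustrated with a toy pool (a pure-Python sketch of the idea, not DALI's actual allocator):

```python
class LazyPool:
    """Grow-only pool: allocates only when a request exceeds current capacity."""

    def __init__(self):
        self.capacity = 0  # bytes currently held by the pool

    def request(self, nbytes):
        if nbytes > self.capacity:
            self.capacity = nbytes  # grow; never shrink while the process lives
        return self.capacity

pool = LazyPool()
for batch_bytes in [10, 40, 25, 40, 35]:  # varying batch footprints
    pool.request(batch_bytes)
print(pool.capacity)  # plateaus at the largest batch seen: 40
```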
Hopefully my explanation here was clear. If you have any other questions, don't hesitate to ask :)
Thanks for the reply!
I know the pipeline looks weird, but my model input consists of three RGB images. It's a strange way to do it, but that's how it is currently designed. This is not the problem I want to address right now, though.
The problem is that after unloading the models, I would want to free the memory.
After unloading the DALI models, the Triton Inference Server keeps the memory and does not release it. If I could unload it completely and load it again, that would be great. The model warmup is not a problem :) The GPU memory is the problem! For my use case, I want to free the GPU memory from the inference server and use it elsewhere. So yes, I want the warmup from scratch :)
I see. This actually might be a bug, I'd need to check this out. Thanks for reporting and the repro. I'll be posting status updates here.
Cheers!
I did some more research on this topic. Generally, it's not a bug, it's a feature.
Our intention in DALI was to virtually never deallocate GPU memory (it is freed after the process exits). The reason is that we keep a pool of GPU memory, shared by all DALI Pipelines in a given process, and creating subsequent Pipeline objects is much cheaper when the memory is already allocated. The peaks marked by arrows in the image above are rather something unwanted in this behaviour; these come from external libraries, where we cannot control the memory allocation.
That being said, I believe the use case presented above is a valid one and a legitimate reason to introduce a way to actually free the GPU memory. We'll introduce such an option in DALI and DALI Backend. I'll post status updates here.
@szalpal Thanks for the updates! I know it's expensive to reallocate GPU memory in terms of time, but if it's an optional configuration setting, that would be great!
Cheers!
@szalpal, curious if this feature ever got added? My team ran into this issue recently and thought it was a bug. We were creating and destroying CUDA shared memory regions many times sequentially in the same process, and saw GPU memory usage increase until we ran out of memory. This did not happen prior to our switch to dali_backend (we are using dali_backend for image preprocessing, which was previously done before writing the image to shared GPU memory). Our proposed fix is to avoid creating and destroying shared memory many times in the same process, but it would be good to know if there is a way to avoid the increasing memory usage and warm up from scratch instead.
@nrgsy ,
We did not add it to DALI Backend; however, I believe the required functionality already exists in DALI, so I'll create a PR adding it. Thank you for bringing attention to this.
@nrgsy , @appearancefnp ,
The PR is merged. You can expect the feature in the next Triton release.
Hello!
I was excited about the 16-bit TIFF decoding, but there is a bug: the DALI backend does not release memory when a model is unloaded. Even when you unload the DALI model and load it again, it consumes more memory than before. Although it converges to a fixed number, that number is very high: 7GB for batch size 3.
In the image you can see that when a DALI model is loaded, the memory increases, which is fine. The dip is where the model is unloaded; it doesn't release the memory back to the initial minimum. When the model is loaded again, the memory increases again. I've done this model "reloading" multiple times, and after some time the memory growth stops.
DALI version: 1.22.0dev, but I think the problem persists with older versions too.
This is how the pipeline was generated. I don't think the pipeline itself is the problem; rather, the DALI backend is not cleaning up the memory.
To reproduce this problem:
Is there a way to limit the memory growth (because 7GB over the baseline is too much) or fix this issue? I want to decode `3x5000x10000x3x2 (size of uint16)` images; that should be around 900MB of pure data.