Removes the TRITONCACHE_EntryItem concept, flattening TRITONCACHE_Entry into a list of buffers. This reduces the set of C APIs to maintain and simplifies the allocation/eviction logic within the cache.
In the context of Triton, each buffer corresponds to a serialized representation of an InferenceResponse and contains all of the information needed to deserialize each InferenceResponse::Output for that response.
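The flattened design can be sketched as follows. This is an illustrative mock only, under the assumption that an entry is now a plain growable list of buffers with no intermediate "item" level; the struct and function names here are hypothetical stand-ins, not the actual TRITONCACHE C API.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mock of the flattened entry: just a list of buffers. */
typedef struct {
  void* base;       /* serialized InferenceResponse bytes */
  size_t byte_size; /* size of the serialized response */
} EntryBuffer;

typedef struct {
  EntryBuffer* buffers; /* one buffer per cached response */
  size_t count;
} Entry;

/* Append a buffer to the entry, storing a private copy of `data`. */
int EntryAddBuffer(Entry* entry, const void* data, size_t byte_size) {
  EntryBuffer* grown =
      realloc(entry->buffers, (entry->count + 1) * sizeof(EntryBuffer));
  if (grown == NULL) return -1;
  entry->buffers = grown;
  void* copy = malloc(byte_size);
  if (copy == NULL) return -1;
  memcpy(copy, data, byte_size);
  entry->buffers[entry->count].base = copy;
  entry->buffers[entry->count].byte_size = byte_size;
  entry->count += 1;
  return 0;
}
```

With no item level, eviction can operate on entries and their buffer lists directly rather than walking a nested item hierarchy.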
An Allocator/Copy callback was added to Insert to avoid an unnecessary copy when serializing a response.
Triton sends the requested size of each buffer in the TRITONCACHE_Entry object to the cache. The cache then allocates a buffer of the requested size and uses TRITONCACHE_EntrySetBuffer to point the entry at the newly allocated address. The TRITONCACHE_Copy / TRITONCACHE_Allocator callback combo then serializes each response directly into its corresponding cache-allocated buffer.
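The Insert flow above can be sketched in miniature. This is a self-contained mock, not the real API: all type, function, and callback names are hypothetical, and "serialization" is reduced to a memcpy. The point it illustrates is that the cache allocates first, then the callback writes each response straight into cache memory, so there is no intermediate staging copy.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the entry/buffer types. */
typedef struct {
  void* base;       /* NULL until the cache allocates it */
  size_t byte_size; /* requested size, set by Triton up front */
} EntryBuffer;

typedef struct {
  EntryBuffer buffers[4];
  size_t count;
} Entry;

/* Stand-in for the TRITONCACHE_Copy/TRITONCACHE_Allocator combo:
 * serializes each response directly into its cache-allocated buffer. */
typedef void (*CopyCallback)(Entry* entry, const char** responses);

void SerializeIntoCache(Entry* entry, const char** responses) {
  for (size_t i = 0; i < entry->count; ++i) {
    /* "Serialization" here is just copying the response bytes. */
    memcpy(entry->buffers[i].base, responses[i], entry->buffers[i].byte_size);
  }
}

/* Cache-side insert: allocate a buffer of the requested size for each
 * entry buffer, point the entry at it, then invoke the callback so the
 * responses land in cache memory in a single copy. */
int CacheInsert(Entry* entry, CopyCallback copy, const char** responses) {
  for (size_t i = 0; i < entry->count; ++i) {
    entry->buffers[i].base = malloc(entry->buffers[i].byte_size);
    if (entry->buffers[i].base == NULL) return -1;
  }
  copy(entry, responses); /* single copy: response -> cache buffer */
  return 0;
}
```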
An Allocator/Copy callback was added to Lookup to avoid an unnecessary copy when passing cache contents back to Triton.
Since Triton needs the metadata of a response in order to allocate a corresponding response buffer, the callback sends the cached data back to Triton to be deserialized (to extract the metadata) and copied directly into the response buffer, all before CacheLookup returns.
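The Lookup flow can be sketched similarly. Again, every name here is a hypothetical mock rather than the real API, and a toy "serialized response" format is assumed: a size_t payload length (the "metadata") followed by the payload bytes. The callback deserializes the metadata, allocates the response buffer, and copies the cached bytes directly into it before the lookup returns.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache-side and Triton-side buffer types. */
typedef struct {
  void* base;
  size_t byte_size;
} CachedBuffer;

typedef struct {
  void* base; /* Triton-side response buffer */
  size_t byte_size;
} ResponseBuffer;

/* Triton-side callback: read the metadata (here, just the payload
 * length), allocate the response buffer, and copy the payload straight
 * from the cached buffer into it. */
int DeserializeIntoResponse(const CachedBuffer* cached, ResponseBuffer* out) {
  size_t payload_size;
  memcpy(&payload_size, cached->base, sizeof(size_t));
  out->base = malloc(payload_size);
  if (out->base == NULL) return -1;
  out->byte_size = payload_size;
  memcpy(out->base, (const char*)cached->base + sizeof(size_t), payload_size);
  return 0;
}

/* Cache-side lookup: rather than handing out a raw copy, invoke the
 * callback so the cached bytes land directly in the response buffer
 * before the lookup returns. */
int CacheLookup(const CachedBuffer* hit, ResponseBuffer* response,
                int (*cb)(const CachedBuffer*, ResponseBuffer*)) {
  return cb(hit, response);
}
```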
Corresponding core pr: https://github.com/triton-inference-server/core/pull/167