nickbabcock / Pfim

.NET Standard targa (.tga) and direct draw surface (.dds) image decoder
https://nickbabcock.github.io/Pfim/
MIT License

Avoid Large Allocations via Pre-Allocated Buffers and Buffer Pooling #47

Closed · adamkvd closed this issue 5 years ago

adamkvd commented 5 years ago

Summary

Games and other realtime applications run on strict time and memory budgets that are easily blown by GC pauses and unnecessary allocations. Right now, loading a DDS or TGA causes a number of large new byte[] allocations, which can become massive for large textures.

Details

nickbabcock commented 5 years ago

Yeah I can definitely get behind this.

Are you thinking something along the lines of

interface IImageAllocator
{
    /// <summary>
    /// Allocate the data to hold pixel data. This will be the same
    /// data that is returned through `IImage.Data`
    /// Length of returned data can exceed size requested.
    /// </summary>
    byte[] AllocateData(int size);

    /// <summary>
    /// Allocate a buffer that is used temporarily during image decoding. All buffers
    /// allocated with rent will be returned by the end of the decoding process.
    /// Length of returned data can exceed size requested.
    /// </summary>
    byte[] Rent(int size);

    /// <summary>
    /// Returns a buffer to the pool that was previously obtained by rent.
    /// </summary>
    void Return(byte[] data);
}
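To make the intended split concrete, here is a rough sketch of how a decoder might use such an allocator. The method shape and parameter names are hypothetical, not Pfim's actual internals:

using System.IO;

static class AllocatorUsageSketch
{
    // Hypothetical decode flow: Rent/Return bound short-lived scratch memory,
    // while AllocateData produces the long-lived buffer exposed as IImage.Data.
    public static byte[] Decode(Stream stream, IImageAllocator allocator, int scratchSize, int pixelDataSize)
    {
        byte[] scratch = allocator.Rent(scratchSize);
        try
        {
            byte[] data = allocator.AllocateData(pixelDataSize);
            // ... read from `stream` into `scratch` and decode into `data` ...
            return data;
        }
        finally
        {
            allocator.Return(scratch);
        }
    }
}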

And then add IImage.DataLength to work around cases where the allocated buffer exceeds the requested size.
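For consumers, that property would bound reads into the possibly oversized pooled buffer. A minimal sketch, assuming the proposed IImage.DataLength member exists:

using System;

static class DataLengthSketch
{
    // A pooled buffer can be larger than the pixel data it holds, so only the
    // first DataLength bytes are meaningful.
    public static ReadOnlySpan<byte> Pixels(IImage image) =>
        new ReadOnlySpan<byte>(image.Data, 0, image.DataLength);
}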

This way there is a separation between short-term and long-term allocations (but both can be backed by a pool).

EDIT: Not 100% sure where potential MipMap data should be allocated, I suppose under AllocateData 🤔

adamkvd commented 5 years ago

EDIT: Not 100% sure where potential MipMap data should be allocated, I suppose under AllocateData

From my own perspective, just put it contiguously in the same buffer as the full-size image. Calculating where exactly each mip level starts isn't a big deal if I know the color / data format, and it's efficient and probably cache-friendly too.
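As a rough illustration of that layout (not Pfim code, and assuming an uncompressed format with no row padding), the start of each mip level in the shared buffer can be computed like this:

using System;

static class MipLayoutSketch
{
    // Byte offset of each mip level when all levels are packed contiguously
    // after the full-size image, assuming a simple bytes-per-pixel format.
    public static int[] MipOffsets(int width, int height, int bytesPerPixel, int mipCount)
    {
        var offsets = new int[mipCount];
        int offset = 0;
        for (int level = 0; level < mipCount; level++)
        {
            offsets[level] = offset;
            offset += width * height * bytesPerPixel;
            width = Math.Max(1, width / 2);
            height = Math.Max(1, height / 2);
        }
        return offsets;
    }
}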

Linking issue #45 for clarity, if anyone else reads along.

Are you thinking something along the lines of ...

Yep, pretty much!

One thing I'd approach differently is to keep the allocator agnostic of the purpose of the allocation, i.e. not have an AllocateData method, just the Rent and Return. The data buffer for storing the final image data could instead be passed as a parameter of the create / load method. This way, the connection between the call and the buffer used for that specific image data is explicit, reflecting the "please store this data here" use case.

Expanding on this, if there were parameters for both buffer and offset, multiple images could be stored in the same contiguous memory. Even without the offset parameter, it would already be super helpful though!
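A sketch of what that alternative could look like; the names and shape are purely illustrative and not the API that was adopted:

using System.IO;

// Hypothetical "caller-provided destination" decode entry point: the caller owns
// the memory, passes in where the pixel data should land, and gets back the
// number of bytes written so the next image can be packed right after it.
interface ITargetBufferDecoder
{
    int Decode(Stream stream, byte[] destination, int offset);
}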

nickbabcock commented 5 years ago

The data buffer for storing the final image data could instead be passed as a parameter of the create / load method

This seems like too specialized a use case. One would have to guess at a buffer that is large enough, but not too large, which seems cumbersome. I'd rather call out how much data is needed, so there isn't wasted space (or not enough space). (Yeah, Span and Memory would fit this use case nicely.)

Expanding on this, if there were parameters for both buffer and offset, multiple images could be stored in the same contiguous memory.

Interesting, is this suggestion born out of just performance? I'd hope that amortizing allocations via a pool would eliminate any need to store all images in a single buffer.

One thing I'd approach differently is to keep the allocator agnostic of the purpose of the allocation, i.e. not have an AllocateData method, just the Rent and Return

Yeah, good point. A couple of additional thoughts crossed my mind:

Should Rent / Return be generic over T like ArrayPool so other types of array allocations can (potentially) be amortized?

Should IImage be made to implement IDisposable so that the data array (and other arrays throughout) can be returned on dispose (and thus amortized)? This seems to have some precedent already in ImageSharp.

I am unsure if I will bother pooling small arrays (< 100 elements) as others have noted poor performance at small sizes, though I suppose I can leave this decision entirely up to the user like so:

using System.Buffers;

// Implements the pared-down allocator (Rent/Return only) discussed above.
class PfimAllocator : IImageAllocator
{
    private readonly ArrayPool<byte> _shared = ArrayPool<byte>.Shared;

    public byte[] Rent(int size)
    {
        // Pooling very small arrays tends not to pay off, so allocate those directly.
        return size < 100 ? new byte[size] : _shared.Rent(size);
    }

    public void Return(byte[] data)
    {
        // Only hand back buffers that actually came from the pool.
        if (data.Length >= 100)
        {
            _shared.Return(data);
        }
    }
}

Krakean commented 5 years ago

yeah Span and Memory would fit this use case nicely

+1 for Span/Memory. At least, as #ifdef NETCORE...

adamkvd commented 5 years ago

The data buffer for storing the final image data could instead be passed as a parameter of the create / load method

This seems like too specialized a use case. One would have to guess at a buffer that is large enough, but not too large.

That is a good point. There could be mechanisms to handle this, but they might not be very intuitive.

The use case I was coming from is that the decoded image would be put into a shared buffer, processed a little further, and then moved to its final location. I would know the maximum supported texture size in advance and would then be able to calculate the maximum supported buffer size.

That said, the allocator pattern would work as well - my other suggestion was probably tailored too closely to my specific use case.

Expanding on this, if there were parameters for both buffer and offset, multiple images could be stored in the same contiguous memory.

Interesting, is this suggestion born out of just performance? I'd hope that amortizing allocations via a pool would eliminate any need to store all images in a single buffer.

Somewhat, yes, but it was more a fuzzy "more control = good" mindset about letting devs find the most performant approach for their use case, and it would have been low-hanging fruit in the pre-allocated target buffer parameter approach. It doesn't fit the allocator design though, and it would be obsolete in a future Span / Memory world, so it might be best to ignore this idea for now to keep things simple.

Should Rent / Return be generic over T like ArrayPool so other types of array allocations can (potentially) be amortized?

Not sure. My gut feeling would say "no" to keep it close to memory as opposed to higher level views on that memory, but that's certainly not the only way to approach this. On the plus side, a memory-specific design would simplify its implementation and internal storage.
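For reference, a T-generic variant (the direction the thread leans away from) would look roughly like this; the interface name is purely illustrative:

// Illustrative only: a generic allocator in the style of ArrayPool<T>,
// as opposed to the byte[]-only design discussed above.
interface IGenericImageAllocator
{
    T[] Rent<T>(int size);
    void Return<T>(T[] data);
}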

I am unsure if I will bother pooling small arrays (< 100 elements) as others have noted poor performance at small sizes, though I suppose I can leave this decision entirely up to the user like so:

👍

Should IImage be made to implement IDisposable so that the data array (and other arrays throughout) can be returned on dispose (and thus amortized)? This seems to have some precedent already in ImageSharp.

One use case is that the returned IImage is queried for image metadata and then discarded as its content is transformed or forwarded to an engine-specific data structure. Where possible, an extra copy of the data buffer will be avoided. However, if discarding the IImage would imply returning the buffer, this would no longer be possible.

Sure, I can just not call Dispose, but that's not what best practices suggest I do with an object implementing IDisposable. I could also keep the IImage around, but that could turn out cumbersome when Pfim is only one of several texture loader libs and is generally isolated within a format-specific implementation.

One approach to solve this could be to let it implement IDisposable, but also provide a Detach method that explicitly decouples the buffer from the IImage. That way the default behavior would be to return the buffer, but I could get around this in a documented way if my use case demanded it.
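A minimal sketch of that shape (illustrative only, not Pfim's actual types): Dispose returns the buffer to the allocator unless ownership was transferred first via Detach.

using System;

// Illustrative only: an image wrapper where Dispose returns the pooled buffer,
// and Detach hands ownership of the buffer to the caller instead.
class PooledImage : IDisposable
{
    private readonly IImageAllocator _allocator;
    private byte[] _data;

    public PooledImage(IImageAllocator allocator, byte[] data)
    {
        _allocator = allocator;
        _data = data;
    }

    public byte[] Data => _data;

    // After Detach, Dispose no longer returns the buffer to the pool.
    public byte[] Detach()
    {
        var data = _data;
        _data = null;
        return data;
    }

    public void Dispose()
    {
        if (_data != null)
        {
            _allocator.Return(_data);
            _data = null;
        }
    }
}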

nickbabcock commented 5 years ago

Created #50; let me know if you have any thoughts.

adamkvd commented 5 years ago

Looking good 👍

nickbabcock commented 5 years ago

Pfim 0.7 has been released, allowing buffer pooling. Detaching a buffer seems to be slightly tangential so I've opened a new issue for further discussion.
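A rough sketch of wiring a pooled allocator into a decode call. The exact PfimConfig / FromFile signature here is an assumption on my part; check the 0.7 release notes and README for the real API.

// Assumed wiring (verify against the 0.7 API): pass a pooling allocator via the
// decoder config so pixel buffers come from (and go back to) the pool.
var allocator = new PfimAllocator();
using (var image = Pfim.FromFile("texture.dds", new PfimConfig(allocator: allocator)))
{
    // image.Data holds the pixel data while the image is alive;
    // disposing returns pooled buffers to the allocator.
}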