luxonis / XLink

A cross-platform library for communicating with devices over various physical links.
Apache License 2.0

Better memory management #22

Open themarpe opened 2 years ago

themarpe commented 2 years ago

To provide better memory management, the following changes are proposed:

diablodale commented 2 years ago

In thinking on this specific issue... Is this related to https://github.com/luxonis/depthai-core/issues/253?

When NO...

I see no harm in adding a custom allocator for XLinkPlatformAllocateData but I do not see great value. What is the use case?

For an image frame... the current depthai+xlink codebase sends a single big chunk of data from device to host. There are probably smaller USB packets, and definitely smaller IP packets for PoE, yet eventually those are transmitted and collected into a big chunk. This is my understanding of the code so far. This chunk contains two blocks within it. For example... an RGB camera capture...

This whole chunk (image frame + metadata) is one allocation of memory. The image frame block and the metadata block are both inconsistent sizes. Therefore the whole chunk is an inconsistent size... meaning... the memory allocations by XLinkPlatformAllocateData (default or custom) will be of inconsistent sizes. Sidebar: A custom memory allocator+pool is less useful in this situation because the allocations would be of random sizes. There might be a few sizes that are more common... but without doing real-world testing to build hit rates and histograms... I am skeptical.

And... the allocations are not just the image data. The chunks are image data + metadata. And the layout is "depthai-specific". This means that whatever digests this single chunk of allocated memory has to support the "depthai-specific" memory layout. The only code that supports the "depthai-specific" layout is depthai itself. Read on for how this might relate to OpenCV...

When YES...

OpenCV does not support the storage of image data (e.g. cv::UMat) and custom metadata in the same object. Discussed here https://github.com/opencv/opencv/issues/18792

UMat is a generic holder of a single block of allocated data. You can tell UMat to interpret this single block of memory as having 640 cols, 480 rows, and 3 bytes per element. UMat then expects the allocated block of memory to be 640x480x3 bytes (with striding at the end of each row). This total size of cols × rows × elemSize is crucial. UMat and all functions of OpenCV rely on this promise of total size for mem copies, image processing, OpenCL, CUDA, etc. There are some esoteric features of OpenCV allowing for custom iterators across UMat, but those would not work for GPU or VPU uses.
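
A minimal sketch of that size contract, using only plain OpenCV (the 640x480x3 numbers are just the example above):

#include <opencv2/core.hpp>
#include <cassert>

int main() {
    cv::UMat img(480, 640, CV_8UC3);  // 480 rows, 640 cols, 3 bytes per element
    // OpenCV assumes the allocated buffer covers exactly rows * cols * elemSize bytes;
    // every CPU/OpenCL/CUDA routine relies on this total-size promise.
    assert(img.total() * img.elemSize() == 480u * 640u * 3u);
    return 0;
}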

There is no cv::UMat accommodation for other data or metadata. There is no place to store or describe the "depthai-specific" metadata. If you were to create cv::UMat(void* startFrame, size_t frameLength) and then use OpenCV to digest and process the block of data on CPU/VPU/GPU, it is almost assured that the data and "depthai-specific" metadata will be separated, corrupted, or leaked because they are originally stored together directly in the same memory allocation.

It is possible for an app to create its own class (I made my own UMatmeta) which has its own proprietary metadata fields and inherits from cv::UMat. I can then populate UMat::data with image data, and populate my custom metadata into my class's custom fields.
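
Roughly the shape of such a wrapper - the metadata fields below are hypothetical examples for illustration, not depthai's actual layout:

#include <opencv2/core.hpp>
#include <cstdint>

// App-specific metadata lives next to (not inside) the pixel buffer.
class UMatmeta : public cv::UMat {
public:
    using cv::UMat::UMat;        // keep the normal UMat constructors
    int64_t timestampUs = 0;     // example fields only
    int32_t sequenceNum = 0;
    int32_t exposureUs  = 0;
};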

Notes

If the depthai "data" and "metadata" are in the same memory allocation... in the same big chunk collected together after transmission... then I doubt the benefit of a custom allocator, given the above.

If they are separated... then a custom allocator is more interesting. The metadata is of random size, so keep using the internal XLink allocation system, since the metadata and its layout are "depthai-specific". No need to expose that to the app code.

But the data as a separate isolated block of allocated memory... that is useful. Even if the size isn't the same. I can imagine my custom allocator allocating a block of OpenCL memory (via cv::UMat) for the image data and then passing that pointer to depthai+xlink so it will collect the USB packets into that block.

Or maybe there is an additional neural net on the host GPU and the app is waiting for a block of data from the device. The app's NN wants the device data to be written into a preallocated block of memory and passes that pointer to depthai+xlink. And the USB packets are then written into it.

Given those two examples (UMat, NN)... what if they are the same app? What if the app runs a NN and the app also uses OpenCV image processing? In that case, there are two custom allocators. I can easily extend this chain of thought to have more custom allocators in the same app.

themarpe commented 2 years ago

It is also related to the mentioned issue in core - but its main intention is to not require copying the memory. As discussed in your PR, the capability of moving the data out of XLink, instead of having to copy it, reduces a lot of overhead on big packets.

How this strings together with both issues is twofold:

The second one is partial in this case as, as you've mentioned, you don't get to control the overall size precisely...

There might be a few sizes that are more common...but without doing realworld testing to build hitrates and histograms...I am skeptical.

As much as the size of data is "guaranteed" (depending on your pipeline on the device side), the size of metadata is also capped. Currently XLink is at 50kiB. (To make this even more useful, this should be exposed via depthai-shared.) So one could do pools relatively easily, as long as they are aware of the incoming message sizes of the given stream, with max(max(data), max(metadata)) as the pool element size.
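
For illustration only, pool sizing along those lines could look like the following; maxDataSize is a made-up pipeline-dependent value, and maxMetadataSize is the 50 kiB cap mentioned above:

#include <algorithm>
#include <cstddef>

// Hypothetical sizes: data depends on the pipeline, metadata is capped by XLink.
constexpr std::size_t maxDataSize     = 300 * 300 * 3;  // e.g. a 300x300 BGR preview frame
constexpr std::size_t maxMetadataSize = 50 * 1024;      // current XLink metadata cap

// Round up to 64B so every pool element also satisfies the alignment/size rules.
constexpr std::size_t roundUp64(std::size_t n) { return (n + 63) / 64 * 64; }

// Pool element size as suggested: max(max(data), max(metadata)).
constexpr std::size_t poolElementSize = roundUp64(std::max(maxDataSize, maxMetadataSize));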

So using this to be able to have your own pool allocators might be a bit harder if the allocator were XLink-global. It might make more sense for it to be local to a stream. So if it is stream-local, and if we expose the maximum metadata size, one could still do pools more easily.

But the main reason, I imagine, is being able to allocate memory where needed at specified alignments - so if we want the incoming 300x300 BGR image to be allocated 64B-aligned in a certain memory region, then that is possible as well.

Regarding the metadata that is at the end of data, that can be ignored* - Example:

So at the end of the day, you are left with an ImgFrame that has 300x300 BGR frame amount of data which is, let's say, aligned, or in some specific memory area, etc.


So to circle back, the main reason IMO is mostly zero copy, then alignment, and lastly the capability of doing pools and allocating in special memory areas, etc., if that helps the consumer of the data more.

*ignored as in not needing to access/use that serialized metadata, which is an implementation detail, but instead using the ImgFrame object to access the metadata.

diablodale commented 2 years ago

The following is more discussion on challenges...

I get your intention. This is clear and easy for host-only CPU memory. It changes when GPUs are involved with APIs like OpenCL or CUDA -- additional restrictions. And likely additional code that would need to be implemented into OpenCV itself or the main app to support subbuffers (non-trivial). Example with OpenCV and OpenCL...

Imagine a depthai custom allocator. It creates an OpenCL zero-copy buffer. Usually integrated GPUs support zero-copy, while discrete GPUs support low-copy (still have to copy over PCIe). Most hardware requires the start address to be 64-byte aligned and the total allocated size to be a multiple of 64 bytes. Here are Intel's requirements https://www.intel.com/content/www/us/en/developer/articles/training/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics.html#inpage-nav-2

A 300x300 bgr will NOT satisfy the requirements for an OpenCL zero-copy buffer. OpenCV does not adjust striding https://github.com/opencv/opencv/blob/d24befa0bc7ef5e73bf8b1402fa1facbdbf9febb/modules/core/src/ocl.cpp#L5426. Therefore, a 300x300 bgr will result in many copies occurring within the OpenCL runtime. For this example, use instead a 320x320 bgr image.
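
The arithmetic behind that suggestion, checked against the 64-byte rules:

// 300 px * 3 B/px = 900 B per row  -> not a multiple of 64, so zero-copy requirements fail
static_assert(300 * 3 % 64 != 0, "300-wide BGR rows are not 64B multiples");
// 320 px * 3 B/px = 960 B per row  -> exactly 15 * 64, so every row starts 64B-aligned
static_assert(320 * 3 % 64 == 0, "320-wide BGR rows are 64B multiples");
// total buffer 320*320*3 = 307200 B = 4800 * 64 -> the total-size requirement is met as well
static_assert(320 * 320 * 3 % 64 == 0, "total size is a 64B multiple");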

Perhaps this is a traditional allocator+context situation. Pseudocode...

// standard depthai interface for allocator results
struct Depthai_Alloc_Return {
    void* const data;    // xlink understands where to start writing
    const size_t size;   // xlink knows bounds
    void* const context; // xlink has no knowledge, always passes to next step
};

// custom struct defined and used only within my app
struct My_App_Context {
    UMat gpuUmat;
    Mat hostMat;
};

// this function is declared/defined in my app, and a pointer to it is passed to a depthai API call that cascades into XLink
Depthai_Alloc_Return customAllocator(size_t xlinkNeedByteSize) {
    // enforce size requirement
    assert(xlinkNeedByteSize % 64 == 0);

    // ugly raw new; requires that xlink and later functions never fault without deleting this mem
    auto myContext = new My_App_Context();

    // must live beyond this custom allocator
    myContext->gpuUmat = cv::UMat(1, (int)xlinkNeedByteSize, CV_8UC1, cv::USAGE_ALLOCATE_DEVICE_MEMORY);

    // also must live beyond this allocator because it is the
    // thing that maps gpu memory to the host. On destruct, it unmaps and the host loses access
    myContext->hostMat = myContext->gpuUmat.getMat(cv::ACCESS_WRITE);

    // where do we put the metadata?
    // all OpenCV apis (and devs) expect gpuImage to work like this
    // cv::UMat nv12gpuImage;
    // cv::cvtColor(gpuImage, nv12gpuImage, COLOR_YUV2BGR_NV12)
    //
    // submatrix on cpu memory works well, submatrix on gpu memory is fragile -- have to test

    return {
        myContext->hostMat.data,
        myContext->hostMat.total() * myContext->hostMat.elemSize(),
        myContext
    };
}

First, the issue of the lifetimes of gpuUmat and hostMat. And the main app must operate on those objects. They can't use a gsl::span(hostMat.data, size). A simple pointer and size does not have all the OpenCV+OpenCL class fields needed. The code above creates a custom context and a raw new to persist these classes.

Second, the issue of size, striding, and the metadata. If XLink calls auto buffer = customAllocator(datasize + metadatasize); then someone has to manage striding, and the position+size of the metadata so it eventually is on a "row" of its own. It has to be a row because all of the OpenCV functions operate on rows. Therefore, it is only realistically possible to append the metadata as a series of "rows" at the end of the UMat. These rows must also follow the rules of 64-byte alignment and 64-byte sizing.

But XLink has no knowledge of rows and columns. Who does this needed calculation to position/size the metadata at the end of the UMat? This has to be done before data is written into the block of memory. And then it has to be undone by depthai when it receives the packet struct from a Stream::read().
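
Purely to illustrate the calculation being asked about (the numbers and names are hypothetical):

#include <cstddef>

// Pad the serialized metadata out to whole image "rows" so OpenCV's row-based
// functions never touch it, and the 64B stride rules stay intact.
constexpr std::size_t strideBytes   = 320 * 3;   // one image row, already a 64B multiple
constexpr std::size_t metadataBytes = 4096;      // whatever the serialized metadata happens to be
constexpr std::size_t metadataRows  = (metadataBytes + strideBytes - 1) / strideBytes;
constexpr std::size_t totalBytes    = (300 + metadataRows) * strideBytes;  // what customAllocator must provide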

Pretend depthai parses the metadata and now passes ?something? to something in the main app which can re-constitute the OpenCV classes and deallocate.

UMat something(somethingParam param) {
    // get result block of memory that contains the image data, metadata, and context
    auto block = dai::getImage();

    My_App_Context myContext;
    try {
        myContext = *static_cast<My_App_Context*>(block.context);
    }
    catch (...) {
        delete block.context;
        return {};
    }

    delete block.context;

    // somehow separate the image data from the metadata
    // because the custom allocator allocated all the memory together in one block
    auto submatrix = cv::UMat(myContext.gpuUmat, cv::Range::all(), cv::Range(0, block.rows * block.stride));

    // reshape UMat based on the depthai metadata
    return submatrix.reshape(block.channels, block.rows);

    // destructor for Mat and UMat within myContext should run as scope ends here
}

themarpe commented 2 years ago

It changes when GPUs are involved with APIs like OpenCL or CUDA -- additional restrictions. And likely additional code that would need to be implemented into OpenCV itself or the main app to support subbuffers (non-trivial). Example with OpenCV and OpenCL...

I'm not familiar as to how UMat/OpenCL/CUDA functions in terms of memory management and available APIs, so read the following with that in mind.

A 300x300 bgr will NOT satisfy the requirements for an OpenCL zero-copy buffer. OpenCV does not adjust striding

Correct, but this is a data issue - to resolve it we must provide means to set a stride on the device (or one strides the image on the host, but this is a "copy operation", and it's "easy" again - as we can "copy" to a needed destination memory location), to be able to e.g. resize an image to 300x300 and set the stride to a multiple of 64. So the resulting image would be: 320 × 300 × 3 (stride × height × planes) = 288000 B. This will arrive at XLink, into a specified buffer. After everything is processed as I've previously mentioned, you are given a zero-copy data ptr that points to the beginning of this image. If we allocate it 64B-aligned, it can also start at a 64B-aligned address.

Afterwards, one (UMat handwaving ahead) creates a UMat from the existing memory and uses that to compute upon, send to a GPU, etc. Same as in the getFrame case, where the returned cv::Mat is only a view of the memory and does not own it per se. (If ownership is required, then we can expose a release from the Message standpoint, so one can transfer that, IF the API supports taking ownership on the other end.)

As to the metadata, no need to worry about it here yet - it gets transferred alongside the data, but afterwards the only thing relevant is the final Message. In this case, that'll be e.g. an ImgFrame with its data ptr and the rest of the information. To construct a correct cv::Mat in our case, we use ImgFrame::getWidth, ImgFrame::getHeight, ImgFrame::getType, etc., as well as the data ptr, to either transfer into it or make a view of it. (Again, not sure about UMat, I assume one can do the same.)
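
A hand-wavy sketch of that view construction - it assumes an interleaved BGR ImgFrame, that getData() exposes the raw buffer, and it skips deriving the cv type from ImgFrame::getType:

#include <depthai/depthai.hpp>
#include <opencv2/core.hpp>
#include <memory>

// Non-owning view over the ImgFrame data; the ImgFrame must outlive the returned Mat.
cv::Mat asView(const std::shared_ptr<dai::ImgFrame>& frame) {
    return cv::Mat(frame->getHeight(),        // rows
                   frame->getWidth(),         // cols
                   CV_8UC3,                   // assumed BGR; normally derived from frame->getType()
                   frame->getData().data());  // raw ptr into the XLink-backed buffer
}

// If the consuming API cannot work on a view, one explicit copy (e.g. asView(frame).clone()) is the fallback.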

The above is possible IFF the recipient API supports viewing/taking ownership of existing memory (with the given limitations, it being aligned, etc.). If this isn't possible, then things get a bit harder, likely requiring a copy, to not go through all the pain otherwise. (A quick example would be std::vector, hah.)

In the UMat/OpenCL/CUDA cases, is such an approach possible, or do we always get memory from their API, and that's what is available to work with? Or is the whole premise above false, as we're always just given memory by their API and have to work with that?

diablodale commented 2 years ago

All you wrote is CPU-centric (which you caveated). The overall thinking doesn't apply in OpenCL, CUDA, and (from what I understand) Vulkan. You describe something that can already be done today with depthai-core. There is no need for a custom allocator... because depthai-core doesn't need such a feature. depthai-core can use the aligned boost vector to make the "existing memory".

Perhaps you do this in stages. Start with adding alignment and UMat support.

  1. Have depthai-core use a boost aligned vector instead of std::vector. Easy to implement, I've been running it for a few weeks. Alignment helps when data is exchanged with OpenCV, as 64-byte alignment is preferred by both OpenCV and OpenCL (see the sketch after this list).
  2. Have depthai write the "data" to the start of that aligned vector. This is already depthai code. No work here.
  3. Start the metadata after that and ensure it also starts on a multiple of 64 bytes. This section can be any length.
  4. Add UMat OpenCL support by copying ImgFrame::getFrame() and adding 4 lines of code. You can see some of that in the pseudocode above. It is trivial.
  5. The app is responsible for keeping the ImgFrame alive longer than the life of the UMat
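
Item 1 in code form - a sketch using Boost.Align's allocator (the type name is made up):

#include <boost/align/aligned_allocator.hpp>
#include <cstdint>
#include <vector>

// Drop-in replacement for std::vector<std::uint8_t> whose buffer always starts
// on a 64-byte boundary, which is what OpenCV/OpenCL prefer.
using AlignedBuffer =
    std::vector<std::uint8_t, boost::alignment::aligned_allocator<std::uint8_t, 64>>;

// usage: AlignedBuffer packet(300 * 300 * 3);  // packet.data() is 64B-aligned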

Then next year, consider stage 2. From my perspective, a custom allocator is when my app provides the allocator. And my allocator provides depthai-core a pointer. And, depthai-core must write the frame data to that address. Not the opposite. No "from existing". The app owns the allocator, the app's allocator provides the pointer, therefore the app owns the memory. That's https://github.com/luxonis/depthai-core/issues/253

More on OpenCL and OpenCV

I have OpenCL experience (OpenCV's default gpu/device api) so I'll write from that perspective. I will "personify" OpenCL for brevity. And ignore the less-common scenarios.

OpenCL has a device (gpu, vpu, etc.) view of the world. By default...it is always about the device. The host (traditional cpu) only exists to service the device. When OpenCL speaks about memory, it is, by default, speaking about its own (device) memory. It is in control of memory and only deals with host-memory with explicit apis.

A block of memory...means a block of device memory. This memory is accessible only to the device. The host (cpu) has no access. Naturally, there is a method for data to be exchanged between device<->host. The device is in control of that. The following process is similar in CUDA and Vulkan. And roughly follows the pseudocode I wrote above.

OpenCL default process -> Device allocates+owns memory

  1. Host asks Device to allocate 1MB memory. This results in 1MB memory being allocated on the device...accessible only by the device. Remember throughout this process...the Device always owns that newly allocated memory.
  2. Host asks the Device to "map" that device memory into someplace accessible to the Host cpu. The Device considers the request. When granted, the Device maps (process is internal to OpenCL runtime) the 1MB of device memory to a memory location of Device's choice (Host has no control of location) that is accessible by the Host cpu.
  3. Host reads/writes bytes into the 1MB area starting at the pointer location given to it by the Device. While this is occurring, the Device is locked-out of that 1MB region. Both Device and Host can not simultaneously read/write that same 1MB area.
  4. When Host is complete, the Host requests the Device to "unmap" the 1MB region. Device grants the request and the OpenCL runtime manages the internal process to get that 1MB of data to a place accessible by the Device. While this is occurring (in parallel), the Host should never read/write to the 1MB of virtual memory addresses previously used in the map. The Host does not "own" those addresses and never "allocated" those addresses.
  5. After OpenCL runtime completes "unmap", the Device again has access to the 1MB block of device memory it allocated in step 1. This 1MB has all the byte values that were set at the beginning of step 4. The Device can now read/write to this block and run programs that use it.

That is the default process that works across the widest set of Device+Host+OS combinations. It has the most opportunities for that ecosystem to self-optimize, zero-copy, etc. And it is often the easiest for an app to use when compared to the code and testing needed to reach the same level of optimization for alternate data exchange methods.

[Sidebar: OpenCL has no idea of "striding". It is just a block of memory. It is the responsibility of the host code -and- the device code/kernels to manage any memory layout concerns like striding]
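
The default process above, condensed into raw OpenCL calls (error handling dropped; the context and queue are assumed to already exist):

#include <CL/cl.h>

void defaultRoundTrip(cl_context ctx, cl_command_queue queue) {
    const size_t size = 1 << 20;                                    // 1MB
    // 1. Device allocates and owns the buffer
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, nullptr, nullptr);
    // 2. Host asks the Device to map it somewhere host-accessible (Device picks the address)
    void* host = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE, 0, size,
                                    0, nullptr, nullptr, nullptr);
    // 3. Host reads/writes the mapped region here (Device is locked out meanwhile)
    // 4. Host unmaps; the runtime migrates the data back to where the Device can use it
    clEnqueueUnmapMemObject(queue, buf, host, 0, nullptr, nullptr);
    // 5. Device can now run kernels against buf
    clFinish(queue);
    clReleaseMemObject(buf);
}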

Alternate 1 -> Device still allocates+owns memory yet copies Host memory to Device

One alternate data exchange process is the one you allude to in your post immediately above: "...creates UMat from existing memory". The OpenCL runtime will do what is requested and will make as many copies of the whole 1MB of data as needed to make it happen. Optimization is no longer a priority.

To aid in maybe optimizing (not promised) there are strict requirements that are different for each manufacturer, and even different models of the same manufacturer. Intel's requirements I linked above. Requirements like (bufStart % 64 == 0) and (bufSize % 64 == 0).

  1. Host asks Device to allocate 1MB memory and provides a pointer to Host memory previously allocated by Host. This results in 1MB memory being allocated on the Device accessible only by the Device... and... OpenCL immediately "copies" 1MB from that Host pointer to the Device. Remember throughout this process...the Device always owns that newly allocated memory.
  2. [Same] Host asks the Device to "map" that device memory into someplace accessible to the Host cpu. The Device considers the request. When granted, the Device maps (process is internal to OpenCL runtime) the 1MB of device memory to a memory location of Device's choice (Host has no control of location) that is accessible by the Host cpu.
  3. [Same] Host reads/writes bytes into the 1MB area starting at the pointer location given to it by the Device. While this is occurring, the Device is locked-out of that 1MB region. Both Device and Host can not simultaneously read/write that same 1MB area.
  4. [Same] When Host is complete, the Host requests the Device to "unmap" the 1MB region. Device grants the request and the OpenCL runtime manages the internal process to get that 1MB of data to a place accessible by the Device. While this is occurring (in parallel), the Host should never read/write to the 1MB of virtual memory addresses previously used in the map. The Host does not "own" those addresses and never "allocated" those addresses.
  5. [Same] After OpenCL runtime completes "unmap", the Device again has access to the 1MB block of device memory it allocated in step 1. This 1MB has all the byte values that were set at the beginning of step 4. The Device can now read/write to this block and run programs that use it.

Notice step 2. It is the same as the default process. Meaning... the "source" Host pointer given to OpenCL in step 1 is not the same pointer given back to the Host in step 2. It might be, it might not be. OpenCL runtime will copy data to any location it chooses when it chooses. There are many caveats and options and conditions. Host can only read/write to the pointer location the Device gives it when "map" in step 2. This is why it is best to use the default process because that has the most opportunities for optimization and no copies.
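
In OpenCL terms, Alternate 1 only changes step 1 - CL_MEM_COPY_HOST_PTR makes the runtime copy from the existing host pointer at creation time (sketch; error handling omitted):

#include <CL/cl.h>

// The host pointer handed in here is NOT necessarily the pointer a later
// clEnqueueMapBuffer returns - the runtime copies wherever it chooses.
cl_mem createCopiedBuffer(cl_context ctx, void* existingHostPtr, size_t size) {
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                          size, existingHostPtr, nullptr);
}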

Alternate 2 -> Device still allocates+owns memory and Host wants same Host memory address

Second alternate data exchange process for "...creates UMat from existing memory". The OpenCL runtime will consider what is requested and may deny it. When the Host requests the use of the same Host memory address there is more possibility for out-of-memory scenarios. This is a special and limited kind of memory; often called "pinned memory". If too much "pinned memory" is being used (by the entire OS), then OpenCL will deny the request and the codepath can not proceed. The amount of "pinned memory" available differs based on device models, device manufacturers, device drivers, OS setting, and host hardware platform. For example, I suspect a raspberry pi has less possible pinned memory compared to a giant gaming Nvidia desktop. One shouldn't fear using pinned memory, but instead be responsible with its use.

Unfortunately, OpenCV is a carnival of code with respect to this Alternate 2 with pinned memory. It is fragile, often does needless copies of the buffer...it is a mess. I've fixed all the issues with my private OpenCV -- running great for months. I gave up working with the OpenCV maintainer (paired with me) for reasons not relevant to this discussion so the public OpenCV releases are problematic.

So this leads again to my recommendation to use the default approach at the top. OpenCV's OpenCL support still has race conditions with the mapping process, is not thread-safe, etc... but with this alternative it also adds needless copies... in addition to any copies OpenCL might do itself.

[Same] To aid in maybe optimizing (not promised) there are strict requirements that are different for each manufacturer, and even different models of the same manufacturer. Intel's requirements I linked above. Requirements like (bufStart % 64 == 0) and (bufSize % 64 == 0).

  1. Host asks Device to allocate 1MB memory and provides a pointer to Host memory previously allocated by Host. This results in 1MB memory being allocated on the Device accessible only by the Device... and... OpenCL immediately "defines" the 1MB data on the Device as being "the same" as the 1MB from that Host pointer. (Internally, OpenCL might have done a copy, or it might not.) Remember throughout this process...the Device always owns that newly allocated memory...which virtually lives "on top" of the block of "existing" Host-allocated memory.
  2. Host asks the Device to "map" that device memory into the same Host pointer given in step 1. The Device considers the request. When granted, the Device maps (process is internal to OpenCL runtime) the 1MB of device memory to that same memory location that is accessible by the Host cpu.
  3. [Same] Host reads/writes bytes into the 1MB area starting at the pointer location given to it by the Device. While this is occurring, the Device is locked-out of that 1MB region. Both Device and Host can not simultaneously read/write that same 1MB area.
  4. [Same] When Host is complete, the Host requests the Device to "unmap" the 1MB region. Device grants the request and the OpenCL runtime manages the internal process to get that 1MB of data to a place accessible by the Device. While this is occurring (in parallel), the Host should never read/write to the 1MB of virtual memory addresses previously used in the map. OpenCV has bugs in this area and does not always block here, so it is important that the Host doesn't operate on this memory while the Device is "unmap/migrating" the data to the Device's memory.
  5. [Same] After OpenCL runtime completes "unmap", the Device again has access to the 1MB block of device memory it allocated in step 1. This 1MB has all the byte values that were set at the beginning of step 4. The Device can now read/write to this block and run programs that use it.
  6. Host also has access to the host-side cpu 1MB memory it previously allocated. The Host cpu can call free().
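
Step 1 of this alternate in OpenCL terms - CL_MEM_USE_HOST_PTR defines the device buffer "on top" of host memory the app already owns (sketch; the 64B rules are the vendor requirements linked earlier, and the runtime may still fall back to copies if pinned memory runs out):

#include <CL/cl.h>
#include <cstdlib>

cl_mem wrapHostBuffer(cl_context ctx, size_t size /* must be a 64B multiple */) {
    // Host allocates, 64B-aligned, and keeps owning this memory
    // (step 6: the app free()s it itself after clReleaseMemObject).
    void* hostPtr = std::aligned_alloc(64, size);
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          size, hostPtr, nullptr);
}
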
themarpe commented 2 years ago

Thanks for the insight

As you've mentioned, I agree it is best split into 2 stages. Initially I'd like to get the zero copy in, along with 64B alignment by default, without having to set a custom allocator.

Afterwards, to improve the OpenCL, CUDA and other usecases, we can move into making it possible to do a zero copy into a memory buffer provided by one of the mentioned APIs (as long as the unnecessary, already-consumed serialized metadata at the end of the buffer won't cause issues).

When stage 1 is complete, we immediately get a 1-copy reduction even in OpenCL/CUDA usecases, as we can then do the 1 remaining copy into the needed buffer. Also in cases where good optimizations from these APIs aren't available, and a memory transfer to the device is made anyway, this would yield the same "magnitude of performance".

So, I think a great step forward in both cases, having stage 1 done.

Although for complete, proper integration later in stage 2, I think it might have to be redone a bit - we'll see about that afterwards.

I'll try to get a PoC of zero copy done by the end of the week. Will most likely integrate your changes in #29 and modify some parts in develop_refactor to move away from std::vector to eg StreamPacketDesc of sorts. You've also mentioned that you have a PR on core side ready as well? Might make sense to sync on those as I start moving things around in develop_refactor. Thoughts?

diablodale commented 2 years ago

Zero Copy definition?

I think we are both saying "zero copy" but meaning slightly different things. I'd like to check and confirm...

"zero copy" to me means there are no bulk full image frame copies on the host or the OpenCL device.

Is that what you mean when you write "zero copy"?

I'm ok with stage 1. Meaning depthai+xlink does the single host cpu memory allocation with the 64/64 restrictions. I would like to see how the deallocation will work. My hope is that the dai::ImgFrame will do it for me. So if I keep the ImgFrame alive, then the allocation is valid. And when my app causes the ImgFrame to go out of scope, then depthai will deallocate the memory.

My code for threads, move, and alignment

I'll push branches (not PRs). I can't do PRs because both of these need the Xlink move semantics from PR #29. When/if that is merged in, then I can open PRs.

themarpe commented 2 years ago

Is that what you mean when you write "zero copy"?

I do - but I was using it wrongly, hah. Did some additional digging around yesterday and it turns out we could be doing zero copy from USB DMA'd -> XLink as well :D Which in turn could be USB DMA'd -> "GPU memory". But it turns out it's kinda out of scope, as there are too many restrictions to go along with. (memory management would get hard fast)

Anyway, the "zerocopy" in mind here is the XLink->core and core parsing scope. So as you mentioned above.

For stage 1, I intend to move the "std::vector" out of RawBuffer - and put a StreamPacketDesc (or an equivalent that controls the XLink memory) into Buffer. Buffer can then expose more user-friendly std::vector-like interfaces to access that ptr+length, and once it goes out of scope, it deallocates the original XLink buffer.
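
A rough shape of that idea - the StreamPacketDesc fields and the release hook below are placeholders, not the final API:

#include <cstddef>
#include <cstdint>
#include <utility>

// Placeholder for the XLink-owned packet (real type/fields will differ).
struct StreamPacketDesc {
    std::uint8_t* data   = nullptr;
    std::uint32_t length = 0;
};

// Buffer owns the packet and releases the underlying XLink allocation on destruction.
class Buffer {
public:
    explicit Buffer(StreamPacketDesc packet) : packet_(std::move(packet)) {}
    Buffer(const Buffer&) = delete;                 // owning, move-only
    Buffer(Buffer&&) noexcept = default;
    ~Buffer() { /* release/deallocate the XLink buffer here */ }

    // std::vector-like, non-owning access for user code
    std::uint8_t* data() const noexcept { return packet_.data; }
    std::size_t size() const noexcept { return packet_.length; }

private:
    StreamPacketDesc packet_;
};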

And as part of that, StreamPacketParser doesn't have to copy the data; it just deserializes the last bits of the buffer into an arrived Message type.

Regarding alignment and data size, XLink by default already does 64B address alignment and 64B size alignment, so that will be handled as well.

Thanks a lot for the commits - I've already integrated the #29 into a temporary develop_refactor where I'll be putting everything together and will keep integrating with the rest until stage 1 gets into a good shape.

diablodale commented 2 years ago

got it. Since there are STL "container" standards/rules, perhaps take a 2nd look at gsl::span. Maybe there is a way to provide that back to callers needing the data, or inherit from it so the Buffer itself is the container following those rules. https://en.cppreference.com/w/cpp/named_req/Container
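
For example, a span view over the owned ptr+length (sketch; uses Microsoft GSL):

#include <gsl/span>
#include <cstddef>
#include <cstdint>

// Buffer keeps ownership; callers get a bounds-checked, Container-like view
// that satisfies most of the named requirements without copying anything.
gsl::span<const std::uint8_t> asSpan(const std::uint8_t* ptr, std::size_t length) {
    return gsl::span<const std::uint8_t>(ptr, length);
}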

Loop me in if/when you want my look at something. 👍