m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net
Other
1.38k stars 117 forks source link

PageLockScope issue #691

Closed yuryGotham closed 9 months ago

yuryGotham commented 2 years ago

Not sure if it is a bug or a missing feature, but it appears to be impossible for multiple GPUs to use the same pinned memory on host.

Code:

            var startTime = DateTime.Now;
            double timeSpent;
            void LogTime(string s) { timeSpent = DateTime.Now.Subtract(startTime).TotalSeconds; Console.WriteLine($"{DateTime.Now}\tTime to {s}: {timeSpent:0.0000}s"); startTime = DateTime.Now; }

            var context = Context.Create().AllAccelerators().EnableAlgorithms().ToContext();
            LogTime("create context with " + context.GetCudaDevices().Count + " gpus");

            var accelerators = Enumerable.Range(0, context.GetCudaDevices().Count).Select(i => context.CreateCudaAccelerator(i)).ToArray();

            int size = 1024;
            var cpuShared = new float[size];
            var intPtr = GCHandle.Alloc(cpuShared, GCHandleType.Pinned).AddrOfPinnedObject();
            var scopes = new PageLockScope<float>[accelerators.Length];
            LogTime("create cpu array");

            for(int accId = 0; accId < accelerators.Length; accId++) {
                scopes[accId] = accelerators[accId].CreatePageLockFromPinned<float>(cpuShared);
                LogTime("create scope for accelerator " + accId);
            }

Output:

Unhandled exception. ILGPU.Runtime.Cuda.CudaException: part or all of the requested memory range is already mapped
   at ILGPU.Runtime.CudaPageLockScope`1..ctor(CudaAccelerator accelerator, IntPtr hostPtr, Int64 numElements)
   at ILGPU.Runtime.Cuda.CudaAccelerator.CreatePageLockFromPinnedInternal[T](IntPtr pinned, Int64 numElements)
   at ILGPU.Runtime.Accelerator.CreatePageLockFromPinned[T](IntPtr pinned, Int64 numElements)
   at ILGPU.Runtime.Accelerator.CreatePageLockFromPinned[T](T[] pinned)

I suspect that CreatePageLockFromPinned() is trying call cudaHostRegister a second time when it does not need to - it only needs to call cudaHostGetDevicePointer for subsequent accelerators.

MoFtZ commented 2 years ago

hi @yuryGotham.

ILGPU does not keep track of the memory range that is being locked - it is the responsibility of the calling application to manage this, and ensure that the same memory range is not locked twice.

Please note that ILGPU is using the cudaHostRegisterPortable flag when making the call to cudaHostRegister. This means that the device pointer "returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation". Therefore, it is not necessary to call page lock on each accelerator.

yuryGotham commented 2 years ago

Hi @MoFtZ,

It looks like I'm able to get the pinned memory to work for the most part. One thing that I'm still unable to figure out is how to do a partial copy between MemoryBuffer and PageLockScope.

In my scenario PageLockScopes are created ahead of time and used many times (creating a scope is slow). The amount of memory that I need to copy from PageLockScope to MemoryBuffer varies during runtime. I can dynamically point to a region of MemoryBuffer by using View.SubView(), but I cannot point to a region of PageLockScope and creating a new PageLockScope each time is VERY slow.

Is it possible to add the functionality to point to a part of PageLockScope? Even being able to point to first X bytes of PageLockScope would solve my specific problem.

MoFtZ commented 2 years ago

@yuryGotham @m4rs-mt I have raised a PR to allow copying to/from a subset of a page locked array. Instead of adding lots of new methods, I have simplified the API to use the existing CopyFrom and CopyTo ArrayView methods.

The PageLockedScope class now has an ArrayView property to allow it to be used with the existing copy methods.

MoFtZ commented 2 years ago

@yuryGotham now that #697 has been postponed, you can use this workaround in ILGPU v1.0.

using var memoryBuffer = CPUMemoryBuffer.Create(
    accelerators[accId],
    scopes[accId].AddrOfLockedObject,
    scopes[accId].Length,
    Interop.SizeOf<float>());

var arrayView = memoryBuffer.AsArrayView<float>(0L, memoryBuffer.Length);

This creates a memory buffer over the page locked array. And from the memory buffer, we get an array view. This array view can then be sliced and used in copy operations.

yuryGotham commented 2 years ago

@MoFtZ how do I copy the arrayView to gpu using locked/async method? I'm currently calling gpuBuffer.View.CopyToPageLockedAsync(stream, pageLockedScope). CopyToPageLockedAsync does not accept ArrayView as an argument - is there a way to cast ArrayView to PageLockedArray?

MoFtZ commented 2 years ago

@yuryGotham once the CPU buffer is page-locked, you can just copy to/from the buffer as normal.

Here is a sample program:

static void Main()
{
    using var context = Context.CreateDefault();
    using var accelerator = context.CreateCudaAccelerator(0);

    // Create a pinned CPU buffer.
    int size = 1024;
    var cpuShared = new float[size];
    var gcHandle = GCHandle.Alloc(cpuShared, GCHandleType.Pinned);
    try
    {
        // Page lock the pinned CPU buffer.
        using var pageLockScope = accelerator.CreatePageLockFromPinned(cpuShared);

        // Wrap CPU buffer in an array view.
        using var cpuMemoryBuffer = CPUMemoryBuffer.Create(
            accelerator,
            pageLockScope.AddrOfLockedObject,
            pageLockScope.Length,
            Interop.SizeOf<float>());
        var cpuArrayView = cpuMemoryBuffer.AsArrayView<float>(0, cpuMemoryBuffer.Length);

        // Allocate a GPU buffer.
        using var gpuBuffer = accelerator.Allocate1D<float>(256);

        // Copy a subset of the pinned/page-locked CPU view to the GPU.
        var sourceView = cpuArrayView.SubView(32, 256);
        var destinationView = gpuBuffer.View;

        // NOTE: CopyFrom does not have an implicit call to stream.Synchronize.
        destinationView.CopyFrom(sourceView);
        accelerator.Synchronize();
    }
    finally
    {
        gcHandle.Free();
    }
}