kunzmi / managedCuda

ManagedCUDA aims at easy integration of NVIDIA's CUDA into .NET applications written in C#, Visual Basic or any other .NET language.

Generic interop functions exceed array bounds for some value types due to managed/unmanaged size mismatches #56

Open mrakgr opened 6 years ago

mrakgr commented 6 years ago
public void CopyToHost(float[] aDest, CUdeviceptr aSource)
{
    if (disposed) throw new ObjectDisposedException(this.ToString());
    CUResult res;
    res = DriverAPINativeMethods.SynchronousMemcpy_v2.cuMemcpyDtoH_v2(aDest, aSource, aDest.LongLength * sizeof(float));
    Debug.WriteLine(String.Format("{0:G}, {1}: {2}", DateTime.Now, "cuMemcpyDtoH_v2", res));
    if (res != CUResult.Success)
        throw new CudaException(res);
}

The generic one does, but the autogenerated ones like the above don't. This actually gave me quite a lot of trouble. Please fix this as soon as possible.

Also, if you've got the time, check out the Spiral language. At this point I am a long-time user of ManagedCuda and I am using it for all the Cuda interop, so this is as good a time as any to say thanks. The decision of whether to compile to .NET would have been a lot harder for me without it existing.

kunzmi commented 6 years ago

The array aDest is not pinned here by managedCuda because a .net array gets automatically pinned and released by the CLR when going through P/Invoke. The generic arrays only get manually pinned in order to avoid too many definitions of the cuMemcpyDtoH function, replacing them with a single IntPtr. My first thought when reading your post was that most likely the allocated size of aSource doesn't match aDest, but this should be properly handled by the Cuda API. What happens if you use CopyToHost&lt;float&gt;() instead? Does it work as it should?
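For illustration, the kind of manual pinning the generic copy path performs can be sketched like this (a hypothetical standalone helper, not managedCuda's actual code; `WithPinned` and `nativeCall` are made up for the example):

```csharp
using System;
using System.Runtime.InteropServices;

class PinningSketch
{
    // Pin the array so the GC cannot move it while a native function
    // holds its address, then release the pin afterwards.
    static IntPtr WithPinned<T>(T[] array, Func<IntPtr, IntPtr> nativeCall) where T : struct
    {
        GCHandle handle = GCHandle.Alloc(array, GCHandleType.Pinned);
        try
        {
            return nativeCall(handle.AddrOfPinnedObject());
        }
        finally
        {
            handle.Free(); // the GC may move the array again from here on
        }
    }

    static void Main()
    {
        float[] data = new float[4];
        IntPtr p = WithPinned(data, ptr => ptr); // a real caller would P/Invoke here
        Console.WriteLine(p != IntPtr.Zero);
    }
}
```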

For debugging and checking I'd add something like:

float[] dest = new float[100]; //your destination array of desired size
CUdeviceptr source = xyz; //your device pointer allocated somewhere

//Create a CudaDeviceVariable for size checking:
CudaDeviceVariable<float> test = new CudaDeviceVariable<float>(source); //gets array size through Cuda API

if (test.Size != 100) throw new Exception("Size mismatch!"); //or complain however you like

//copy directly using CudaDeviceVariable
test.CopyToHost(dest);

//copy using CudaContext-method:
ctx.CopyToHost(dest, source);

//test is not the owner of source, so no need to dispose or what so ever...

mrakgr commented 6 years ago

Before I do as you suggest, let me bring up a point that has been bothering me for about two years now, since the situation is certainly appropriate for it.

The DeviceToHost function has always been suspicious to me. Back when I was making my very first ML library, I ran into some issues where I would transfer a device array to host and, when I printed it out, I found that some of the time half of its fields would be zeroed out. I managed to 'solve' that issue by putting DeviceSynchronize after the call, but it always bothered me a little because the transfer-to-host function was already supposed to be synchronous.

But if I assume that it was due to the GC yanking the array in the middle of the transfer, then that experience starts to make sense.

Back then I was using the CudaDeviceVariable class, so I'll start by checking whether that class's methods do pinning.

Let me go through your suggestions.

mrakgr commented 6 years ago

It seems that assigning the blame to the float32 function was premature. The function doing the corruption is in fact the generic one, which copies to a bool array.

At any rate, I've figured it out. You were right about this being about array lengths.

open ManagedCuda
open ManagedCuda.BasicTypes

let ctx = new CudaContext()

let size = 10240
let x = new CudaDeviceVariable<bool>(SizeT size)
x.SizeInBytes |> printfn "%A" // is size * 4

for i=1 to 10 do
    let y = Array.zeroCreate<bool>(size*3) // It does work with x4 though, but not with x1, x2 and x3 as shown here.
    x.CopyToHost(y)

F# arrays use 1 byte per bool, but here the CudaDeviceVariable clearly uses 4 bytes per bool. The error does not show up when the array is really small, but it does for larger sizes. I wonder why CudaDeviceVariable takes 4 bytes per bool?
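The mismatch is easy to demonstrate in isolation. This small standalone C# program (my own illustration, not library code) compares the managed `sizeof` with `Marshal.SizeOf` for `bool` and `char`:

```csharp
using System;
using System.Runtime.InteropServices;

class BoolSizes
{
    static void Main()
    {
        // Managed element size (what a C#/F# array stores per element)
        // versus the size Marshal.SizeOf reports for unmanaged memory.
        Console.WriteLine(sizeof(bool));                 // managed: 1 byte
        Console.WriteLine(Marshal.SizeOf(typeof(bool))); // marshalled: 4 bytes
        Console.WriteLine(sizeof(char));                 // managed: 2 bytes (UTF-16)
        Console.WriteLine(Marshal.SizeOf(typeof(char))); // marshalled: 1 byte
    }
}
```

Allocating with the marshalled size while copying into an array laid out with the managed size is exactly the overrun described above.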

mrakgr commented 6 years ago

I can't replicate the CopyToHost synchronization bug I had two years ago - maybe it got fixed in the interim? At any rate, the cause of this issue is crystal clear to me now.

public CudaDeviceVariable(SizeT size)
{
    _devPtr = new CUdeviceptr();
    _size = size;
    _typeSize = (uint)Marshal.SizeOf(typeof(T));

    CUResult res;
    res = DriverAPINativeMethods.MemoryManagement.cuMemAlloc_v2(ref _devPtr, _typeSize * size);
    Debug.WriteLine(String.Format("{0:G}, {1}: {2}", DateTime.Now, "cuMemAlloc", res));
    if (res != CUResult.Success) throw new CudaException(res);
    _isOwner = true;
}

It is because it implicitly assumes that the managed and unmanaged element sizes are the same. sizeof&lt;bool&gt; gives me 1 on the F# side, and I am representing booleans as chars on the Cuda side.

The Marshal.SizeOf function is also used in CudaContext's CopyToHost and in various other places. Maybe it would be worth reconsidering that design choice at this point? Though since I have my own language, I should just write my own transfer functions.

Sorry to take up your time. I wish this were the end of my issues with the library I am making, but I have other bugs to catch. Hopefully nothing else will crop up with regards to ManagedCuda.

kunzmi commented 6 years ago

Well, bool is problematic: it's not a blittable type, meaning it gets copied for P/Invoke rather than pinned, and it gets copied element by element with conversion, not as the complete array at once... bool can also have different sizes depending on your C compiler: usually bool is the same as int and thus 4 bytes long (a std::vector&lt;bool&gt; is even allowed to use one bit per value...). So long story short: don't use bool in device and host code; replace it with int or, if you want to save space, byte (char in Cuda). As it seems, bool is even different in C# than in F#, but there's also a difference between getting the size with the sizeof() operator (managed .net size) and with Marshal.SizeOf() (size in unmanaged memory after marshalling).

ManagedCuda was designed so that only blittable types (float, int, etc...) can be used for CudaDeviceMemory. There's no compile-time check to disallow types like bool, but if you only use blittable types, sizeof() and Marshal.SizeOf() are per definition identical. I didn't want to introduce size and array-length checks in each copy method for performance reasons.

mrakgr commented 6 years ago

I don't think bool differs between C# and F#; in both languages it is an alias for System.Boolean. But I think I will take your advice and keep booleans from crossing language boundaries.

Edit: Chars also seem to be 2 bytes on the managed side and 1 byte when marshalled, so that is one more type to restrict.

mrakgr commented 6 years ago

I was not sure whether some types get extra padding in arrays, so I asked how to get the exact size just to be sure. There probably won't be a case where this is needed, since sizeof should suffice, but it can never hurt to know.

I do not know what you intend to do about this issue, but I suggest forbidding types whose managed and unmanaged sizes do not match in CudaDeviceVariable and in the generic interop functions. Alternatively, it might be possible to make specialized functions for bool and char to get around this. Personally, I'd do the former, as the latter puts you on the hook for literally every possible type, and you'd need the former anyway in that case.
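The suggested restriction could be sketched as a runtime guard comparing the managed size against the marshalled size (a hypothetical helper, assuming `Unsafe.SizeOf` from System.Runtime.CompilerServices is available; the names here are made up for the example):

```csharp
using System;
using System.Runtime.InteropServices;

class BlittableGuard
{
    // Reject any element type whose managed size differs from its
    // marshalled size (bool: 1 vs 4, char: 2 vs 1).
    static void EnsureSizesMatch<T>() where T : struct
    {
        int managed = System.Runtime.CompilerServices.Unsafe.SizeOf<T>();
        int marshalled = Marshal.SizeOf(typeof(T));
        if (managed != marshalled)
            throw new NotSupportedException(
                $"{typeof(T).Name}: managed size {managed} != marshalled size {marshalled}");
    }

    static void Main()
    {
        EnsureSizesMatch<float>(); // fine: 4 == 4, no exception
        try { EnsureSizesMatch<bool>(); }
        catch (NotSupportedException) { Console.WriteLine("bool rejected"); }
    }
}
```

As kunzmi notes below, such a check costs something at runtime, so whether it belongs in every copy method is a design trade-off.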

kunzmi commented 6 years ago

char in .net is not the same as char in C/C++; char in .net is a character with an encoding larger than a byte. The .net equivalent of char in C/C++ would be sbyte or byte (signed / unsigned). The problem with these types is not padding: they have a completely different memory layout and need to be converted one by one by the CLR in order to cross the P/Invoke boundary!

I've been aware of this problem since the first days of managedCuda. Until now, and for the moment, I won't do anything about this issue, as I simply don't know how to fix it. I don't want boundary checks at runtime for performance reasons, and I don't know how to test for blittable types at compile time. In C# I can restrict CudaDeviceVariable to struct types, but .net doesn't expose the interop behaviour: whether an array gets pinned and only a pointer is passed to a native library, or whether the data is copied and converted under the hood, is hidden. For the moment I can just tell people not to use non-blittable types (like bool or char in .net). This is general advice for all P/Invoke usage, but it applies especially to Cuda: these types all get converted by the CLR on the CPU, so why use the GPU at all if we lose all the performance gain here?

Basically, the only types one can use are sbyte, byte, short, ushort, int, uint, long, ulong, float, double, and combinations of those in a struct. But I can't restrict to these primitive basic types without ruling out those combinatorial structs...
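For example (a hypothetical struct, not part of managedCuda), a struct composed only of blittable primitives remains blittable itself, and its marshalled size is just the sum of its fields:

```csharp
using System;
using System.Runtime.InteropServices;

// Three floats with sequential layout: managed and unmanaged
// representations agree, so the struct is safe to copy via P/Invoke.
[StructLayout(LayoutKind.Sequential)]
struct Float3
{
    public float X, Y, Z;
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(Marshal.SizeOf(typeof(Float3))); // 3 * 4 bytes
    }
}
```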

mrakgr commented 6 years ago

Fair enough. I've already imposed this restriction at compile time in Spiral, but simply telling users to be careful about passing non-blittable types would be a pragmatic way of dealing with this in .NET languages.

Just so you know, this is actually the first time I've heard of blittable types. It occurred to me that there might be differences between some types in .NET land and elsewhere, but I sort of assumed ManagedCuda would take care of it, since there was no type error at compile time.

I don't think I've seen a warning anywhere in the documentation not to pass bools and chars, so highlighting that pitfall would be good so nobody else wanders into it by accident.