Open mrakgr opened 6 years ago
The array aDest is not pinned here by managedCuda because a .NET array is automatically pinned and released by the CLR when it goes through P/Invoke. The generic arrays are pinned manually only in order to avoid too many definitions of the cuMemcpyDtoH function and to replace them with a single one that takes an IntPtr.
My first thought when reading your posts was that the allocated size of aSource most likely doesn't match aDest. But shouldn't that be handled properly by the Cuda API?
What happens if you use CopyToHost<float>() instead? Does that work as it should?
For debugging and checking I'd add something like:
float[] dest = new float[100]; //your destination array of desired size
CUdeviceptr source = xyz; //your device pointer allocated somewhere
//Create a CudaDeviceVariable for size checking:
CudaDeviceVariable<float> test = new CudaDeviceVariable<float>(source); //gets array size through Cuda API
if (test.Size != 100) throw new InvalidOperationException("unexpected device array size"); //or complain in some other way
//copy directly using CudaDeviceVariable
test.CopyToHost(dest);
//copy using CudaContext-method:
ctx.CopyToHost(dest, source);
//test is not the owner of source, so there is no need to dispose it
Before I do as you suggest, let me just bring up a point that has been bothering me for about two years now as the situation is certainly appropriate for it.
The DeviceToHost function has always been suspicious to me. Back when I was making the very first ML library, I ran into some issues where I would transfer a device array to host and, when I printed it out, I found that some of the time half of its fields would be zeroed out. I managed to 'solve' that issue by putting DeviceSynchronize after the call, but that always bothered me a little because the transfer to host function was already supposed to be synchronous.
But if I assume that it was due to the GC yanking the array in the middle of the transfer, then that experience starts to make sense.
Back then I was using the CudaDeviceVariable class, so I'll start by checking whether that class's functions do any pinning.
Let me go through your suggestions.
It seems I blamed the float32 function too early. The function doing the corruption is in fact the generic one, which copies to a bool array.
At any rate, I've figured it out. You were right about this being about array lengths.
open ManagedCuda
open ManagedCuda.BasicTypes
let ctx = new CudaContext()
let size = 10240
let x = new CudaDeviceVariable<bool>(SizeT size)
x.SizeInBytes |> printfn "%A" // prints size * 4
for i = 1 to 10 do
    let y = Array.zeroCreate<bool> (size * 3) // It works with size * 4, but not with size * 1, size * 2 or size * 3 as shown here.
    x.CopyToHost(y)
F# arrays are 1 byte per bool, but here the CudaDeviceVariable array is clearly 4 bytes per bool. The error does not show up when the array is really small, but it does with larger sizes. I wonder why the CudaDeviceVariable takes 4 bytes per bool?
I can't replicate the CopyToHost synchronization bug I had two years ago. Maybe it got fixed in the interim? At any rate, the cause of this issue is crystal clear to me now.
public CudaDeviceVariable(SizeT size)
{
    _devPtr = new CUdeviceptr();
    _size = size;
    _typeSize = (uint)Marshal.SizeOf(typeof(T));
    res = DriverAPINativeMethods.MemoryManagement.cuMemAlloc_v2(ref _devPtr, _typeSize * size);
    Debug.WriteLine(String.Format("{0:G}, {1}: {2}", DateTime.Now, "cuMemAlloc", res));
    if (res != CUResult.Success) throw new CudaException(res);
    _isOwner = true;
}
This is because the constructor implicitly assumes that the element size of managed and unmanaged arrays is the same. sizeof<bool> gives me 1 on the F# side, and I am representing booleans using chars on the Cuda side.
The Marshal.SizeOf function is also used in CudaContext's CopyToHost and in various other places. Maybe it would be worth reconsidering that design choice? Though since I have my own language, I should just write my own transfer functions at this point.
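To make the mismatch concrete, here is a minimal C# sketch that uses only the standard library (no ManagedCuda calls). The class and method names are made up for illustration. It compares the size of an element in a managed array (sizeof) with the size the marshaller reports (Marshal.SizeOf), and then repeats the byte arithmetic from the repro above.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SizeComparison
{
    // Bytes an allocation of `length` bools takes when sized with Marshal.SizeOf,
    // as the quoted constructor does.
    public static long MarshalledBoolBytes(int length)
    {
        return (long)length * Marshal.SizeOf(typeof(bool));
    }

    // Bytes actually available in a managed bool[] of `length` elements.
    public static long ManagedBoolBytes(int length)
    {
        return (long)length * sizeof(bool);
    }

    public static void Main()
    {
        // Managed element size vs. default marshalled size.
        Console.WriteLine("bool:  {0} vs {1}", sizeof(bool), Marshal.SizeOf(typeof(bool)));   // 1 vs 4
        Console.WriteLine("char:  {0} vs {1}", sizeof(char), Marshal.SizeOf(typeof(char)));   // 2 vs 1
        Console.WriteLine("float: {0} vs {1}", sizeof(float), Marshal.SizeOf(typeof(float))); // 4 vs 4

        // Same numbers as the repro: the device side holds 4 bytes per bool.
        const int size = 10240;
        long deviceBytes = MarshalledBoolBytes(size);
        for (int factor = 1; factor <= 4; factor++)
        {
            long hostBytes = ManagedBoolBytes(size * factor);
            Console.WriteLine("host bool[size * {0}]: {1} bytes, device: {2} bytes, fits: {3}",
                factor, hostBytes, deviceBytes, hostBytes >= deviceBytes);
        }
    }
}
```

Only the host array of size * 4 elements is large enough to receive the device allocation, which matches what the repro showed.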
Sorry to take up your time. I wish this were the end of my issues with the library I am making, but I have other bugs to catch. Hopefully nothing else will crop up with regards to ManagedCuda.
Well, bool is problematic: it's not a blittable type, meaning it gets copied for P/Invoke rather than pinned. And it gets copied element by element with conversion, not as a complete array at once...
bool can also have different sizes depending on your C compiler: usually bool is the same as int and therefore 4 bytes long (a std::vector
ManagedCuda was designed so that only blittable types (float, int, etc.) can be used for CudaDeviceMemory. There's no compile-time check that disallows types like bool, but if you only use blittable types, sizeof() and Marshal.SizeOf() are by definition identical. I didn't want to introduce size and array length checks in each copy method for performance reasons.
I don't think bool is different between C# and F#, as in both languages it is an alias for System.Boolean, but I think I will take your advice and keep booleans from crossing language boundaries.
Edit: Also, chars seem to be 2 bytes on the managed side and 1 byte when marshalled, so that is one other thing to restrict.
I was not sure whether there are some types that get extra padding in arrays, so I asked how to get the exact size just to be sure. Probably there won't be a case where this is needed because sizeof should suffice, but it can never hurt to know.
I do not know what you intend to do with regard to this issue, but I suggest forbidding those types whose managed and unmanaged sizes do not match in CudaDeviceVariable and in those generic functions that do interop. Alternatively, it might be possible to make specialized functions for bool and char to get around this issue. Personally, I'd do the former, as the latter puts you on the hook for literally every possible type and you'd need to do the former anyway in that case.
char in .NET is not the same as char in C/C++: char in .NET is a character with an encoding larger than a byte. The equivalent of the C/C++ char in .NET would be sbyte or byte (signed / unsigned). The problem with these types is not the padding: they have a completely different memory layout and need to be converted one by one by the CLR in order to cross the P/Invoke barrier!
I've been aware of this problem since the first days of managedCuda. For the moment I won't do anything about it, as I simply don't know how to fix it. I don't want boundary checks at runtime for performance reasons, and I don't know how to test at compile time for blittable types. In C# I can restrict CudaDeviceVariable to struct types only, but .NET doesn't expose the interop behaviour: whether an array gets pinned and only a pointer is passed to the native library, or whether the data is copied and converted under the hood, is hidden. For the moment I can only tell people not to use non-blittable types (like bool or char in .NET). This is general advice for all P/Invoke usage, but it applies especially to Cuda: these types all get converted by the CLR on the CPU, so why use the GPU at all if we lose all the performance gain here?
Basically the only types one can use are sbyte, byte, short, ushort, int, uint, long, ulong, float, double and combinations of those in a struct. But I can't restrict to these primitive basic types without interfering with these combinatorial structs...
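One possible middle ground between no check at all and a check in every copy call is a size comparison that runs once per element type. The sketch below is hypothetical and not part of ManagedCuda: ElementSize<T> and Float2 are illustration names, and it assumes Unsafe.SizeOf<T>() is available (built into recent .NET versions, or through the System.Runtime.CompilerServices.Unsafe package on older ones). Equal sizes are a necessary condition for a type to be safe to copy as raw bytes, not a proof that it is blittable.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical helper; nothing like this exists in ManagedCuda.
public static class ElementSize<T> where T : struct
{
    // Computed once per closed generic type by the static initializer,
    // so later callers only read a static field.
    public static readonly bool ManagedMatchesMarshalled = Compare();

    private static bool Compare()
    {
        try
        {
            // Managed in-memory size vs. size reported by the marshaller.
            return Unsafe.SizeOf<T>() == Marshal.SizeOf(typeof(T));
        }
        catch (ArgumentException)
        {
            // Marshal.SizeOf rejects types it cannot treat as unmanaged structures.
            return false;
        }
    }

    public static void ThrowIfMismatch()
    {
        if (!ManagedMatchesMarshalled)
            throw new NotSupportedException(
                typeof(T).Name + " has different managed and marshalled sizes.");
    }
}

// A struct built only from blittable primitives.
public struct Float2
{
    public float X;
    public float Y;
}

public static class Demo
{
    public static void Main()
    {
        Console.WriteLine(ElementSize<float>.ManagedMatchesMarshalled);  // True
        Console.WriteLine(ElementSize<Float2>.ManagedMatchesMarshalled); // True
        Console.WriteLine(ElementSize<bool>.ManagedMatchesMarshalled);   // False (1 vs 4)
        Console.WriteLine(ElementSize<char>.ManagedMatchesMarshalled);   // False (2 vs 1)
    }
}
```

A constructor or copy method could call ElementSize<T>.ThrowIfMismatch() at its start. Note that the C# `unmanaged` generic constraint would not help here, since both bool and char satisfy it.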
Fair enough. I've already imposed this restriction at compile time in Spiral, but simply telling users to be careful about passing non-blittable types would be a pragmatic way of dealing with this in .NET languages.
Just so you know, this is actually my first time hearing about blittable types. It occurred to me that there might be differences between some types in .NET land and elsewhere, but I sort of assumed that ManagedCuda would take care of it since there was no type error at compile time.
I do not think I've seen a warning anywhere in the documentation not to pass bools and chars, so highlighting that pitfall would be good so nobody else wanders into it by accident.
The generic one does, but the autogenerated ones like the above don't. This actually gave me quite a lot of trouble. Please fix this as soon as possible.
Also, if you've got the time, check out the Spiral language. At this point I am a long-time user of ManagedCuda and I am using it for all the Cuda interop, so this is as good a time as any to say thanks. The decision of whether to compile to .NET would have been a lot harder for me without it existing.