kunzmi / managedCuda

ManagedCUDA aims at an easy integration of NVidia's CUDA into .net applications written in C#, Visual Basic or any other .net language.

AccessViolationException when copying to device #59

Closed pavlexander closed 6 years ago

pavlexander commented 6 years ago

Hi

I have a rather strange issue: I am unable to copy a boolean array to the device (or read from it). So far this is only reproducible for boolean types. I am able to copy a smaller bool array just fine, though.

The code:

    static void Main(string[] args)
    {
        Console.WriteLine("start");

        CudaContext cntxt = new CudaContext();
        CUmodule cumodule = cntxt.LoadModule(@"kernel.ptx");
        CudaKernel myKernel = new CudaKernel("kernel", cumodule, cntxt);

        bool[] bools = new bool[20000];
        var dev = new CudaDeviceVariable<bool>(bools.Length);

        // System.AccessViolationException: 'Attempted to read or write protected memory. This is often an indication that other memory is corrupt.'
        dev.CopyToDevice(bools); // here

        Console.WriteLine("end");
        Console.ReadKey();
    }

Exception: System.AccessViolationException: 'Attempted to read or write protected memory. This is often an indication that other memory is corrupt.'

Stack trace:

    at ManagedCuda.DriverAPINativeMethods.SynchronousMemcpy_v2.cuMemcpyHtoD_v2(CUdeviceptr dstDevice, IntPtr srcHost, SizeT ByteCount)
    at ManagedCuda.CudaDeviceVariable`1.CopyToDevice(T[] source) in i:\ManagedCuda\managedCuda\ManagedCUDA\CudaDeviceVariable.cs:line 327
    at Bug_managedCuda_c_001.Program.Main(String[] args) in c:\users\lucky\source\repos\Bug_managedCuda_001\Bug_managedCuda_c_001\Program.cs:line 25

I initially stumbled upon this issue on the trunk version of the code because I wanted to use CUDA 9. However, the exact same issue is also reproducible with the "_80" version from NuGet. For the above test I used the mentioned "_80" version with a ptx file generated in a CUDA 9 project. I don't think the ptx matters at this point, since I never get to the actual execution of the code on the GPU; I am just trying to allocate space on the device.

I am using a GeForce 1080 Ti graphics card with 11 GB of memory, so allocating a boolean array of such a small size (20000) should not be an issue. It is a .Net Framework 4.6/4.7.1 console app, compiled both as "Any CPU" and as "x64", with "Prefer 32-bit" unchecked.

If any additional info is needed please let me know.

kunzmi commented 6 years ago

This is because boolean is not a blittable type: the size of one boolean value is not the same in C# as in C/C++/Cuda. Instead of boolean, use byte in C# and unsigned char in Cuda, or int in both. Unfortunately there's no API in .net to ensure that only blittable types are used for CudaDeviceVariable...
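For illustration, a minimal sketch of the byte-based workaround (it reuses the kernel name and ptx path from the snippet above; mapping between bool[] and byte[] via Array.ConvertAll is just one possible approach):

    static void Main(string[] args)
    {
        CudaContext cntxt = new CudaContext();
        CUmodule cumodule = cntxt.LoadModule(@"kernel.ptx");
        CudaKernel myKernel = new CudaKernel("kernel", cumodule, cntxt);

        bool[] bools = new bool[20000];

        // convert to the blittable byte type before crossing the managed/native boundary
        byte[] bytes = Array.ConvertAll(bools, b => b ? (byte)1 : (byte)0);

        var dev = new CudaDeviceVariable<byte>(bytes.Length);
        dev.CopyToDevice(bytes);   // exactly one byte per element on both sides

        // ... run myKernel here; on the Cuda side the parameter would be unsigned char* ...

        dev.CopyToHost(bytes);
        bool[] result = Array.ConvertAll(bytes, b => b != 0);
    }

On the Cuda side the kernel parameter would then be declared as unsigned char* instead of bool*.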

pavlexander commented 6 years ago

Hi,

Thank you for the quick answer. However, I have noticed that the code, even in the way it is built now, executes correctly even when I use booleans as parameters. I can, of course, switch to byte as you mentioned. But before that...

I have prepared a very interesting scenario. Can you please take a look once again at the issue?

For example, this code works fine:

    public static class InOut
    {
        public static bool[] input = new bool[20_000];

        static InOut()
        {
            // comment out this line here:
            input = new bool[20_000];
        }
    }

    class Program
    {
        static CudaDeviceVariable<bool> bools_on_device;

        static void Main(string[] args)
        {
            Console.WriteLine("start");

            CudaContext cntxt = new CudaContext();
            CUmodule cumodule = cntxt.LoadModule(@"kernel.ptx");
            CudaKernel myKernel = new CudaKernel("test", cumodule, cntxt);

            bool[] bools_on_host = new bool[20_000];
            bools_on_device = new CudaDeviceVariable<bool>(InOut.input.Length); // use bools_on_host.Length instead

            bools_on_device.CopyToDevice(bools_on_host); // exception here

            myKernel.Run(bools_on_device.DevicePointer);

            bools_on_device.CopyToHost(bools_on_host);

            Console.WriteLine("end");
            Console.ReadKey();
        }
    }

However, if you comment out "input = new bool[20_000];" in the static initializer, then the exception is thrown! Alternatively, if you replace "InOut.input.Length" with "bools_on_host.Length", the exception is also thrown. If the code is executed without either of those 2 modifications, I get the correct result from the GPU.

The .cu code is the following:

extern "C"
{
    __global__ void test(bool* __restrict__  output)
    {
        int index = threadIdx.x;
        output[index] = (index % 2 == 0);
    }
}

The result is: true, false, true, false, ...

To me, it's not exactly clear how my 2 modifications can possibly affect the memory allocation to/from the device. It looks like some kind of initialization timing issue in .net itself rather than an inability of CUDA to manage booleans internally. If you can confirm it, then I can reopen the issue in a more appropriate place :)

In any case, thank you for the support.

Edit: To support my finding even further, I did one more test.

If we use an array of size 200k instead of 20k, the code will basically be the following:

    public static class InOut
    {
        public static bool[] input = new bool[200_000];

        static InOut()
        {
            input = new bool[200_000];

            // without this line an exception will be thrown
            input = new bool[200_000];
        }
    }
    // also set the same size for bools_on_host

BUT the code will not work unless we put the line "input = new bool[200_000];" in there twice. This doesn't make any sense to me no matter how I look at it.

kunzmi commented 6 years ago

Whether your code works or not is completely random: inside .net a boolean value is 1 byte of memory, in C/C++/Cuda it is 1 or 4 bytes depending on the compiler (and maybe also on settings -> 32/64 bit?). If you transfer data from the managed .net world to the unmanaged native world, which is what happens in managedCuda when doing a CopyToDevice or CopyToHost, then .net converts every value in a boolean array to a four-byte value. This is standard .net behavior that I can't influence. Only the basic standard datatypes like int, float, byte, short, etc. get through this managed barrier without conversion; these are the so-called blittable types.

When working with blittable types, the garbage collector gets notified not to move the given array around for the duration of the native function call, and no conversion takes place under the hood. For non-blittable types, on the other hand, .net first allocates a temporary array, converts all array elements to a native form and then passes this temporary array to the native function. First of all, this is a huge performance hit, and further, the size of the converted datatype is different (it can be retrieved by calling Marshal.SizeOf and not sizeof; this is also the size managedCuda uses). As there are multiple conventions in the native world, there is no guarantee that the sizes match the native program.
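A quick way to see the size difference in question is the following minimal check (the SizeCheck class is just scaffolding for the snippet):

    using System;
    using System.Runtime.InteropServices;

    class SizeCheck
    {
        static void Main()
        {
            // size of bool as seen by managed code vs. the size the marshaller uses
            Console.WriteLine(sizeof(bool));                  // 1
            Console.WriteLine(Marshal.SizeOf(typeof(bool)));  // 4 -> non-blittable, gets converted

            // byte is blittable, both sizes agree
            Console.WriteLine(sizeof(byte));                  // 1
            Console.WriteLine(Marshal.SizeOf(typeof(byte)));  // 1
        }
    }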

If you now allocate memory somewhere on the heap, the native CopyToHost/CopyToDevice function may by chance read/write only allocated memory; if not, an exception is thrown. But as allocations may be moved around by the GC, this is an entirely random process. Also, the size of the allocations determines the timing of the GC's cleanup cycles... If your code works, it works only by chance!

In short: don't use non-blittable types when crossing the managed/unmanaged memory barrier. This is not a managedCuda-specific problem...