m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net
Other
1.38k stars 117 forks source link

RNG<XorShift64Star> corrupts data in ArrayView #1030

Closed uler3161 closed 1 year ago

uler3161 commented 1 year ago

I've been trying to implement MergeShuffle with ILGPU and the Cuda accelerator. Worked ok (though quite slow) with smaller arrays. But was getting an illegal memory access was encoutered messages on larger arrays. I was eventually able to narrow it down to having something to do with the random number generator I'm using.

Here is an example program that causes issues. You may need to try different values for Size in order to duplicate:

public class CorruptionTest : IDisposable
    {
        private const int Size = 70000;
        private readonly Accelerator _accelerator;
        private readonly Context _context;
        private readonly Action<Index1D, RNGView<XorShift64Star>, ArrayView<int>> _kernel;
        private readonly MemoryBuffer1D<int, Stride1D.Dense> _lookups;
        private readonly RNG<XorShift64Star> _random;

        public CorruptionTest()
        {
            _context = BuildContext();
            _accelerator = _context.CreateCudaAccelerator(0);
            _random = BuildRandom();

            var lookupArray = new[] { 1 };
            _lookups = _accelerator.Allocate1D(lookupArray);

            _kernel = BuildKernel();
        }

        public void Dispose()
        {
            _random.Dispose();
            _lookups.Dispose();
            _accelerator.Dispose();
            _context.Dispose();
        }

        public void Execute()
        {
            var rngView = _random.GetView(_accelerator.WarpSize);
            _kernel(Size, rngView, _lookups.View);
            _accelerator.Synchronize();
        }

        private static Context BuildContext()
        {
            var builder = Context.Create();
            builder.IOOperations()
                .AllAccelerators();

            return builder.ToContext();
        }

        private Action<Index1D, RNGView<XorShift64Star>, ArrayView<int>> BuildKernel()
        {
            return _accelerator.LoadAutoGroupedStreamKernel((
                Index1D index,
                RNGView<XorShift64Star> rng,
                ArrayView<int> lookups) =>
            {
                Interop.WriteLine("{0}", lookups[0]);
                rng.NextLong();
            });
        }

        private RNG<XorShift64Star> BuildRandom()
        {
            var random = new Random((int)DateTime.Now.Ticks);
            return RNG.Create<XorShift64Star>(_accelerator, random);
        }
    }

What I'm expecting is this should print out 1 70000 times. Instead, it prints 1 for awhile and then it seems to change to some random number such as 33554433.

Commenting out the rng.NextLong() line fixes the issue. It's almost as if the random number generator is trying to write data somewhere out of bounds and it just happens to be writing into the lookups ArrayView.

m4rs-mt commented 1 year ago

Hi @uler3161 and welcome to the ILGPU community. Regarding your performance issues you mentioned: Please note that ILGPU needs time to compile your kernel and load it into driver memory for execution the first time the kernel delegate is invoked. Usually, more than 80% of the time required to load the kernel is related to driver overhead. To get reliable measurements we recommend a few warmup runs before conducting your actual performance measurements. Assuming that your kernel is memory bound, you may want to enable O2 optimizations to benefit from IO performance improvements.

Coming back to your reported issue. Indeed we got similar reports in the past about the RNG class which uses low-level optimizations to reduce memory consumption in favor of shared RNG states across members in a warp. Unfortunately the RNGView structure was never intended to be used in combination with AutoGrouped kernels, as it cannot be guaranteed that the number of launched threads is a multiple of the warp size and/or the group size. Important is also the maximum number of parallel threads used to access the RNGView structure during runtime. The RNG.Create function has an overload the takes the maximum number of parallel warps in flight at runtime to avoid invalid memory accesses, and thus corruption. By default, the maximum number of parallel warps is equal to the MaxNumThreads for the current accelerator divided by the WarpSize.

Depending on your use case you need to ensure that the number of processing elements is a multiple of the group size and/or the warp size. Alternatively, you may want to experiment with the newly added ThreadWiseRNG class we just added in PR #1022. This does not use any optimizations and stores a single RNG instance per parallel thread.

uler3161 commented 1 year ago

@m4rs-mt thanks for all the info. As for the performance, at one point I did do a single initial execution outside of my timing loop and still seemed slow. I guess I could do multiple and see if it helps. But also, I haven't had the ability to test with a larger array as I think that might be where I'll see it perform well. Or it could just be I'm not coding it quite right.

I haven't looked at ThreadWiseRNG yet, but I did try to follow your other suggestions and I have it working better, but still having the same problem.

First thing I wasn't sure on was what you meant by group size. I'm new enough to ILGPU and Cuda in general that I don't have a firm grip on the terminology. Is this the value reported by Accelerator.MaxNumThreadsPerGroup? Or is this a value I need to be getting some other way?

Next is how I should be creating the kernel. It sounds like LoadAutoGroupedStreamKernel is something I shouldn't use regardless of whether I'm processing something sized to a multiple of group size and/or warp size? I changed my example code to use LoadStreamKernel instead. Is that ok or do I need to use something else?

After using LoadStreamKernel, I changed the Size value from 70000 to 128000. Warp size is being reported as 32. MaxNumThreadsPerGroup is reported as 1024, so I was thinking 128000 is a good value. At this point I was still getting the issue.

Next, I changed the RNG.Create call and passed in MaxNumThreads to that overloaded method you mentioned. I was able to run a much larger Size value, but a large enough value still ends in failure.

Here's my current test code:

public class CorruptionTest : IDisposable
{
    private const int Size = 16777216;
    private readonly Accelerator _accelerator;
    private readonly Context _context;
    private readonly Action<KernelConfig, RNGView<XorShift128Plus>, ArrayView<int>> _kernel;
    private readonly MemoryBuffer1D<int, Stride1D.Dense> _lookups;
    private readonly RNG<XorShift128Plus> _random;

    public CorruptionTest()
    {
        _context = BuildContext();
        _accelerator = _context.CreateCudaAccelerator(0);
        _random = BuildRandom();

        var lookupArray = new[] { 1 };
        _lookups = _accelerator.Allocate1D(lookupArray);

        _kernel = BuildKernel();
    }

    public void Dispose()
    {
        _random.Dispose();
        _lookups.Dispose();
        _accelerator.Dispose();
        _context.Dispose();
    }

    public void Execute()
    {
        Console.WriteLine($"Warp Size: {_accelerator.WarpSize}");
        Console.WriteLine($"Group Size: {_accelerator.MaxNumThreadsPerGroup}");

        var rngView = _random.GetView(_accelerator.WarpSize);

        var groupSize = _accelerator.MaxNumThreadsPerGroup;
        KernelConfig launchDimension = ((Size + groupSize - 1) / groupSize, groupSize);

        _kernel(launchDimension, rngView, _lookups.View);
        _accelerator.Synchronize();
    }

    private static Context BuildContext()
    {
        var builder = Context.Create();
        builder.IOOperations()
            .AllAccelerators();

        return builder.ToContext();
    }

    private Action<KernelConfig, RNGView<XorShift128Plus>, ArrayView<int>> BuildKernel()
    {
        return _accelerator.LoadStreamKernel<RNGView<XorShift128Plus>, ArrayView<int>>(TestKernel);
    }

    private RNG<XorShift128Plus> BuildRandom()
    {
        var random = new Random((int)DateTime.Now.Ticks);
        return RNG.Create<XorShift128Plus>(_accelerator, random, _accelerator.MaxNumThreads);
    }

    private static void TestKernel(
        RNGView<XorShift128Plus> rng,
        ArrayView<int> lookups)
    {
        rng.Next();
        if (lookups[0] != 1)
        {
            Interop.WriteLine("{0}", lookups[0]);
        }
    }
}
uler3161 commented 1 year ago

I decided to go the route of using ThreadWiseRNG. Seems to have fixed the problem. Unfortunately it appears my MergeShuffle algorithm is very slow compared to just running Fisher-Yates with standard C# code and in one thread. I did try the O2 optimization, but it actually resulted in taking about 3x longer.

uler3161 commented 1 year ago

I've given up trying to get my algorithm to run fast. Not sure why it wouldn't work, but I came up with another shuffling algorithm that seems to work well, so I'm going that route. Thanks for the help.