m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net

[QUESTION]: Random number generator yields unexpected results for larger number of threads #1250

Closed: AmosEgel closed this issue 1 month ago

AmosEgel commented 2 months ago

Question

First of all, thanks for providing and maintaining this beautiful and extra-helpful package!

I am trying to generate a histogram of random numbers. It works well as long as the total number of threads stays below a certain limit; for larger thread counts, the resulting histogram no longer shows an even distribution. I suspect I am constructing or using the random number generator incorrectly, and I would appreciate any hint as to what I am doing wrong.

Additional context

With the following kernel I try to write random numbers into a histogram:

        static void HistogramKernel(Index1D index, RNGView<XorShift64Star> rng, ArrayView1D<int, Stride1D.Dense> view)
        {
            int numBins = (int)view.Length;
            float rand = rng.NextFloat();
            int histoIdx = (int)(numBins * rand);
            if (histoIdx == numBins) histoIdx -= 1;
            Atomic.Add(ref view[histoIdx], 1);
        }
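A note on the histoIdx clamp: depending on how a generator converts its integer state to a float, a NextFloat-style value can round to exactly 1.0, which would index one past the last bin. Here is a small Python sketch of that rounding hazard, emulating float32 via struct; the integer-to-float conversion scheme shown is an assumption for illustration, not necessarily what ILGPU does:

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python double to the nearest IEEE-754 float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

# One common way to turn a 32-bit integer state into a float in [0, 1):
u = 2**32 - 1               # largest possible 32-bit value
rand = to_f32(u / 2**32)    # in float32 this rounds UP to exactly 1.0
print(rand)                 # 1.0

num_bins = 100
histo_idx = int(num_bins * rand)  # 100: one past the last bin!
if histo_idx == num_bins:         # the clamp used in the kernel above
    histo_idx -= 1
print(histo_idx)                  # 99
```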

The kernel is called with this method:

        public static int[] GetHistogram(int numThreads, int numBins)
        {
            using (Context context = Context.CreateDefault())
            {
                Device d = null;
                foreach (Device dev in context)
                {
                    if (dev.Name == "Intel(R) UHD Graphics 620") d = dev;
                }
                using (Accelerator accelerator = d.CreateAccelerator(context))
                {
                    var random = new Random();
                    using (var rng = RNG.Create<XorShift64Star>(accelerator, random))
                    {
                        RNGView<XorShift64Star> rngView = rng.GetView(accelerator.WarpSize);
                        using (MemoryBuffer1D<int, Stride1D.Dense> histogramDevice = accelerator.Allocate1D<int>(numBins))
                        {
                            var kernel = accelerator.LoadAutoGroupedStreamKernel<
                                Index1D, RNGView<XorShift64Star>, ArrayView1D<int, Stride1D.Dense>>(HistogramKernel);
                            kernel(numThreads, rngView, histogramDevice.View);
                            int[] histogramHost = histogramDevice.GetAsArray1D();
                            return histogramHost;
                        }
                    }
                }
            }
        }

For numThreads = 10000 and numBins = 100, I get the following result (pretty much as expected):

(figure: histogram of the 100 bin counts, approximately uniform)

However, when I increase the number of threads to, e.g., numThreads = 100000, the result looks like this:

(figure: histogram of the 100 bin counts, visibly uneven)

The distribution is no longer even.

AmosEgel commented 2 months ago

Update As a workaround, I am trying the following:

The kernel reads:

static void HistogramKernel(Index1D index, ArrayView1D<uint, Stride1D.Dense> seeds, int numNumbers, ArrayView1D<int, Stride1D.Dense> view)
{
    XorShift64Star rng = new XorShift64Star(seeds[index]);
    int numBins = (int)view.Length;        
    for (int i = 0; i < numNumbers; i++)
    {
        float rand = rng.NextFloat();
        int histoIdx = (int)(numBins * rand);
        if (histoIdx == numBins) histoIdx -= 1;
        Atomic.Add(ref view[histoIdx], 1);
    }
}
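For intuition about why this works, here is a small Python sketch of the published xorshift64* recurrence (the algorithm behind XorShift64Star; the float conversion used here is a common convention and an assumption, not necessarily ILGPU's exact code). With one independent, non-zero state per simulated thread, the aggregate histogram comes out flat:

```python
MASK64 = (1 << 64) - 1
MULT = 0x2545F4914F6CDD1D

def xorshift64star(state):
    # One step of Vigna's xorshift64* generator; state must be non-zero.
    state ^= state >> 12
    state ^= (state << 25) & MASK64
    state ^= state >> 27
    return state, (state * MULT) & MASK64

def next_float(state):
    # Top 24 output bits mapped to [0, 1); a common convention, assumed here.
    state, out = xorshift64star(state)
    return state, (out >> 40) / (1 << 24)

num_threads, draws_per_thread, num_bins = 100, 1000, 100
histogram = [0] * num_bins
for seed in range(1, num_threads + 1):  # one independent state per "thread"
    state = seed
    for _ in range(draws_per_thread):
        state, r = next_float(state)
        histogram[int(num_bins * r) % num_bins] += 1

print(min(histogram), max(histogram))  # counts cluster around 1000
```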

This strategy seems to do what I need. I still don't understand what was wrong in my original attempt, though, so any explanation or correction would still be highly welcome.

By the way, I can imagine that the above workaround is not ideal in terms of runtime performance. That may not be a big problem, however: the actual program we want to run on the GPU does heavy computations, so the overhead of creating more RNG instances than necessary might not be significant in the end.

m4rs-mt commented 1 month ago

Hi @AmosEgel, welcome to the ILGPU community! I apologize for the delayed response; most of our team was off during the past weeks. Indeed, RNG<T> was never designed to be used from auto-grouped kernels. To address your use case, you may want to take a look at ThreadWiseRNG, available in 2.0beta1. This should solve your problem.
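To illustrate one plausible failure mode with a toy Python model (this is a sketch of the general hazard of sharing far fewer RNG states than threads, not a description of RNGView's exact internals): when states are not advanced safely per draw, every thread mapped to the same slot keeps seeing the same value, which concentrates the histogram in a few bins:

```python
import random

random.seed(42)  # deterministic toy example
num_states, num_threads, num_bins = 32, 100_000, 100

# One value per RNG slot; if states never advance between draws (the
# shared/racy case), every thread mapped to a slot re-reads the same value.
pool = [random.random() for _ in range(num_states)]

histogram = [0] * num_bins
for tid in range(num_threads):
    r = pool[tid % num_states]  # shared state, no per-thread advance
    histogram[int(num_bins * r)] += 1

nonzero_bins = sum(1 for count in histogram if count > 0)
print(nonzero_bins)  # at most 32 of the 100 bins are ever hit
```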

AmosEgel commented 1 month ago

Great, thanks a lot for the explanation and the hint to ThreadWiseRNG.