m4rs-mt / ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs
http://www.ilgpu.net

[QUESTION]: Random number generator yields unexpected results for larger number of threads #1250

Closed: AmosEgel closed this issue 1 month ago

AmosEgel commented 2 months ago

Question

First of all, thanks for providing and maintaining this beautiful and extra-helpful package!

I am trying to generate a histogram of random numbers. It works well as long as the total number of threads stays below a certain limit; for larger thread counts, the resulting histogram no longer shows an even distribution. I suspect I am constructing or using the random number generator incorrectly, and I would appreciate any hint as to what I am doing wrong.

Additional context

With the following kernel I try to write random numbers into a histogram:

        static void HistogramKernel(Index1D index, RNGView<XorShift64Star> rng, ArrayView1D<int, Stride1D.Dense> view)
        {
            int numBins = (int)view.Length;
            float rand = rng.NextFloat();
            int histoIdx = (int)(numBins * rand);
            if (histoIdx == numBins) histoIdx -= 1;
            Atomic.Add(ref view[histoIdx], 1);
        }
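A note on the histoIdx clamp: depending on how a generator converts its integer state to a float, a NextFloat-style value can round to exactly 1.0, which would index one past the last bin. Here is a small Python sketch of that rounding hazard, emulating float32 via struct; the integer-to-float conversion scheme shown is an assumption for illustration, not necessarily what ILGPU does:

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python double to the nearest IEEE-754 float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

# One common way to turn a 32-bit integer state into a float in [0, 1):
u = 2**32 - 1               # largest possible 32-bit value
rand = to_f32(u / 2**32)    # in float32 this rounds UP to exactly 1.0
print(rand)                 # 1.0

num_bins = 100
histo_idx = int(num_bins * rand)  # 100: one past the last bin!
if histo_idx == num_bins:         # the clamp used in the kernel above
    histo_idx -= 1
print(histo_idx)                  # 99
```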

The kernel is called with this method:

        public static int[] GetHistogram(int numThreads, int numBins)
        {
            using (Context context = Context.CreateDefault())
            {
                Device d = null;
                foreach (Device dev in context)
                {
                    if (dev.Name == "Intel(R) UHD Graphics 620") d = dev;
                }
                using (Accelerator accelerator = d.CreateAccelerator(context))
                {
                    var random = new Random();
                    using (var rng = RNG.Create<XorShift64Star>(accelerator, random))
                    {
                        RNGView<XorShift64Star> rngView = rng.GetView(accelerator.WarpSize);
                        using (MemoryBuffer1D<int, Stride1D.Dense> histogramDevice = accelerator.Allocate1D<int>(numBins))
                        {
                            var kernel = accelerator.LoadAutoGroupedStreamKernel<
                                Index1D, RNGView<XorShift64Star>, ArrayView1D<int, Stride1D.Dense>>(HistogramKernel);
                            kernel(numThreads, rngView, histogramDevice.View);
                            int[] histogramHost = histogramDevice.GetAsArray1D();
                            return histogramHost;
                        }
                    }
                }
            }
        }

For numThreads = 10000 and numBins = 100, I get the following result (pretty much as expected):

(figure: histogram of the 100 bin counts, approximately uniform)

However, when I increase the number of threads to, e.g., numThreads = 100000, the result looks like this:

(figure: histogram of the 100 bin counts, visibly uneven)

The distribution is no longer even.

AmosEgel commented 2 months ago

Update As a workaround, I am trying the following:

The kernel reads:

static void HistogramKernel(Index1D index, ArrayView1D<uint, Stride1D.Dense> seeds, int numNumbers, ArrayView1D<int, Stride1D.Dense> view)
{
    XorShift64Star rng = new XorShift64Star(seeds[index]);
    int numBins = (int)view.Length;        
    for (int i = 0; i < numNumbers; i++)
    {
        float rand = rng.NextFloat();
        int histoIdx = (int)(numBins * rand);
        if (histoIdx == numBins) histoIdx -= 1;
        Atomic.Add(ref view[histoIdx], 1);
    }
}
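For intuition about why this works, here is a small Python sketch of the published xorshift64* recurrence (the algorithm behind XorShift64Star; the float conversion used here is a common convention and an assumption, not necessarily ILGPU's exact code). With one independent, non-zero state per simulated thread, the aggregate histogram comes out flat:

```python
MASK64 = (1 << 64) - 1
MULT = 0x2545F4914F6CDD1D

def xorshift64star(state):
    # One step of Vigna's xorshift64* generator; state must be non-zero.
    state ^= state >> 12
    state ^= (state << 25) & MASK64
    state ^= state >> 27
    return state, (state * MULT) & MASK64

def next_float(state):
    # Top 24 output bits mapped to [0, 1); a common convention, assumed here.
    state, out = xorshift64star(state)
    return state, (out >> 40) / (1 << 24)

num_threads, draws_per_thread, num_bins = 100, 1000, 100
histogram = [0] * num_bins
for seed in range(1, num_threads + 1):  # one independent state per "thread"
    state = seed
    for _ in range(draws_per_thread):
        state, r = next_float(state)
        histogram[int(num_bins * r) % num_bins] += 1

print(min(histogram), max(histogram))  # counts cluster around 1000
```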

This strategy seems to do what I need. I still don't understand what was wrong in my original attempt, though, so any explanation or correction would still be highly welcome.

By the way, I can imagine that the above workaround is not ideal in terms of runtime performance. That may not be a big problem, however: the actual program we want to run on the GPU does heavy computations, so the overhead of creating more RNG instances than necessary might not be significant in the end.

m4rs-mt commented 1 month ago

Hi @AmosEgel, welcome to the ILGPU community! I apologize for the delayed response; most of our team was off during the past weeks. Indeed, RNG<T> was never designed to be used from auto-grouped kernels. To address your use case, you may want to take a look at ThreadWiseRNG, available in 2.0beta1. This should solve your problem.
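To illustrate one plausible failure mode with a toy Python model (this is a sketch of the general hazard of sharing far fewer RNG states than threads, not a description of RNGView's exact internals): when states are not advanced safely per draw, every thread mapped to the same slot keeps seeing the same value, which concentrates the histogram in a few bins:

```python
import random

random.seed(42)  # deterministic toy example
num_states, num_threads, num_bins = 32, 100_000, 100

# One value per RNG slot; if states never advance between draws (the
# shared/racy case), every thread mapped to a slot re-reads the same value.
pool = [random.random() for _ in range(num_states)]

histogram = [0] * num_bins
for tid in range(num_threads):
    r = pool[tid % num_states]  # shared state, no per-thread advance
    histogram[int(num_bins * r)] += 1

nonzero_bins = sum(1 for count in histogram if count > 0)
print(nonzero_bins)  # at most 32 of the 100 bins are ever hit
```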

AmosEgel commented 1 month ago

Great, thanks a lot for the explanation and the hint to ThreadWiseRNG.