mrc-ide / covid-sim

This is the COVID-19 CovidSim microsimulation model developed by the MRC Centre for Global Infectious Disease Analysis hosted at Imperial College, London.
GNU General Public License v3.0

[Medium] Switching to an external/more powerful random number generator #184

Closed Feynstein closed 3 years ago

Feynstein commented 4 years ago

This weekend I will try to branch from your project in order to change the random number generation algorithms, which seem to be a bit outdated. I will use this library: https://gitlab.cern.ch/CLHEP/CLHEP, adding it as a submodule to your git, and make sure the algorithm choice is a parameter. I might also add a Boost program-options part to deal with parameters... I haven't had time to check how they are handled. This might mean removing the Rand.cpp file and related methods, which would make the code more modular and of better overall quality. It might also positively affect performance, since this library has been going for a while and has nice optimizations. I don't know which C++ standard they use, so it might also mean moving up to C++11 or later if the project is not already there.

weshinsley commented 4 years ago

Thank you for opening this, and for your comments elsewhere. We appreciate your time and effort, as well as the things you are exploring in the code.

RANLIB has certainly been around for a while, and a review of our use of it is welcome. It may be that current practice in your field is ahead of ours.

The important things for us in any RNG are (1) speed, as we make heavy use of sampling; (2) a thread-safe RNG-per-thread, one way or another; and (3) repeatability from a given seed. The last two together mean we want to generate deterministic results from given seeds, even when running multi-threaded. There are always discussions about "how random is random", especially with different seeds per thread, and whether we can improve on RANLIB in that respect is a good question.

We then would typically do large numbers (10,000s) of runs with different seeds to look at the ensemble properties in terms of noise/stochasticity.
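The three requirements above can be illustrated with a small sketch. This is purely hypothetical code, not the model's: it uses `std::mt19937_64` rather than RANLIB, and the naive `base_seed + thread` offset is for illustration only (adjacent seeds can be statistically correlated; the model's `setall` instead leapfrogs each thread's seeds through the generator).

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch: one RNG per thread, each deterministically derived
// from a single base seed, so a run is repeatable and each thread owns its
// own state (no sharing, hence thread-safe by construction).
std::vector<std::mt19937_64> make_per_thread_rngs(std::uint64_t base_seed,
                                                  int num_threads) {
    std::vector<std::mt19937_64> rngs;
    rngs.reserve(num_threads);
    for (int t = 0; t < num_threads; ++t)
        // NOTE: illustrative seeding only; prefer std::seed_seq or
        // generator jump-ahead in real code.
        rngs.emplace_back(base_seed + static_cast<std::uint64_t>(t));
    return rngs;
}
```

Two runs with the same base seed then produce identical per-thread streams, which is the repeatability property discussed above.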

Feynstein commented 4 years ago

Nice, thank you for your comment. I knew there were real people behind this :-) 10,000 runs are a piece of cake; in particle physics we do a few million events in order to get better accuracy, though that might not translate directly to this case. And yes, I saw your Rand lib, and it's pretty stocked with nested for loops, which is not something you want for speed. I can try a first OpenMP pass on the existing code, but I think scrapping it completely might be a better idea; CLHEP is a very powerful tool.

Feynstein commented 4 years ago

Oh, and btw, can you make sure I get a few thumbs up in the other issue about the original source :P We need to show these guys the reality behind all this. I suspected it was a case of grad students piling on new features and never actually refactoring... something I've seen too often in academia XD

Feynstein commented 4 years ago

@weshinsley Oh yes, this library is used in multi-threaded, single-seed Monte Carlo high-energy physics code.

weatherhead99 commented 4 years ago

Why not use e.g. std::binomial_distribution and std::poisson_distribution etc.? That looks like a step up from the current implementation and a much simpler switch than going to CLHEP (which does make sense for large runs).

Feynstein commented 4 years ago

I can't tell from the std::poisson_distribution reference page whether they're static functions or not. I'm not sure if they hold state? The goal here is repeatability, so they need to hold their state. But I'm not familiar with them. It also seems to me that there are plenty of occurrences of their use here and there in the code... And I think that the minimum run needs 20GB of RAM, so in my mind all runs were big... But it's worth looking into; they might be more optimized. There are no references in the examples about seeding... Oh... OK, I see it now... well, nice!
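For reference, in `<random>` the seedable state lives in the *engine* object; distribution objects mostly hold parameters (they may cache internal values, which `reset()` clears). A minimal sketch, with `sample_poisson` being a hypothetical helper name, not anything from this codebase:

```cpp
#include <random>

// The engine (here std::mt19937) carries all the seedable state, so
// repeatability comes from seeding the engine. One caveat: the standard
// specifies each distribution's statistical properties but NOT its
// algorithm, so exact sequences are only reproducible within a single
// standard-library implementation, not across compilers.
int sample_poisson(std::mt19937& engine, double mean) {
    std::poisson_distribution<int> dist(mean);
    return dist(engine);
}
```

Two engines seeded identically then yield identical Poisson sequences on the same platform.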

weshinsley commented 4 years ago

I raised concerns about the std:: stats library a while back (https://stackoverflow.com/questions/34977728/c-gamma-distribution-returning-infinity/34981326#34981326), and trying to get a fix for it was a tedious and uncertain process; I got the "we can't reproduce this" response for a while, later an acknowledgement that it was being looked into, but never got a clear answer on whether it was fixed.

That wasn't in the context of this model, but there was no perceived benefit at that time in changing the RNGs, and any thoughts we might have had of moving to std:: would have been made considerably more cautious as a result.

Relying on open-source libraries that we can include in full, and that are demonstrably very close to the maths, has considerable value, and I think it remains to be shown that there is a driving need for anything external or more powerful than the simple, classical methods we are using.

But that is what the open-source process is about, so we appreciate this discussion.

Feynstein commented 4 years ago

@weshinsley I was looking into CLHEP more closely and remembered how elegantly the units were implemented in Geant4. There is also a whole lot more than the random number generators. At some point maybe some of your more mathematical phenomena could be described by linear-algebra constructs like fields or matrices, or by others more appropriate to the specific job at hand, all of which are very rigorously implemented in CLHEP...

weatherhead99 commented 4 years ago

@weshinsley Just to ask whether you're aware of a reproduction of this result in any compiler more recent than VS2010? Is there a bug ID for it? Also, note that all the standard libraries are now open source. The advantage I'd perceive is that you could drop a lot of custom RNG code which, reading through it, is very clearly not thread safe.

StevenWhitehouse commented 4 years ago

I just did a very quick test to compare the ranf routine in this project with the generator recommended in Numerical Recipes (3rd ed., p. 342), which they call Ran. That generator appears to run at around the same speed as ranf:

[steve@fogou test]$ ./rtest
ranf 0.896546 secs
Ran  0.893404 secs

In both cases that's the time taken to generate 100,000,000 pseudo-random doubles in a uniform distribution between 0 and 1, when run on my laptop.

On page 354 of Numerical Recipes, they give the code for a generator which they claim is faster if you are specifically after floats between 0 and 1 only (the Ran generator uses mostly integer maths, and can generate pseudo-random integers too). This second generator does not appear to be as good as Ran, though. I don't know whether it would make sense to use it here, since I'm not quite sure what counts as "good enough" in this case, so I've not gone the additional step of testing to see how much faster it might be.

I'm happy to share my test code if it is of help, but really all I've done is use the routine from the book. Note that earlier editions of Numerical Recipes recommend different random number generators, so if you have an older edition, the code will be different.
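For anyone wanting to repeat this kind of measurement, a rough harness looks like the following. This is a hypothetical reconstruction of the idea, not the actual `rtest` code; it times draws from any standard-conforming generator rather than ranf or Ran.

```cpp
#include <chrono>
#include <cstddef>
#include <random>

// Time n uniform doubles in [0,1) from an arbitrary generator; `sink`
// accumulates the draws so the compiler cannot optimise the loop away.
template <typename Gen>
double time_uniform_draws(Gen& gen, std::size_t n, double& sink) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        sink += u(gen);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Run once per candidate generator with the same n and compare the returned times; for stable numbers, n in the 1e8 range (as in the test above) smooths out timer noise.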

Hopefully that is a useful data point.

weshinsley commented 4 years ago

@weatherhead99 - as reported in the stackoverflow issue, that was with 2016 compilers. There is no bug ID, because there was no public issue tracker to submit std:: bugs against at that time, only email support - which also explains my caution. But it is not helpful to suggest the code is "clearly not thread safe". It is completely thread-safe in all our Intel Inspector tests, and the code makes sense to us. Submit an issue if you have a counter-example.

@StevenWhitehouse - thanks, but beware the licensing of Numerical Recipes. Although we all have the book on our shelves, the licensing is not at all simple, and it restricts the open-source usage we want. In any case, there is no strong motivation for us to change what we use.

StevenWhitehouse commented 4 years ago

@weshinsley Yes, licensing can be an issue, although I'm sure we could find a way around that via a different but similar implementation if it were worthwhile. In this case though, based on the results I've got, I suspect any gains that way would be minimal. There is also potentially a significant investment of time to check that a new RNG meets the requirements, which is not worth it without a significant gain in speed to offset it.

So I think next time I find a spare moment, I'll start looking at the next level up, i.e. the consumers of the random numbers, and see if I spot anything there. At the moment I'm just spending a little time checking things that I hope might be useful to look at, but if you have a todo list or something like that, it might be helpful as a guide.

StevenWhitehouse commented 4 years ago

Some more random thoughts...

I moved my investigation on step up the stack to look at the callers of ranf() and there are a large number of them. So I started with the callers in Rand.cpp, which are generating various distributions. There does seem to be a lot of code duplication there, and in fact ranf and ranf_mt are almost identical, so there is scope for some clean up there.

I also noticed what I think (but have not yet fully investigated) is duplication between snorm and gen_norm which appear to be different routines for generating the same distribution. Although gen_norm takes the usual mean and sd parameters, they are constant 0 and 1 in all callers, but the compiler should sort that out for us.

So the next question was how fast those routines are: gen_norm uses the Marsaglia polar method, and snorm looks like a rejection method that's been translated from Fortran. At the moment I've not verified that both are doing the same thing, but the results of a quick test of their relative performance are interesting:

[steve@fogou test]$ perf record ./ntest
snorm    2.568932 secs
gen_norm 4.037334 secs

So snorm seems much faster than gen_norm, but is it? The answer turns out to be no, since gen_norm can actually generate two values per call, but currently doesn't. The net result is that it could be twice as fast as it currently is, so that seems like something worth looking into.

An initial test added a static variable to store the second value and return it on every other call; the performance then looked like this:

[steve@fogou test]$ ./ntest
snorm    2.608799 secs
gen_norm 2.110180 secs

So almost a 2x gain in performance. That isn't thread safe, of course, but it shows that gen_norm can be trivially sped up. The next step was to make it thread safe. I looked at ranf and ranf_mt to see how that had been done, and the answer seemed to be: badly. The Xcg1 and Xcg2 arrays are much larger than actually required, at 512 bytes per thread: CACHE_LINE_SIZE, which is already in bytes, was being multiplied by sizeof(long), resulting in a very large allocation. Also, Xcg1 and Xcg2 for the same thread definitely do want to be in the same cache line, so having them in separate arrays like that makes no sense at all.
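The sizing fix can be sketched in isolation. This is illustrative only: `CACHE_LINE_SIZE = 64` is an assumed value (the real constant lives in the model's headers), and `alignas` is shown as an alternative to the manual `__pad` member in the patch below.

```cpp
#include <cstddef>

constexpr std::size_t CACHE_LINE_SIZE = 64; // assumed value, for illustration

// One state per thread, padded by the compiler to a whole cache line so
// that neighbouring threads never share a line (no false sharing), while
// Xcg1/Xcg2 for the *same* thread are guaranteed to share one.
struct alignas(CACHE_LINE_SIZE) RngState {
    long Xcg1;
    long Xcg2;
    double ncache; // cached second normal deviate from the polar method
};

// alignas pads sizeof up to a multiple of the alignment, so this holds:
static_assert(sizeof(RngState) == CACHE_LINE_SIZE,
              "exactly one cache line per thread's RNG state");
```

Compared with the old `MAX_NUM_THREADS * CACHE_LINE_SIZE * sizeof(long)` allocation, an array of these structs is both much smaller and indexed simply by thread number.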

So with all that in mind, I came up with a patch, which I'll now attempt to copy & paste into here :-)

You'll likely not be able to apply this if you copy & paste it from here, so I'll email it to anybody who wants it. It should give a rough idea of what I'm thinking, though. Benefits are:

Caveat: it needs careful review and testing. Also, I don't dare replace snorm calls with gen_norm until I'm 100% sure that they are the same, but there would be a small additional performance benefit if that were possible. Hopefully this is useful info :-)


diff --git a/src/Rand.cpp b/src/Rand.cpp
index c0fa05e..3a47d86 100644
--- a/src/Rand.cpp
+++ b/src/Rand.cpp
@@ -8,13 +8,22 @@
 #ifdef _OPENMP
 #include <omp.h>
 #endif
+
 /* RANDLIB macros*/
 #define ABS(x) ((x) >= 0 ? (x) : -(x))
 #define minF(a,b) ((a) <= (b) ? (a) : (b))
 #define maxF(a,b) ((a) >= (b) ? (a) : (b))

+struct rngstate {
+       long Xcg1;
+       long Xcg2;
+       double ncache;
+       unsigned char __pad[CACHE_LINE_SIZE-2*sizeof(long)-sizeof(double)];
+};
+
+static struct rngstate rngstate[MAX_NUM_THREADS];
+
 /* RANDLIB static variables */
-long* Xcg1, *Xcg2;
 int **SamplingQueue = nullptr;

 ///////////// ********* ///////////// ********* ///////////// ********* ///////////// ********* ///////////// ********* ///////////// ********* 
@@ -23,44 +32,30 @@ int **SamplingQueue = nullptr;

 double ranf(void)
 {
-       long k, s1, s2, z;
-       unsigned int curntg;
+       unsigned int tn;
  #ifdef _OPENMP
-       curntg = CACHE_LINE_SIZE * omp_get_thread_num();
+       tn = omp_get_thread_num();
 #else
-       curntg = 0;
+       tn = 0;
 #endif
-       s1 = Xcg1[curntg];
-       s2 = Xcg2[curntg];
-       k = s1 / 53668L;
-       s1 = Xa1 * (s1 - k * 53668L) - k * 12211;
-       if (s1 < 0) s1 += Xm1;
-       k = s2 / 52774L;
-       s2 = Xa2 * (s2 - k * 52774L) - k * 3791;
-       if (s2 < 0) s2 += Xm2;
-       Xcg1[curntg] = s1;
-       Xcg2[curntg] = s2;
-       z = s1 - s2;
-       if (z < 1) z += (Xm1 - 1);
-       return ((double)z) / Xm1;
+       return ranf_mt(tn);
 }
+
 double ranf_mt(int tn)
 {
        long k, s1, s2, z;
-       int curntg;

-       curntg = CACHE_LINE_SIZE * tn;
-       s1 = Xcg1[curntg];
-       s2 = Xcg2[curntg];
+       s1 = rngstate[tn].Xcg1;
+       s2 = rngstate[tn].Xcg2;
        k = s1 / 53668L;
        s1 = Xa1 * (s1 - k * 53668L) - k * 12211;
        if (s1 < 0) s1 += Xm1;
        k = s2 / 52774L;
        s2 = Xa2 * (s2 - k * 52774L) - k * 3791;
        if (s2 < 0) s2 += Xm2;
-       Xcg1[curntg] = s1;
-       Xcg2[curntg] = s2;
+       rngstate[tn].Xcg1 = s1;
+       rngstate[tn].Xcg2 = s2;
       z = s1 - s2;
        if (z < 1) z += (Xm1 - 1);
        return ((double)z) / Xm1;
@@ -92,8 +87,8 @@ void setall(long *pseed1, long *pseed2)
        long iseed2 = *pseed2;

        for (g = 0; g < MAX_NUM_THREADS; g++) {
-               *(Xcg1 + g * CACHE_LINE_SIZE) = iseed1 = mltmod(Xa1vw, iseed1, Xm1);
-               *(Xcg2 + g * CACHE_LINE_SIZE) = iseed2 = mltmod(Xa2vw, iseed2, Xm2);
+               rngstate[g].Xcg1 = iseed1 = mltmod(Xa1vw, iseed1, Xm1);
+               rngstate[g].Xcg2 = iseed2 = mltmod(Xa2vw, iseed2, Xm2);
        }

        *pseed1 = iseed1;
@@ -2042,27 +2037,16 @@ S140:
  */
 double gen_norm(double mu, double sd)
 {
-       double u, v, x, S;
+       unsigned int tn;

-       do
-       {
-               u = 2 * ranf() - 1; //u and v are uniform random numbers on the interval [-1,1]
-               v = 2 * ranf() - 1;
-
-               //calculate S=U^2+V^2
-               S = u * u + v * v;
-       } while (S >= 1 || S == 0);
-
-       //calculate x,y - both of which are normally distributed
-       x = u * sqrt((-2 * log(S)) / S);
-
-       // This routine can be accelerated by storing the second normal value
-       // and using it for the next call.
-       // y = v * sqrt((-2 * log(S)) / S);
-
-       //return x
-       return x * sd + mu;
+#ifdef _OPENMP
+       tn = omp_get_thread_num();
+#else
+       tn = 0;
+#endif
+       return gen_norm_mt(mu, sd, tn);
 }
+
 /*function gen_snorm
  * purpose: my own implementation of sampling from a uniform distribution, using Marsaglia polar method, but for multi-threading
  *
@@ -2071,6 +2055,13 @@ double gen_norm(double mu, double sd)
 double gen_norm_mt(double mu, double sd, int tn)
 {
        double u, v, x, S;
+       double t;
+
+       if (rngstate[tn].ncache) {
+               t = rngstate[tn].ncache;
+               rngstate[tn].ncache = 0.0;
+               return t * sd + mu;
+       }

        do
        {
@@ -2082,15 +2073,17 @@ double gen_norm_mt(double mu, double sd, int tn)
        } while (S >= 1 || S == 0);

        //calculate x,y - both of which are normally distributed
-       x = u * sqrt((-2 * log(S)) / S);
+       t = sqrt((-2 * log(S)) / S);
+       x = u * t;

        // This routine can be accelerated by storing the second normal value
        // and using it for the next call.
        //      y = v * sqrt((-2 * log(S)) / S);
+       rngstate[tn].ncache = v * t;

-       //return x
        return x * sd + mu;
 }
+
 /*function gen_gamma
  * purpose: my own implementation of sampling from a gamma distribution, using Marsaglia-Tsang method
  *
diff --git a/src/Rand.h b/src/Rand.h
index 927e48f..ad2974d 100644
--- a/src/Rand.h
+++ b/src/Rand.h
@@ -11,7 +11,6 @@

 /* RANDLIB global variables */
 extern int **SamplingQueue;
-extern long* Xcg1, *Xcg2;
 /* RANDLIB functions */
 long ignbin(long, double);
 long ignpoi(double);
diff --git a/src/SetupModel.cpp b/src/SetupModel.cpp
index ce6f9f1..3b4661b 100644
--- a/src/SetupModel.cpp
+++ b/src/SetupModel.cpp
@@ -32,8 +32,6 @@ void SetupModel(char* DensityFile, char* NetworkFile, char* SchoolFile, char* Re
        char buf[2048];
        FILE* dat;

-       if (!(Xcg1 = (long*)malloc(MAX_NUM_THREADS * CACHE_LINE_SIZE * sizeof(long)))) ERR_CRITICAL("Unable to allocate ranf storage\n");
-       if (!(Xcg2 = (long*)malloc(MAX_NUM_THREADS * CACHE_LINE_SIZE * sizeof(long)))) ERR_CRITICAL("Unable to allocate ranf storage\n");
        P.nextSetupSeed1 = P.setupSeed1;
        P.nextSetupSeed2 = P.setupSeed2;
        setall(&P.nextSetupSeed1, &P.nextSetupSeed2);
weatherhead99 commented 4 years ago

@weshinsley For example, the function SampleWithoutReplacement() is not thread safe: it uses a global pointer which neither has thread-local storage nor is guarded by any synchronisation primitives.

Lack of thread safety does not imply the appearance of non-deterministic thread-related errors, such as would be reported by a tool like Inspector. As far as I can see, that particular example can't result in a deadlock or a control-flow race hazard, merely possibly a data race.

StevenWhitehouse commented 4 years ago

A random history...

When trying to figure out how to improve something, it is always good to know where you are coming from, so with that in mind, I've tracked down some info about the current random number generator. Hopefully this info is useful... I have some further details that I hope to post in due course, but this seems to be a self-contained topic, and maybe it will be useful to others looking at this.

The random number generator currently in use as ranf was originally described in this paper: http://www-labs.iro.umontreal.ca/~lecuyer/myftp/papers/cacm88.pdf which was published in CACM in June 1988. Fig 3 on page 747 should look very familiar. L'Ecuyer then published a second paper, which is referenced in the code (which helps!) relating to how to set the initial seeds: https://dl.acm.org/doi/pdf/10.1145/103147.103158

L'Ecuyer's random number generator uses two independent LCGs which have been chosen to give a long period, the text claims circa 2.3e18. That seems like it should be more than adequate for this situation since the test data uses (I measured this) approx 1.7e9 random numbers for the longest simulation. So no concerns there.
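For concreteness, the 1988 combined generator is short enough to transcribe in full. This is a sketch following Fig. 3 of the CACM paper (and the constants visible in Rand.cpp: 53668/12211 and 52774/3791, with multipliers 40014 and 40692); it is not the model's code, which adds per-thread state on top.

```cpp
#include <cstdint>

// L'Ecuyer (1988): two 32-bit LCGs combined, each stepped with Schrage's
// decomposition so that no intermediate overflows a signed 32-bit integer.
struct LEcuyer1988 {
    std::int32_t s1 = 1, s2 = 1; // seeds must lie in [1, m-1]

    double next() {
        std::int32_t k = s1 / 53668;
        s1 = 40014 * (s1 - k * 53668) - k * 12211;   // s1 = 40014*s1 mod 2147483563
        if (s1 < 0) s1 += 2147483563;
        k = s2 / 52774;
        s2 = 40692 * (s2 - k * 52774) - k * 3791;    // s2 = 40692*s2 mod 2147483399
        if (s2 < 0) s2 += 2147483399;
        std::int32_t z = s1 - s2;                    // combine the two streams
        if (z < 1) z += 2147483562;                  // wrap into [1, 2147483562]
        return static_cast<double>(z) / 2147483563.0;
    }
};
```

Every output lies strictly in (0, 1), and the sequence is fully determined by the two seeds, which is exactly the repeatability property the model relies on.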

L'Ecuyer's generator is quite obviously popular; a quick Google for some of the constant values reveals a number of other users, the one topping the list (for me, anyway) being gnuplot. Since we mentioned Numerical Recipes above, it is worth noting that this generator appears in their 2nd Ed. (on p. 273 of my Fortran copy - I gave my 2nd Ed. C copy away when I bought the 3rd Ed.!) in combination with a Bays-Durham shuffle, added to break up any serial correlations in the output, implying that they were concerned the basic algorithm was not good enough. They named this algorithm ran2.

By the time of the 3rd Ed Numerical Recipes, they are warning about the perils of RNGs which rely solely on LCGs due to the generated numbers lying on a limited number of hyperplanes (if one considers vectors of N consecutive random numbers as points in N dimensional space). Their newly recommended generator (Ran) combines a 64 bit LCG (post processed with a simple hash function) with a 64 bit xorshift generator (see this paper by Marsaglia https://www.jstatsoft.org/article/view/v011i05 ) and also an MWC, which are all combined at the output.

It is worth noting here that the ranf generator only produces 32 bits of output, so although my quick test above suggested it runs at the same speed as Ran, one could split the 64 bits from Ran into two 32-bit words (not something you'd want to do with an LCG-only generator!), making Ran effectively twice the speed.

There are clearly things that could be done to improve the ranf function, but whether it would make any difference to the quality of the results depends on how those numbers are used. I don't have enough understanding of what is going on at the next stage up the stack to be sure. Based on my research so far, though, it probably would make sense to change to something more modern. Probably a 64 bit LCG, plus output hash, in parallel with an xorshift would be enough. There is no need to use any of the NR code, since there are plenty of other references to these techniques and implementation is straightforward.
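As a taste of the xorshift component mentioned above, Marsaglia's 64-bit xorshift is only three lines of work per draw. This is a generic sketch of the technique, using one of the full-period shift triples (13, 7, 17) from the JStatSoft paper; it is not a drop-in for ranf, and in the NR3 recipe it would be combined with an LCG-plus-hash rather than used alone.

```cpp
#include <cstdint>

// Marsaglia xorshift64: a bijection on nonzero 64-bit states, so a nonzero
// seed never reaches zero and the period covers all 2^64 - 1 nonzero states.
struct XorShift64 {
    std::uint64_t x;
    explicit XorShift64(std::uint64_t seed) : x(seed ? seed : 1) {} // 0 is a fixed point
    std::uint64_t next() {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return x;
    }
};
```

The attraction is cost: a handful of shifts and xors per 64-bit output, with none of the divisions the Schrage-style LCG step needs.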

Having done a run of the test through perf, the results suggest that however lightning fast the RNG might be, it will have little effect on the overall speed of the simulation. Approx 50% of the cpu time was spent in InfectSweep and only 2.71% in ranf in my test. The first few lines of the output from perf are included here in case it is useful:

Overhead  Command   Shared Object  Symbol
 49.24%   CovidSim  CovidSim       [.] InfectSweep
  5.78%   CovidSim  CovidSim       [.] AssignPeopleToPlaces
  5.75%   CovidSim  CovidSim       [.] IncubRecoverySweep
  5.05%   CovidSim  libm-2.30.so   [.] __ieee754_pow_fma
  4.95%   CovidSim  CovidSim       [.] dist2
  3.66%   CovidSim  CovidSim       [.] numKernel
  2.71%   CovidSim  CovidSim       [.] ranf_mt
  1.76%   CovidSim  CovidSim       [.] DoInfect
  1.57%   CovidSim  CovidSim       [.] CalcPersonSusc
  1.55%   CovidSim  CovidSim       [.] ignbin_mt
  1.43%   CovidSim  CovidSim       [.] CalcSpatialInf
  1.22%   CovidSim  CovidSim       [.] CalcPersonInf
  1.20%   CovidSim  CovidSim       [.] dist2UTM
  1.17%   CovidSim  CovidSim       [.] CalcHouseInf
  1.16%   CovidSim  CovidSim       [.] RecordSample
  1.09%   CovidSim  CovidSim       [.] CalcPlaceInf

Anything below this line used less than 1% of the CPU time. Oh, and I did also confirm that snorm is basically the same thing as gen_norm(0, 1), having tracked down the paper about that too: https://pdfs.semanticscholar.org/8ee8/979cc90816e6b286535d198ca93794085af5.pdf
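Since snorm and gen_norm(0, 1) came up several times, here is a standalone sketch of the Marsaglia polar method with the "spare" second deviate cached, as proposed in the patch earlier in the thread. It is illustrative only: it draws its uniforms from std::mt19937_64 rather than ranf, and the class name and interface are invented for the example.

```cpp
#include <cmath>
#include <cstdint>
#include <random>

// Marsaglia polar method: each accept/reject pass yields TWO independent
// standard normals; caching the second one roughly halves the average cost.
class PolarNormal {
    std::mt19937_64 eng;
    std::uniform_real_distribution<double> unif{-1.0, 1.0};
    double spare = 0.0;
    bool has_spare = false;
public:
    explicit PolarNormal(std::uint64_t seed) : eng(seed) {}
    double next(double mu = 0.0, double sd = 1.0) {
        if (has_spare) { has_spare = false; return spare * sd + mu; }
        double u, v, s;
        do {                       // rejection: keep (u,v) inside the unit disc
            u = unif(eng);
            v = unif(eng);
            s = u * u + v * v;
        } while (s >= 1.0 || s == 0.0);
        double t = std::sqrt(-2.0 * std::log(s) / s);
        spare = v * t;             // second deviate, returned on the next call
        has_spare = true;
        return u * t * sd + mu;
    }
};
```

In multi-threaded code the cache would live in per-thread state (like the `ncache` field in the patch) rather than in a shared object.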

While we are on the subject of papers, and noting also comments on SampleWithoutReplacement() above, the comment in the code claims that it is based upon the SG algorithm from this paper: https://dl.acm.org/doi/pdf/10.1145/214392.214402

A quick inspection shows that only the "else" branch of the big if statement appears to be related to the SG algorithm, and even then it doesn't look quite the same. I've not had a chance to check in any detail though, so there may be more to it... that's probably enough work on this topic for this weekend :-)

weshinsley commented 4 years ago

@weatherhead99 - can you be any more specific on what you've observed? What global pointer do you refer to in SampleWithoutReplacement? The only non-local I can see is SamplingQueue[tn][...] - where tn is thread number, making it safe.

StevenWhitehouse commented 4 years ago

I've turned some of the ideas above into a pull request. It doesn't fix everything, but since this will be my first pull request, I thought it was a good plan to show what I'd got so far and get some feedback before going too far down the line. Hopefully it is useful :-)

weshinsley commented 3 years ago

Issue clean-up - closing comments:-

#455 discusses the potential (I think) of making a better interface so it is easy to pick/choose the RNG - I suggest any further discussion on RNGs continues there.