
Decide on new PRNG BitGenerator default #13635

Closed rkern closed 5 years ago

rkern commented 5 years ago

#13163 will be bringing in the long-awaited replacement of numpy's PRNG infrastructure. In the interest of keeping that PR manageable, we will merge it to master before all of the decisions are finalized, like which BitGenerator will be nominated as the default.

We must make a decision before the first release with the new infrastructure. Once released, we will be stuck with our choice for a while, so we should be sure that we are comfortable with our decision.

On the other hand, the choice of the default does not have that many consequences. We are not talking about the default BitGenerator underlying the numpy.random.* convenience functions. Per NEP 19, these remain aliases to the legacy RandomState, whose BitGenerator remains MT19937. The only place where the default comes in is when Generator() is instantiated without arguments; i.e. when a user requests a Generator with an arbitrary state, presumably to then call the .seed() method on it. This will probably be pretty rare, as it would be about as easy to just explicitly instantiate it with the seeded BitGenerator that they actually want. A legitimate choice here might actually be to nominate no default and always require the user to specify a BitGenerator.
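For concreteness, the two patterns being compared look roughly like this (a sketch only; PCG64 is just a stand-in for whichever BitGenerator we might pick, and the seeding API is still in flux in #13163):

from numpy.random import Generator, PCG64

# Relying on a default: arbitrary state now, seeded afterwards.
g = Generator()
g.bit_generator.seed(12345)

# Explicitly naming the BitGenerator; about the same amount of typing.
g = Generator(PCG64(12345))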

Nonetheless, we will have recommendations as to which BitGenerator people should use most of the time, and while we can change recommendations fairly freely, whichever one has pride of place will probably get written about most in books, blogs, tutorials, and such.

IMO, there are a few main options (with my commentary, please feel free to disagree; I have not attempted to port over all the relevant comments from #13163):

No default

Always require Generator(ChosenBitGenerator(maybe_seed)). This is a little unfriendly, but as it's a pretty convenient way to get the generator properly initialized for reproducibility, people may end up doing this anyways, even if we do have a default.

MT19937

This would be a good conservative choice. It is certainly no worse than the status quo. As the Mersenne Twister is still widely regarded as "the standard" choice, it might help academic users who need their papers to be reviewed by people who might question "non-standard" choices, regardless of the specific qualities of the PRNG. "No one ever got fired for hiring IBM." The main downsides of MT19937 are mostly that it is slower than some of the available alternatives, due to its very large state, and that it fails some statistical quality tests. In choosing another PRNG, we have an opportunity (but not an obligation, IMO) to be opinionated here and try to move "the standard", if we wish.

PCG64

This is likely the one that I'll be using most often, personally. The main downside is that it uses 128-bit integer arithmetic, which is emulated in C if the compiler does not provide such an integer type. The two main platforms for which this is the case are 32-bit CPUs and 64-bit MSVC, which just does not support 128-bit integers even when the CPU does. Personally, I do not suggest letting the performance of increasingly-rare 32-bit CPUs dictate our choices. The MSVC performance is important, though, since our Windows builds do need that compiler and not other Windows compilers. It can probably be addressed with some assembly/compiler intrinsics, but someone would have to write them. The fact that it's only MSVC that we have to do this for makes this somewhat more palatable than other times when we are confronted with assembly. (A rough sketch of where the 128-bit arithmetic comes in follows the list of options below.)

Xoshiro256

Another modern choice for a small, fast PRNG. It does have a few known statistical quirks, but they are unlikely to be a major factor for most uses. Those quirks make me shy away from it, but that's my personal choice for the code I'll be writing.
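For reference, here is a rough Python sketch of the PCG64 step, just to show where the 128-bit arithmetic that worries me on MSVC comes from. The actual multiplier/increment constants live in the C implementation and are omitted here; this is illustrative only, not the reference algorithm byte-for-byte:

MASK128 = (1 << 128) - 1
MASK64 = (1 << 64) - 1

def pcg64_step(state, mult, inc):
    # 128-bit LCG state update: this multiply-and-add is what MSVC
    # cannot do natively and has to emulate.
    state = (state * mult + inc) & MASK128
    # XSL-RR output function: xor-fold the two 64-bit halves, then
    # rotate right by the top 6 bits of the state.
    rot = state >> 122
    xored = ((state >> 64) ^ state) & MASK64
    out = ((xored >> rot) | (xored << (64 - rot))) & MASK64
    return state, out

The single wide multiply per draw is essentially the whole story: with a native 128-bit type it compiles to a handful of instructions, and under MSVC it has to be pieced together from intrinsics.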

charris commented 5 years ago

What does the Intel Windows compiler do for 128-bit integers? How much slower is PCG64 compiled with MSVC compared to MT19937 on Windows? I suspect that the jump-ahead feature will be widely used, so it might be good to have it by default.

rkern commented 5 years ago

What does the Intel windows compiler do for 128 bit integers?

Not entirely sure; I don't know if there are ABI implications that ICC would care to be constrained by. If we just want to get any idea of the generated assembly that we could use, then this is a handy resource: https://godbolt.org/z/kBntXH

I suspect that the jump ahead feature will be widely used, so it might be good to have it by default.

Do you mean settable streams, rather? That's a good point, but I wonder if it might not cut the other way. If our choice of default actually matters much, then maybe if we pick one of these more fully-featured PRNGs, people will use those features more extensively in library code without documenting that they require those "advanced" features, because, after all, they are available "standard". But then if another user tries to use that library with a less-featured BitGenerator for speed or other reasons, then they'll hit a brick wall. In a No default or MT19937 world, libraries would be more likely to think about and document the advanced features that they require.

On the gripping hand, that eventuality would make the BitGenerators without settable streams look less desirable, and I do like the notion of advancing what's considered to be best practice in that direction (purely personally; I don't feel an obligation to make NumPy-the-project share that notion). It might help avoid some of the abuses that I see with people .seed()ing in the middle of their code. But again, all that's predicated on the notion that having a default will change people's behaviors significantly, so all of these concerns are likely quite attenuated.

imneme commented 5 years ago

How much slower is PCG64 compiled with MSVC compared to MT19937 on Windows?

In benchmarks posted by @bashtage in #13163, PCG64 is nearly half the speed of MT19937, which is pretty disappointing performance out of MSVC and friends. That compares with it being 23% faster than MT19937 on Linux.

What does the Intel windows compiler do for 128 bit integers?

Other compilers like Clang, GCC, and the Intel compiler implement 128-bit integers on 64-bit systems the same way that they implemented 64-bit integers on 32-bit systems. All the same techniques with no new ideas needed. Microsoft didn't bother to do that for MSVC so there are no 128-bit integers directly supported by the compiler.

As a result, for MSVC the existing implementation of PCG64 in #13163 hand-implements the 128-bit math by calling out to Microsoft intrinsics like _umul128 on x86_64 (and it could presumably also use equivalent and more portable Intel intrinsics like _mulx_u64 instead), thereby coding by hand what GCC, Clang and the Intel compiler would do by themselves. The biggest issue is likely that Microsoft's compiler doesn't optimize these intrinsics very well (hopefully they're at least inlined?). It's possible that hand-coded assembler might go faster, but the proper fix would be for the compiler not to be so diabolically poor.
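To make the technique concrete, here is the schoolbook construction in Python: a 64×64→128 multiply built from 32-bit halves, which is exactly how compilers implemented 64-bit integers on 32-bit systems, and the same recipe they use one level up (with 64-bit limbs) for 128-bit integers. This is purely illustrative; the real code does it in C:

def mul_64_to_128(a, b):
    # Split each 64-bit operand into two 32-bit halves.
    a_lo, a_hi = a & 0xFFFFFFFF, a >> 32
    b_lo, b_hi = b & 0xFFFFFFFF, b >> 32
    # Four partial products, then recombine with carries.
    lo = a_lo * b_lo
    mid1 = a_lo * b_hi
    mid2 = a_hi * b_lo
    hi = a_hi * b_hi
    carry = ((lo >> 32) + (mid1 & 0xFFFFFFFF) + (mid2 & 0xFFFFFFFF)) >> 32
    low64 = (lo + ((mid1 + mid2) << 32)) & ((1 << 64) - 1)
    high64 = hi + (mid1 >> 32) + (mid2 >> 32) + carry
    return high64, low64

# Check against Python's arbitrary-precision integers:
a, b = 0xDEADBEEFDEADBEEF, 0x123456789ABCDEF0
high, low = mul_64_to_128(a, b)
assert (high << 64) | low == a * b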

I suspect that the jump ahead feature will be widely used, so it might be good to have it by default.

I'm glad you like jump ahead, but I'm curious why you think it'd be widely used. (Personally, I really like distance, which tells you how far apart two PRNGs are. That's in the C++ version of PCG, but not the C one. It'd be trivial enough to add it though if there was interest.)

charris commented 5 years ago

I'm glad you like jump ahead, but I'm curious why you think it'd be widely used.

Probably unfamiliarity with current terminology. What I mean is easily obtained independent streams that can be used to run simulations in parallel. I don't know how many simulation problems can be parallelized, but I suspect it is a lot and given the number of cores people get on a chip these days, that could easily make up for a speed disadvantage.
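What I have in mind is something like the following sketch, using the jumped()-style API from #13163 (run_simulation is a made-up placeholder, and the exact method names may differ in what ships):

from numpy.random import Generator, PCG64

seed = 20190527
# One independent stream per worker, obtained by jumping far ahead in the
# sequence (or by using settable streams, where the BitGenerator supports them).
streams = [Generator(PCG64(seed).jumped(i)) for i in range(8)]
results = [run_simulation(g) for g in streams]  # run_simulation is hypothetical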

Microsoft didn't bother to do that for MSVC so there are no 128-bit integers directly supported by the compiler.

So that will hurt our wheels, OTOH, many folks on Windows get their packages from Anaconda or Enthought, both of which use Intel, and folks who really care about performance are probably on Linux, Mac, or maybe AIX.

EDIT: And perhaps if Microsoft is concerned, they could offer a bounty for fixing the problem.

rkern commented 5 years ago

FWIW, here's the assembly that clang would generate for the critical function, including the bits needed to unpack/repack the uint128_t into the struct of uint64_ts: https://godbolt.org/z/Gtp3os

imneme commented 5 years ago

Very cool, @rkern. Any chance you can do the same to see what MSVC is doing with the hand-written 128-bit code?

rkern commented 5 years ago

Very cool, @rkern. Any chance you can do the same to see what MSVC is doing with the hand-written 128-bit code?

It's, uh, not pretty. ~https://godbolt.org/z/a5L5Gz~

Oops, forgot to add -O3, but it's still ugly: https://godbolt.org/z/yiQRhd

imneme commented 5 years ago

It's not quite that bad. You didn't have optimization on, so it didn't inline anything. I've added /Ox (maybe there's a better option?). I also fixed the code to use the built-in rotate intrinsic (_rotr64) since apparently MSVC is incapable of spotting the C rotate idiom.

Still kind-of a train wreck though. But I think it's fair to say that with a bit of attention, the PCG64 code could be tweaked to compile on MSVC into something that isn't utterly embarrassing for everyone.

eric-wieser commented 5 years ago

In order to allow everything else to be merged, why not pick "no default" for now? That leaves us free to make the decision on the default later (even after one or more releases) without breaking compatibility.

mattip commented 5 years ago

Most of our users are not random number experts, we should be providing defaults for them.

Beyond the prosaic "now they need to type more code", what happens when we change something? In the case where the BitGenerator is hard-coded (because we did not provide a default), every non-sophisticated user will have to refactor their code, and hopefully understand the nuances of their choice (note we cannot even agree amongst ourselves what is best). However, if we provide a default and later change it, we might noisily break their tests, because the new default or new version is not bit-stream compatible.

Between the assumption that the bit-stream will always be constant and the assumption that the NumPy developers know what they are doing and that the default values should be best-of-breed, I would err on the side of the second assumption, even if it breaks the first.

Edit: clarify which developers should know what they are doing

rkern commented 5 years ago

Most of our users are not random number experts, we should be providing defaults for them.

Well, we'll certainly be documenting recommendations, at the very least, regardless of whether or not we have a default or what the default is.

Beyond the prosaic "now they need to type more code", what happens when we change something?

What "something" are you thinking about? I can't follow your argument.

bashtage commented 5 years ago

Beyond the prosaic "now they need to type more code", what happens when we change something?

What "something" are you thinking about? I can't follow your argument.

@mattip is referring to changing the default bit generator.

This would make users who have adopted it mad, and could require some code changes.

For example, if you used

g = Generator()
g.bit_generator.seed(1234)

and the underlying bit generator was changed, then this would be wrong.

If you did the more sane thing and used

Generator(BitGenerator(1234))

then you would not see it.

IMO, when considering the choice of default, we should think of it as fixed until either a fatal flaw is found in the underlying bit generator, or Intel adds a QUANTUM_NI to its chips that produces a many-orders-of-magnitude improvement in random-number performance.

imneme commented 5 years ago

I realize I'm a bit of an outsider here, but I don't think it's reasonable to expect the default PRNG choice to be fixed forever and never change. (In C++, for example, std::default_random_engine is at the discretion of the implementation and can change from release to release.)

Rather, there needs to be a mechanism to reproduce prior results. Thus once a particular implementation exists, it is very uncool to change it (e.g., the MT19937 is MT19937, you can't tweak it to give different output). [And it's also uncool to remove an implementation that already exists.]

When the default changes, people who want to keep reproducing old results will need to ask for the previous default by name. (You could make that easier by providing a mechanism to select the default corresponding to a prior release.)

That said, even if you're allowed to swap out the default generator for something else, it really needs to be strictly better — any features present in the default generator represent a commitment to support that feature in the future. If your default generator has efficient advance, you can't really take that away later. (You could potentially wall off advanced functionality in the default generator to avoid this issue.)

In summary, there are ways to make sure users can have reproducible results without trying to lock yourselves into a contract where the default is unchanged forever. It'll also reduce the stakes for the choice you make.

(FWIW, this is what I did in PCG. The default PCG 32-bit PRNG is currently the XSH-RR variant [accessed as pcg_setseq_64_xsh_rr_32_random_r in the C library and the pcg_engines::setseq_xsh_rr_64_32 class in the C++ library], but in principle if you really want future-proof reproducibility you should specify XSH-RR explicitly, rather than use pcg32_random_r or pcg32 which are aliases and in principle can be upgraded to something else.)

bashtage commented 5 years ago

It is not really forever (this entire project is 90% driven by a real, genuine, and honored forever-promise made about 14 years ago), but as you say, switching (a) needs a compelling reason to change and (b) would take at least a few years given the deprecation cycle.

It is much better to try hard today to get it as close to right as possible.

One thing that isn't banned, of course, is improving PRNG code after release as long as it produces the same values. For example, if we went with a PRNG that used uint128, we could let MS add uint128 support (fat chance) or add assembly for Win64 in a future version.

rkern commented 5 years ago

For example, if you used

g = Generator()
g.bit_generator.seed(1234)

and the underlying bit generator was changed, then this would be wrong.

Right, that seems to be arguing, with @eric-wieser, for the "No default" option, which I can't square with the initial statement "Most of our users are not random number experts, we should be providing defaults for them."

bashtage commented 5 years ago

Between no default and a friendly default that makes all the assumptions for the user, I would always choose the latter:

Now:

Generator() # OK
Generator(DefaultBitGenerator(seed)) # OK
Generator(seed) # error

my preference:

Generator(1234) == Generator(DefaultBitGenerator(1234))
Generator(*args, **kwargs) == Generator(DefaultBitGenerator(*args, **kwargs))

Now I don't think this is going to get in, but I think one way to prolong the use of RandomState is to make this only available to users who feel they are expert enough to choose a bit generator.
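Roughly, the dispatch I'd like is something like this sketch (make_generator and DefaultBitGenerator are stand-in names, not the actual Generator signature; DefaultBitGenerator would be whatever we pick):

from numpy.random import Generator, PCG64

DefaultBitGenerator = PCG64   # stand-in; the point is that users never have to name it

def make_generator(arg=None):
    # Sketch of the proposed friendly constructor behaviour.
    if arg is None:
        return Generator(DefaultBitGenerator())       # entropy-seeded
    if isinstance(arg, int):
        return Generator(DefaultBitGenerator(arg))    # seeded default
    return Generator(arg)                             # explicit BitGenerator passed through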

rkern commented 5 years ago

In summary, there are ways to make sure uses can have reproducible results without trying to lock yourselves into a contract where the default is unchanged forever. It'll also reduce the stakes for the choice you make.

Yes, we have that. Users can grab BitGenerators by name (e.g. MT19937, PCG64, etc.) and instantiate them with seeds. BitGenerator objects implement the core uniform PRNG algorithm with a limited set of methods for drawing uniform [0..1) float64s and integers (as well as whatever fun jumpahead/stream capabilities they have). The Generator class that we are talking about takes a provided BitGenerator object and wraps around it to provide all of the non-uniform distributions, the Gaussians, the gammas, the binomials, etc. We have strict stream-compatibility guarantees for the BitGenerators. We won't be getting rid of any (that make it to release), nor will we be changing them.

The central question about the default is "What does the code g = Generator(), with no arguments, do?" Right now, in the PR, it creates a Xoshiro256 BitGenerator with an arbitrary state (i.e. drawn from a good entropy source like /dev/urandom). The "No default" option would be to make that an error; users would have to explicitly name the BitGenerator that they want. @eric-wieser's point is that "No default" is a categorically safe option for the first release. A later release providing a default won't cause problems in the same way that changing an existing default does.

imneme commented 5 years ago

@rkern, If you only care about the no arguments case where the seed is autogenerated from available entropy, then it really doesn't matter much what the underlying generator is — it could change on an hourly basis since the results would never be reproducible (different runs would get different seeds).

In contrast, @bashtage seems to care about a default generator that's provided with a seed.

rkern commented 5 years ago

@rkern, If you only care about the no arguments case where the seed is autogenerated from available entropy, then it really doesn't matter much what the underlying generator is — it could change on an hourly basis since the results would never be reproducible (different runs would get different seeds).

You can reseed the BitGenerator after it's created. So if Generator() works, what I'm fully expecting to happen is that people who want a seeded PRNG will just seed it in the next line, as in @bashtage's example:

g = Generator()
g.bit_generator.seed(seed)

That's somewhat tedious, which is why I suggested at top that maybe most people would usually opt for Generator(PCG64(<seed>)) anyways, since it's just about as convenient typing-wise. However, @bashtage correctly notes that there is some resistance when users are faced with making an extra decision.

So I guess we also have a broader question in front of us: "What are all the ways that we want users to instantiate one of these? And if those ways have default settings, what should those defaults be?" We have some open design space and @bashtage's suggestion for Generator(<seed>) or Generator(DefaultBitGenerator(<seed>)) are still possibilities.

@bashtage How much do you think documentation would help? That is, if we said at the top "PCG64 is our preferred default BitGenerator" and used Generator(PCG64(seed)) consistently in all examples (when not specifically demonstrating other algorithms)?

I might be more convinced to have a default_generator(<seed>) function over Generator(<seed>) or g=Generator();g.seed(<seed>). Then if we really needed to change it and didn't want to break stuff, we could just add a new function and add warnings to the old one. I might recommend marking it experimental for the first release, giving us some time to watch this infrastructure in the wild before making a firm commitment.
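i.e., something like this sketch (the default_generator name and the choice of PCG64 here are purely illustrative):

from numpy.random import Generator, PCG64

def default_generator(seed=None):
    # Hypothetical convenience function. If we ever needed to change the
    # underlying algorithm, we could add a new function and deprecate this
    # one, instead of silently changing the stream behind people's backs.
    return Generator(PCG64(seed))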

shoyer commented 5 years ago

What about actually making a DefaultBitGenerator object that doesn't expose any details of its internal state? This would be a proxy for one of the other bit generator objects, but in principle could be wrapping any of them -- except of course for its specific sequence of generated numbers. This would hopefully discourage users from making programmatic assumptions about what they can do with the default BitGenerator, while allowing us to still use an improved algorithm.

I agree with @bashtage that it would be much more friendly to directly support integer seeds as arguments to Generator, e.g., np.random.Generator(1234). This would, of course, make use of DefaultBitGenerator.

In documentation for Generator, we could give a full history of what the default bit generator was in each past version of NumPy. This is basically @imneme's suggestion, and I think would suffice for reproducibility purposes.
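A rough sketch of the proxy idea (illustrative only; the real thing would be implemented in C/Cython and hide much more, and it assumes Generator only needs the capsule interface of whatever it wraps):

from numpy.random import PCG64

class DefaultBitGenerator:
    # Wraps whichever concrete BitGenerator we currently prefer, exposing only
    # a minimal interface, so users can't come to depend on PCG64-specific
    # features (streams, advance, ...) by accident.
    def __init__(self, seed=None):
        self._bit_generator = PCG64(seed)   # implementation detail, may change

    @property
    def capsule(self):
        # The C-level interface that Generator actually consumes.
        return self._bit_generator.capsule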

imneme commented 5 years ago

(Just saw this edit to an earlier comment)

Oops, forget to add -O3, but it's still ugly: https://godbolt.org/z/yiQRhd

For MSVC, it's not -O3, it's /O2 or /Ox (but not /O3!).

shoyer commented 5 years ago

In documentation for Generator, we could give a full history of what the default bit generator was in each past version of NumPy. This is basically @imneme's suggestion, and I think would suffice for reproducibility purposes.

Actually, even better would be to include an explicit version argument, like pickle's protocol argument, in Generator/DefaultBitGenerator. Then you could write something like np.random.Generator(123, version=1) to indicate that you want "version 1" random numbers (whatever that is) or np.random.Generator(123, version=np.random.HIGHEST_VERSION) (default behavior) to indicate that you want the latest/greatest bit generator (whatever that is).

Presumably version=0 would be the MT19937 that NumPy has used up to now, and version=1 could be whatever new default we pick.
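Concretely, something like this sketch (the version numbers, HIGHEST_VERSION name, and versioned_generator function are made up for illustration):

from numpy.random import Generator, MT19937, PCG64

HIGHEST_VERSION = 1
_VERSIONED_DEFAULTS = {
    0: MT19937,   # what NumPy has shipped up to now
    1: PCG64,     # whatever new default gets picked
}

def versioned_generator(seed, version=HIGHEST_VERSION):
    return Generator(_VERSIONED_DEFAULTS[version](seed))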

rkern commented 5 years ago

What about actually making a DefaultBitGenerator object that doesn't expose any details of its internal state? This would be a proxy for one of the other bit generator objects, but in principle could be wrapping any of them -- except of course for its specific sequence of generated numbers. This would hopefully discourage users from making programmatic assumptions about what they can do with the default BitGenerator, while allowing us to still use an improved algorithm.

Hmmm. That's appealing. It feels like it's overcomplicating things and adding another loop to this Gordian knot (and that there should be a more Alexander-esque stroke available to us), but that's really the only bad thing I have to say about it. It does make the remaining decisions easier: we can focus on statistical quality and performance.

Actually, even better would be to include an explicit version argument, like pickle, in Generator/DefaultBitGenerator.

I'm less a fan of this. Unlike the pickle case, these things have meaningful names that we can use, and we already have the mechanism implemented.

shoyer commented 5 years ago

I'm less a fan of this. Unlike the pickle case, these things have meaningful names that we can use, and we already have the mechanism implemented.

Consider the following from the perspective of a typical NumPy user:

I think it's safe to assume that most of our users know very little about the relative merits of RNG algorithms. But even without reading any docs, they can safely guess that version=1 (the newer default) must be better in most cases than version=0. For most users, that's really all they need to know.

In contrast, names like MT19937 and PCG64 are really only meaningful for experts, or people who have already read our documentation :).

rkern commented 5 years ago

In your use case, no one is selecting the version that they want. They only select the version that they need in order to replicate the results of a known version. They are always looking for a specific value that was used (implicitly, because we allowed it to be implicit) in the results that they want to replicate; they don't need to reason about the relationship between multiple values.

And in any case, that level of cross-release reproducibility is something that we've disclaimed in NEP 19. The arguments against versioning the distributions apply just as well here.

rgommers commented 5 years ago

A few thoughts on the default:

matthew-brett commented 5 years ago

Sorry to say - but 32-bit still matters on Windows - see https://github.com/pypa/manylinux/issues/118#issuecomment-481404761

matthew-brett commented 5 years ago

I think we should care a lot about statistical properties, because we are in the process of a big shift towards greater use of resampling methods in data analysis. If Python gets a reputation for being a bit sloppy on this matter, even if it's just by default, that might well be a barrier to uptake by people considering Python for data analysis. I would be very pleased if Python were the package of choice for people who take permutation and simulation seriously.

I think it's fine to offer faster, not-state-of-the-art algorithms, but not by default, to the extent we can avoid it and maintain backward compatibility.

matthew-brett commented 5 years ago

For some forensics and discussion, see: https://arxiv.org/pdf/1810.10985.pdf

Textbooks give methods that implicitly or explicitly assume that PRNGs can be substituted for true IID U[0, 1) variables without introducing material error [20, 7, 2, 16, 15]. We show here that this assumption is incorrect for algorithms in many commonly used statistical packages, including MATLAB, Python’s random module, R, SPSS, and Stata.

@kellieotto, @pbstark - do y'all have an opinion about what PRNG we should choose here, to give the best possible basis for permutation and bootstrap?

rgommers commented 5 years ago

I think we should care a lot about statistical properties, because we are in the process of a big shift towards greater use of resampling methods in data analysis

Agreed. As long as those properties are relevant for some real-world use cases, that's very important. The concerns that are usually brought up are always extremely academic.

For some forensics and discussion, see: https://arxiv.org/pdf/1810.10985.pdf

Very interesting article. It does conclude that NumPy is about the only library that gets it right (top of page 9), unlike R, Python stdlib & co.

It would be very useful to get even more concrete examples than in the paper. If our current default generator also breaks down at some point, when is that? Examples like R's sample function generating 40% even numbers and 60% odd numbers when drawing ~1.7 billion samples. What's the bootstrapping/resampling equivalent here?

pbstark commented 5 years ago

The latest release of R (3.6) fixes random integer generation, replacing the truncation approach with one based on random bits. The Mersenne Twister remains the default PRNG, though.

Kellie Ottoboni (@kellieotto) and I think the default PRNG in scientific languages and statistics packages should be cryptographically secure (a CS-PRNG, e.g., SHA256 in counter mode), with the option to fall back to something faster but of lower quality (e.g., the Mersenne Twister) if speed requires it.

We've been working on a CS-PRNG for Python: https://github.com/statlab/cryptorandom

Performance isn't great (yet). The bottleneck seems to be type conversion within Python to cast binary strings (hash output) as integers. We're working on an implementation that moves more of the work to C.
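The core idea is just hashing a counter, roughly like this pure-Python sketch (not cryptorandom's actual implementation or API, just the shape of the construction):

import hashlib

class HashCounterPRNG:
    # SHA-256 in counter mode: hash (seed, counter) and use the digest as random bits.
    def __init__(self, seed):
        self._seed = str(seed).encode()
        self._counter = 0

    def next_uint64(self):
        digest = hashlib.sha256(self._seed + self._counter.to_bytes(16, "big")).digest()
        self._counter += 1
        # Converting hash output into integers is exactly the Python-level
        # overhead we are trying to push down into C.
        return int.from_bytes(digest[:8], "big")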

Cheers, Philip


imneme commented 5 years ago

Performance matters to a significant fraction of the user base. Statistical properties of generators only matters to a very small fraction of users. All included generators are fine for normal use cases. So +1 for choosing the fastest one as default

First, there is simply no way to pick “the fastest”. @bashtage ran some benchmarks on the current code in #13163 and it was all over the map, with dSFMT winning on Windows and being soundly beaten by PCG64 and Xoshiro256 on Linux. And this is all on the same machine with the same benchmark. Different hardware architectures (even revisions within x86) will make a difference, as will different benchmarks. (As already discussed in this thread, PCG does poorly in the Windows benchmarks because of issues with MSVC, which is also likely to be a transient thing, since MSVC may improve or people may work around its issues. Probably similar MSVC issues explain why Xoshiro was beaten.)

I also wonder just how big the “significant fraction” of users who care about speed is. Python itself averages out to be about 50× slower than C. What fraction of the NumPy userbase is running it on PyPy (which would give a 4× speed boost)? Some, certainly, but I suspect not a very high number.

And for that “significant fraction” who do care about speed, given all the variability outlined above, who is going to just take your word for it that the default PRNG will run fastest for their application? A sensible thing to do (that is also quite fun and within the reach of most users) is to benchmark the different available PRNGs and see which one is fastest for them.
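For example, something along these lines (a sketch using the names from #13163; adjust for whatever actually ships):

import timeit
from numpy.random import Generator, MT19937, PCG64, Philox

for bitgen in (MT19937, PCG64, Philox):
    g = Generator(bitgen(12345))
    t = timeit.timeit(lambda: g.random(1_000_000), number=20)
    print(f"{bitgen.__name__:10s} {t:.3f} s")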

In contrast, although they may find clues in the documentation, figuring out the statistical quality of particular PRNGs is, as you note, not on the radar of most users (and is challenging even for experts). Most won't even know when/whether they should care or not. I'd argue that this is a place for some paternalism — the fact that most users don't care about something doesn't mean that the maintainers shouldn't care about it.

It's true that all the included PRNGs are fine for most use cases, but that is a fairly low bar. Unix systems have shipped with a smorgasbord of C-library PRNGs that are all statistically terrible, and yet they've been widely used for years without the world spinning off its axis.

Beyond statistical properties, there are other properties that users might not know to want for themselves but that I might want for them. Personally, as a provider of PRNGs I want to avoid trivial predictability — I don't want someone to look at a few outputs from the PRNG and then be able to say what all future outputs will be. In most contexts where NumPy is used, predictability is not an issue — there is no adversary who will benefit from being easily able to predict the sequence. But someone somewhere is going to use NumPy's PRNGs not because they need NumPy to do statistics, but because that's where they've found PRNGs before; that code may face an actual adversary who will benefit from being able to predict the PRNG. Paying a lot (e.g., significant loss of speed) to robustly insure against this outlier situation isn't worth it, but modest insurance might be worth it.

imneme commented 5 years ago

For some forensics and discussion, see: https://arxiv.org/pdf/1810.10985.pdf

Textbooks give methods that implicitly or explicitly assume that PRNGs can be substituted for true IID U[0, 1) variables without introducing material error [20, 7, 2, 16, 15]. We show here that this assumption is incorrect for algorithms in many commonly used statistical packages, including MATLAB, Python’s random module, R, SPSS, and Stata.

FWIW, there is a nice paper by @lemire on efficiently generating a number in a range without bias. I used that as a jumping off point to explore and run some benchmarks too in my own article. (When generating 64-bits, Lemire's method does use 128-bit multiplication to avoid slow 64-bit division, with all the familiar issues that might raise for MSVC users.)
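In Python-flavoured pseudocode, the method looks like this (a sketch; in C the x * s product is a single 64×64→128 multiply, with all the MSVC caveats already discussed):

MASK64 = (1 << 64) - 1

def bounded_rand(next_uint64, s):
    # Lemire's multiply-and-reject method: an unbiased draw from [0, s)
    # using one multiply and (almost always) no division.
    x = next_uint64()
    m = x * s                      # 128-bit product
    low = m & MASK64
    if low < s:                    # rare slow path: compute the rejection threshold
        threshold = (2**64 - s) % s
        while low < threshold:
            x = next_uint64()
            m = x * s
            low = m & MASK64
    return m >> 64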

rkern commented 5 years ago

@pbstark @kellieotto I read your paper with interest when it showed up on arXiv. I was visiting some friends at BIDS, and they had mentioned your work. The Discussion section notes that "so far, we have not found a statistic with consistent bias large enough to be detected in O(10^5) replications" for MT19937. Have you found one yet? Have you found a concrete example for a 128-bit-state PRNG like PCG64? That seems to me to be a reasonable threshold of practical relevance, where this consideration might start to outweigh others (IMO), at least for the purpose of selecting a general-purpose default.

The nice feature of our new PRNG framework #13163 is that it allows anyone to provide their own BitGenerator that can just be plugged in. It doesn't even have to be in numpy for people to use it in their numpy code. I would encourage you to look at implementing cryptorandom as a BitGenerator in C so we can compare it head to head with the other options.

seberg commented 5 years ago

Personally, I expect those that really care about speed will go the extra mile if necessary (it is not much here). We should provide safe defaults, and my current best guess is that this means safe defaults for all purposes with the exception of cryptography (we should probably have a warning about that in the docs). Many users care about speed, but frankly that is exactly why I shy away from giving it too high a priority. That article where NumPy did well seems interesting (kudos to Robert, probably, for getting it right!), but that is actually about the sampler, not the bit generator.

@pbstark maybe you would want to implement this as a BitGenerator compatible with numpy/randomgen? That is likely the easiest way both to speed it up and to make it available to a wide audience in a much more useful form. Since it seems you and Kellie Ottoboni are in Berkeley, we could meet up some time to get that going? (Just an offer, I should have a closer look at the code myself first.)

imneme commented 5 years ago

Regarding that Random Sampling: Practice Makes Imperfect paper, it's a nice read, but it's worth remembering that if we had 1 trillion cores producing one number per nanosecond for 100 years, we would generate fewer than 2^102 numbers.
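(The arithmetic, if you want to check it:)

from math import log2
# cores × draws per second × seconds in 100 years
total_draws = 1e12 * 1e9 * (100 * 365.25 * 24 * 3600)
print(log2(total_draws))   # ≈ 101.3, i.e. fewer than 2**102 numbers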

For trivially predictable PRNGs (even ones with large state spaces like the Mersenne Twister), we can actually know whether some specific output sequence can ever be produced (and find the seed that would generate it if it exists, or feel wistful if it doesn't), but for other non-trivially-predictable PRNGs we can't (easily) know which output sequences can never be produced and which ones are there but sufficiently rare that we're vanishingly unlikely to ever find them in an eon of searching. (As you may know, I have a PRNG that I know will spit out a zip file with Twelfth Night in it within 2^80 outputs, but good luck ever finding it.)

bashtage commented 5 years ago

If you really want a crypto PRNG then the only choice on modern hardware is AES, since it has a dedicated instruction. @lemire has an implementation here https://github.com/lemire/testingRNG/blob/master/source/aesctr.h that is as fast as non-crypto generators. There is also ChaCha20, which can go fast with SIMD. Both will be dog slow on old hardware, though. ThreeFry and Philox are already included and are crypto-lite counter PRNGs.

IMO crypto is overrated in terms of costs and benefits. I'm not aware of any important retraction due to PRNG problems with MT, which I reckon has been used in on the order of 10e6 published papers. The only applications I've seen where the PRNG was really problematic were cases where the period was so small that the generator completed the full cycle. Even here the only effect was reducing the sample size of the study, which replicated the main results once rerun on a system with a larger period.


rgommers commented 5 years ago

I also wonder just how big the “significant fraction” of users who care about speed is. Python itself averages out to be about 50× slower than C. What fraction of the NumPy userbase is running it on PyPy (which would give a 4× speed boost)? Some, certainly, but I suspect not a very high number.

I suspect you're not a regular user:) NumPy is mostly C under the hood, and is as fast as doing your own thing in C (well, faster mostly). Also, PyPy is not production ready for scientific applications, and slower in any case (because it is limited to using the CPython API that NumPy uses, so it cannot gain the benefits of its JIT).

Either way, this is off-topic. Asserting that speed matters is not controversial.

bashtage commented 5 years ago

@imneme we are using Lemire's method for bounded integers. Since this is a fresh start with no legacy or deprecation constraints, we have tried hard to start with good algorithms.


rgommers commented 5 years ago

We should provide safe defaults, and my current best guess is that this means safe defaults for all purposes with the exception of cryptography

That's hard to argue with. My question is - what is safe? There are just varying degrees of quasi-randomness with various properties. So far I have not seen anyone give a concrete example, neither here nor in other issues, PRs or threads. Just talking about abstract statistical properties isn't helpful.

charris commented 5 years ago

My sense is that PCG64 would be a good default. The speed disadvantage on Windows will not be apparent to folks using Anaconda et al., and is likely to be fixed at some point. With parallel execution being the new thing in Python, I also think having settable streams is a desirable property.

lemire commented 5 years ago

I am highly skeptical that the PCG64 speed penalty under Visual Studio is something that cannot be wiped away.

Was this carefully assessed somewhere?

rkern commented 5 years ago

Asserting that speed matters is not controversial.

My question is - what is safe?

Apply the logic consistently: "what is fast"? I don't have a great sense of which numpy programs actually have the performance of the BitGenerator as a significant bottleneck. If I use a BitGenerator that's twice as fast, will I get a 5% speed-up in my full calculation? Probably not even that (even a 5% speed-up would require the BitGenerator to account for about 10% of the total runtime). Python-not-being-as-fast-as-C is not the issue; it's just that even PRNG-heavy programs that are actually useful don't spend a huge amount of time in the BitGenerator. Probably any of the available choices are sufficient.

rkern commented 5 years ago

I am highly skeptical that the PCG64 speed penalty under Visual Studio is something that cannot be wiped away.

Up-thread I show how clang compiles PCG64 into assembly that we can steal for 64-bit MSVC, so, no, I don't think MSVC on 64-bit Windows is an insurmountable problem.

What may be trickier is PCG64 on 32-bit systems, of which only 32-bit Windows may still be practically important for us. In that case it's less about MSVC than about restricting ourselves to the 32-bit ISA.

pbstark commented 5 years ago

What Kellie Ottoboni (@kellieotto) and I point out is that even for modest-size problems, MT's state space is too small to approximate uniform permutations (n < 2100) or uniform random samples (n = 4e8, k = 1000).

That affects everything from the bootstrap to permutation tests to MCMC. The difference between the intended distribution and the actual distribution can be arbitrarily large (total variation distance approaching 2). It's big and it's serious.
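(A quick back-of-the-envelope check of those bounds, sketched in Python:)

from math import lgamma, log

def log2_factorial(n):
    return lgamma(n + 1) / log(2)

def log2_comb(n, k):
    return log2_factorial(n) - log2_factorial(k) - log2_factorial(n - k)

# MT19937 has 19937 bits of state; log2(n!) crosses that between n = 2080 and 2081.
print(log2_factorial(2080), log2_factorial(2081))   # ≈ 19933 and ≈ 19944
# Choosing k = 1000 items from n = 4e8 also needs more than 19937 bits.
print(log2_comb(4e8, 1000))                         # ≈ 20051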

We haven't put any effort into breaking MT on "statistical" functions in a couple of years. I'm pretty sure there's a systematic way to break it (since the distributional distances are so large).

Cheers, Philip


rkern commented 5 years ago

@pbstark What I'd like to see is a concrete implementation of problem (could be artificial, but not too contrived) on which MT or a 128-bit PRNG fails and cryptorandom would work on. Can you point out a dataset out there where the resampling method gives wrong inferences with a 128-bit PRNG and correct inferences with cryptorandom?

pbstark commented 5 years ago

Moving to PCG64 makes the lower bound on the size of the problem worse, since its state space is even smaller than that of MT. Of course, it could still produce "better" randomness in that it might sample a subgroup of the permutation group more evenly than MT does. But it has to break down before 500 choose 10, and before 21!.

Cheers, Philip


seberg commented 5 years ago

I do not know enough about PRNGs to really weigh in, in any case; I just want the focus to be on the statistical properties first (if the answer is that they are all very, very good, fine). One thing that I wonder about now is k-dimensional equidistribution. Do we currently use variants of, say, PCG that do well here compared to MT? (Coming from nonlinear dynamics, that makes me a bit nervous, but I do not have enough of an overview of PRNGs and I will not get it in the next 2 days...)

matthew-brett commented 5 years ago

It seems unlikely that there are many Windows 32-bit users out there who care about cutting edge performance. It doesn't take much effort to switch to 64-bits.

pbstark commented 5 years ago

I'd like to see it too.

We know--on the basis of the math--that there must be many large problems, but we can't point to an example yet.

The precautionary principle would say that since we know there are large problems and we know how to prevent them (CS-PRNGs), we might as well do that by default, and let users be less cautious if they choose to be.
