modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/
Other
22.99k stars 2.59k forks source link

[BUG] [stdlib] `random` module doesn't seed random number generator automatically #2228

Open arvindavoudi opened 6 months ago

arvindavoudi commented 6 months ago

Bug description

I have encountered an issue with the random module in Mojo's standard library. Despite specifying a range, the module consistently generates a fixed number that remains the same across multiple runs of the code.

For example:

It's important to note that during bug reproduction, you might observe different numbers from those I have specified here. However, the core problem remains the same: the generated random number is persistent and does not change upon repeated execution of the Mojo code.

Steps to reproduce

from random.random import random_si64

def main():
    r = random_si64(10, 1000)
    print("Random number is:", r)

System information

- What OS did you do install Mojo on? Both MacOS & Mojo Playground
- Mojo Version: 24.2
Moosems commented 6 months ago

Very interesting.

Moosems commented 6 months ago

Looking at the implementation it seems this is actually an issue with something else as it makes an external call to "KGEN" though I have no idea what that actually is.

arvindavoudi commented 6 months ago

Exactly! I also tried to debug it but had no chance without knowing what actually is happening under the hood. As far as I checked, other random functions are corrupted as well.

Moosems commented 6 months ago

Maybe we make our own implementation in native Mojo?

arvindavoudi commented 6 months ago

That sounds great to me but as far as it's a crucial section of Mojo which is completely plagued with bugs, we may need to wait until the team gets noticed about this issue and the related corruption we experience at the moment. Accordingly, I think their implementation will be fixed quickly and obsessively before any chance for the contributors to do reverse engineering and then make a PR.

Moosems commented 6 months ago

@jackos Can we move this up in terms of importance? This seems fairly important.

soraros commented 6 months ago

If you run the following code, you shall see that it does generate a sequence of pseudo random numbers. It's the same across runs indicates that they currently don't use a "random" initial seed automatically. It is a bug worth fixing, or at least a behaviour worth documenting but I don't think it's anywhere near a "language problem".

from random.random import random_si64

def main():
  for i in range(10):
    print(random_si64(10, 1000))
Moosems commented 6 months ago

No, but it does matter and if I understood what you said earlier it affects the whole random module.

soraros commented 6 months ago

You could just call seed() inside main:

from random.random import random_si64, seed

def main():
  seed()  # could also put any Int here
  for i in range(10):
    print(random_si64(10, 1000))
Moosems commented 6 months ago

Maybe they could use the Unix time as the seed?

soraros commented 6 months ago

Maybe they could use the Unix time as the seed?

That's exactly how the nullary sort overload is implemented.

soraros commented 6 months ago

So I guess rather than a bug report, it could have been a feature request for different (better?) API design, or for better documentation, or both. WDYT, @arvindavoudi?

Moosems commented 6 months ago

I'd classify this as a bug personally

arvindavoudi commented 6 months ago

The essence of generating random numbers is to obtain entirely unique values. However, to ensure reproducibility in scenarios where consistent results are necessary, a seed is employed. This approach allows programmers to intentionally achieve identical outcomes across multiple executions. Typically, the convention in this field is to explicitly set a seed only when consistency is required, not by default. For that, programmers should just take one step further to imply consistent seed requirements themselves.

So I also classify this as an unexpected behavior which not comply with the simplicity and the tradition-ruled community (specifically Python) over the years.

That's something worrying for me as it can cause several crucial security issues if it doesn't get enough care to be understood by the programmers. Although It's an issue, lack of documentation can also escalate it.

As @soraros said, It can be fixed (by a PR) by invoking the seed method at the beginning of every related function's body but I really doubt if it's a good idea.

I'm eager to hear thoughts from the team to ensure if there is a certain reason for this or just an early implementation that needs to be considered and addressed over time.

Moosems commented 6 months ago

Well said!

soraros commented 6 months ago

The essence of generating random numbers is to obtain entirely unique values. However, to ensure reproducibility in scenarios where consistent results are necessary, a seed is employed. This approach allows programmers to intentionally achieve identical outcomes across multiple executions.

This is just wrong. Unless you have a hardware RNG or something equivalent (like system time), all random algorithms are deterministic, and therefore needs a source of entropy (seeding) to "appear random". The need itself (in contrast to "actually using a constant one") is not somehow invented to ensure reproducibility. Even the Python random module has "Generate pseudo-random numbers" in the title. But maybe I misread your message, and you were merely against requiring the use seed by default rather than seeing as an idea?

That's something worrying for me as it can cause several crucial security issues if it doesn't get enough care to be understood by the programmers. Although It's an issue, lack of documentation can also escalate it.

Considering the use of random module for security purposes should have been the thing that worries you, as the Python documentation explicitly advises against it.

As @soraros said, It can be fixed (by a PR) by invoking the seed method at the beginning of every related function's body but I really doubt if it's a good idea.

To be clear, that's exactly what Python does (not once per function but once per random number generator state object). Modulo performance concerns, what do you think "is a good idea"?

That said, I agree that the current behaviour (default to not seeding the PRNG) is surprising and most likely not what people want, so the API could use some improvement. And I see no reason it will stay this way, if only be a good citizen Python citizen.

arvindavoudi commented 6 months ago

Thanks @soraros!

you were merely against requiring the use seed by default rather than seeing as an idea?

Yes. My objection was not to the concept itself but to the mandatory use of the seed function by default. Further investigation revealed that the seed function invokes a MLIR function through the use of external_call, as found in stdlib/sys/ffi.mojo.

To address this, I propose two potential solutions:

  1. Await the introduction of expression support at the file scope level. With this enhancement, we could position seed() immediately following its declaration within stdlib/random/random.mojo. This approach ensures that a unique seed is generated as soon as the random module is imported, thus preventing the reproduction of identical random numbers in successive executions. Importantly, this modification impacts only the initial random number generation, allowing users the flexibility to apply seed() manually without interference, since the initial seeding occurs upon module importation, and subsequent seeding can be user-defined.

  2. Investigate MLIR functions for potential alternatives that we can apply, though this idea may present significant challenges.

Please note, that I'm not sure if we can somehow achieve the first idea at the moment while expressions are not supported at the file scope level or not. Any idea to work around this is highly appreciated.

Moosems commented 6 months ago

@soraros While it's great to use Python as a reference especially as Mojo is a superset, that doesn't mean that they are the source of truth and all end goals. Just because a function may not be great security wise in Python doesn't mean it can't be in Mojo :).

soraros commented 6 months ago

@Moosems I only know the absolutely basics about cryptography, but please do notice that general-purpose PRNGs and cryptographically strong ones have very different trade-offs. It's rather pointless—-or even incorrect-—to expect strong security guarantees from the former. The source of truth is not Python but rather mathematics itself. I'm not sure how to respond to your comment: it reads like Mojo can somehow make a PRNG algorithm cryptographically strong.

We could choose a different ciphers "for increased security", but that's a separate issue.

Moosems commented 6 months ago

Understood, thanks for the clarification and speedy reply!

arvindavoudi commented 6 months ago

That's a great find, @past-moon. However, the users of Mojo expect more Pythonic behaviors. Nonetheless, this is a valid point to consider when we are dealing with a modern language impregnated with hardware-level controls. I still prefer that the seed function be executed automatically after importation (to have the similar functionality we already get in Python) but based on your current explanation, I'm not sure anymore if it's still a practical/logical demand or not.

Moosems commented 5 months ago

https://discord.com/channels/1087530497313357884/1151418092052815884/1228539731852132433

Related :)

JoeLoser commented 5 months ago

Hey everyone, thanks for the lively discussion here! This looks way less than ideal from a usability perspective of us not seeding the RNG before users call random functions.

I'll bring this up with the team once I'm back from PTO (I'm out this week), and we'll prioritize digging into this. Given this and other GItHub issues I've seen, this seems like a big usability foot gun.

FYI @ConnorGray @rparolin

Moosems commented 5 months ago

Thanks Joe :)

JoeLoser commented 5 months ago

@Dan13llljws this would also be a good first issue for you to pick up that would be a huge quality of life improvement for users. Essentially, we don't seed the random number generator internally. For example, see the docs for https://en.cppreference.com/w/cpp/numeric/random/rand in how it calls std::srand() for users automatically if they don't seed themselves.

soraros commented 5 months ago

@JoeLoser Do you think we could use something similar to the new NumPy random API (which is great for scientific computing purposes, thus aligned with Mojo's goals)?

rng = np.random.default_rng()
rng = np.random.default_rng(0)  # with explicit seed

rng.random(..., dtype=...)

From an implementation point of view, we seed once per RNG instance, and don't rely on having top-level expression. From usability point of view, this means explicit/implicit seeding share the same API.