pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Seed RNG when using random numbers #18072

Open cbrnr opened 1 month ago

cbrnr commented 1 month ago

Description

The Contexts page in the user guide (and possibly more pages like Basic operators) uses an example data frame which contains a column with five random numbers:

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)

This code is not reproducible, which means that readers cannot compare their results with the documentation (which is especially problematic further down that page in the aggregation example).

I see three possibilities to make this example reproducible:

  1. Use random.seed(42) followed by random.uniform(0, 1) from the standard library. I suggest that import random should be included in the example.
  2. Keep using NumPy, but follow the recommended approach: create a seeded generator with rng = np.random.default_rng(seed=42) and draw the numbers with rng.random(5). I suggest that import numpy as np should be included in the example.
  3. Simply provide some fixed (hard-coded) "random" floating-point numbers.

For this example, I don't think it matters whether the numbers are truly random, so I'd tend to prefer option (3), but let me know. It would also be nice if the Python and Rust examples were consistent, which would be easiest with option (3) (otherwise, the RNG would have to be seeded in the Rust code too). I'm happy to submit a PR.

Link

https://docs.pola.rs/user-guide/concepts/contexts/

cbrnr commented 1 month ago

I just saw that the code behind the docs is in fact seeded (at least for Python, not sure about Rust), but then I think it would be helpful to show the seeding in the rendered example.