Closed dataopt closed 2 years ago
You may not be comparing like with like: random uses SplitMix, but I think Python still uses the Mersenne Twister.
EDIT: Maybe you could post your Python code as well?
EDIT: I just noticed this https://github.com/haskell/random/blob/master/src/System/Random/Internal.hs#L1130 - should we really be doing floating-point division when we could be doing multiplication?
Incidentally, the timings from numpy in Python are about 40% less than the ST ones.
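For reference on the division-vs-multiplication point: converting the high bits of a Word64 to a Double in [0, 1) is the kind of operation at the linked line. Since 2^53 is a power of two, dividing by it and multiplying by its exact reciprocal give bit-identical results, so the multiplication form is a safe micro-optimisation. A hedged sketch (the names below are mine, not the library's):

```haskell
import Data.Bits (shiftR)
import Data.Word (Word64)

-- Keep the top 53 bits (a Double's significand width), then scale
-- into [0, 1). Division form:
wordToDoubleDiv :: Word64 -> Double
wordToDoubleDiv w = fromIntegral (w `shiftR` 11) / 9007199254740992 -- 2^53

-- Multiplication by the exact reciprocal 2^-53; because the divisor is
-- a power of two, this is bit-identical to the division above:
wordToDoubleMul :: Word64 -> Double
wordToDoubleMul w = fromIntegral (w `shiftR` 11) * 1.1102230246251565e-16 -- 2^-53
```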
Neither randomIO nor random within an STRef will ever give you optimal performance, since both rely on a boxed mutable variable. If you benchmark this code, which uses the pure interface:
```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV
import System.Random

randVec :: RandomGen g => g -> Int -> (V.Vector Double, g)
randVec gInit n =
  runST $ do
    mvec <- MV.unsafeNew n
    let go g i
          | i < n =
            case random g of
              (e, g') -> MV.unsafeWrite mvec i e >> go g' (i + 1)
          | otherwise = pure g
    g <- go gInit 0
    vec <- V.unsafeFreeze mvec
    pure (vec, g)
```
Then you'll get about 25% better performance. Note that the only reason we have to use this unsafe machinery is that vector does not provide an unfolding function that lets us keep the resulting generator.
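As an aside, vector's createT can thread extra state out through any Traversable, which gives a safe variant of the same loop without unsafeNew/unsafeFreeze. A sketch under that assumption (the name randVecT is mine, not from the thread):

```haskell
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV
import System.Random (RandomGen, random, mkStdGen)

-- createT freezes a mutable vector carried inside a Traversable; using
-- a pair lets us smuggle the final generator out alongside the result.
randVecT :: RandomGen g => g -> Int -> (V.Vector Double, g)
randVecT gInit n = swapPair $ V.createT $ do
    mvec <- MV.new n
    let go g i
          | i < n = case random g of
              (e, g') -> MV.write mvec i e >> go g' (i + 1)
          | otherwise = pure g
    g <- go gInit 0
    pure (g, mvec)
  where
    swapPair (g, v) = (v, g)
```

The trade-off is safety for the (small) cost of the bounds-checked new/write, so it will not beat the unsafe version above.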
```
benchmarking Direct/randomIO/100000
time 2.470 ms (2.417 ms .. 2.527 ms)
0.997 R² (0.996 R² .. 0.999 R²)
mean 2.477 ms (2.453 ms .. 2.503 ms)
std dev 85.58 μs (71.85 μs .. 105.4 μs)
variance introduced by outliers: 20% (moderately inflated)
benchmarking Direct/randomIO/1000000
time 26.80 ms (26.50 ms .. 27.09 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 25.85 ms (25.48 ms .. 26.10 ms)
std dev 695.5 μs (478.3 μs .. 971.2 μs)
benchmarking Direct/randomIO/10000000
time 253.6 ms (244.2 ms .. 263.5 ms)
0.999 R² (0.995 R² .. 1.000 R²)
mean 256.9 ms (252.2 ms .. 260.0 ms)
std dev 4.969 ms (2.189 ms .. 7.088 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking Direct/ST/100000
time 978.2 μs (973.9 μs .. 982.7 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 972.3 μs (968.3 μs .. 976.4 μs)
std dev 13.49 μs (11.32 μs .. 16.77 μs)
benchmarking Direct/ST/1000000
time 8.525 ms (8.488 ms .. 8.558 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 8.550 ms (8.542 ms .. 8.567 ms)
std dev 29.49 μs (16.26 μs .. 54.45 μs)
benchmarking Direct/ST/10000000
time 82.71 ms (80.52 ms .. 84.75 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 87.66 ms (84.75 ms .. 92.06 ms)
std dev 6.140 ms (29.62 μs .. 7.686 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking Direct/pure/100000
time 633.6 μs (633.2 μs .. 634.1 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 633.7 μs (633.4 μs .. 634.2 μs)
std dev 1.109 μs (652.2 ns .. 1.819 μs)
benchmarking Direct/pure/1000000
time 6.356 ms (6.342 ms .. 6.366 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 6.350 ms (6.346 ms .. 6.355 ms)
std dev 13.65 μs (10.74 μs .. 18.17 μs)
benchmarking Direct/pure/10000000
time 62.82 ms (62.03 ms .. 63.35 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 64.67 ms (63.48 ms .. 69.36 ms)
std dev 4.024 ms (90.38 μs .. 7.163 ms)
variance introduced by outliers: 16% (moderately inflated)
```
> I am wondering why the IO version is about twice as slow as the ST version.
randomIO is slower than the STRef version because it uses an atomicModifyIORef underneath in order to be thread-safe, and that atomic operation has overhead.
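That difference can be sketched with two ways of stepping a generator stored in an IORef (the helper names are mine, illustrative only, not the library's internals):

```haskell
import Data.IORef
import System.Random (StdGen, mkStdGen, random)

-- Thread-safe step: atomicModifyIORef' does an atomic compare-and-swap
-- loop, which is the synchronization cost being paid.
nextDoubleAtomic :: IORef StdGen -> IO Double
nextDoubleAtomic ref =
  atomicModifyIORef' ref (\g -> case random g of (x, g') -> (g', x))

-- Plain step: a non-atomic read/write pair, fine single-threaded and
-- cheaper, but racy under concurrency.
nextDoublePlain :: IORef StdGen -> IO Double
nextDoublePlain ref = do
  g <- readIORef ref
  case random g of
    (x, g') -> writeIORef ref g' >> pure x
```

Used single-threaded on the same seed, both produce the same stream; only the synchronization cost differs.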
Here is a benchmark that uses the new stateful interface:
```haskell
randVecM :: (FrozenGen g m, RandomGenM (MutableGen g m) r m) => g -> Int -> m (V.Vector Double, g)
randVecM g n = withMutableGen g (V.replicateM n . randomM)
```
```haskell
....
  bgroup
    "IO"
    [ bench "100000" $ nfIO (randVecM (IOGen g) 100000)
    , bench "1000000" $ nfIO (randVecM (IOGen g) 1000000)
    , bench "10000000" $ nfIO (randVecM (IOGen g) 10000000)
    ]
  , bgroup
    "ST"
    [ bench "100000" $ nf (\n -> runST $ randVecM (STGen g) n) 100000
    , bench "1000000" $ nf (\n -> runST $ randVecM (STGen g) n) 1000000
    , bench "10000000" $ nf (\n -> runST $ randVecM (STGen g) n) 10000000
    ]
  , bgroup
    "Atomic"
    [ bench "100000" $ nfIO (randVecM (AtomicGen g) 100000)
    , bench "1000000" $ nfIO (randVecM (AtomicGen g) 1000000)
    , bench "10000000" $ nfIO (randVecM (AtomicGen g) 10000000)
    ]
```
These are the results, which confirm the observation that atomic modification incurs extra overhead:
```
benchmarking Stateful/IO/100000
time 900.0 μs (898.8 μs .. 901.1 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 908.6 μs (906.2 μs .. 913.5 μs)
std dev 10.62 μs (6.196 μs .. 19.50 μs)
benchmarking Stateful/IO/1000000
time 8.473 ms (8.441 ms .. 8.503 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 8.484 ms (8.478 ms .. 8.493 ms)
std dev 20.11 μs (15.07 μs .. 29.85 μs)
benchmarking Stateful/IO/10000000
time 82.26 ms (80.48 ms .. 84.19 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 86.99 ms (84.20 ms .. 91.74 ms)
std dev 5.891 ms (80.78 μs .. 7.785 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking Stateful/ST/100000
time 1.009 ms (1.005 ms .. 1.015 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.003 ms (1.000 ms .. 1.006 ms)
std dev 8.842 μs (6.812 μs .. 12.90 μs)
benchmarking Stateful/ST/1000000
time 8.457 ms (8.424 ms .. 8.485 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 8.479 ms (8.473 ms .. 8.489 ms)
std dev 20.84 μs (12.99 μs .. 30.09 μs)
benchmarking Stateful/ST/10000000
time 82.19 ms (80.14 ms .. 84.34 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 87.19 ms (84.23 ms .. 93.15 ms)
std dev 6.274 ms (77.31 μs .. 7.858 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking Stateful/Atomic/100000
time 2.716 ms (2.708 ms .. 2.728 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 2.648 ms (2.625 ms .. 2.666 ms)
std dev 65.43 μs (52.37 μs .. 79.03 μs)
variance introduced by outliers: 11% (moderately inflated)
benchmarking Stateful/Atomic/1000000
time 25.71 ms (25.56 ms .. 25.84 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 25.72 ms (25.63 ms .. 26.02 ms)
std dev 333.8 μs (83.26 μs .. 634.8 μs)
benchmarking Stateful/Atomic/10000000
time 261.5 ms (243.3 ms .. 279.5 ms)
0.998 R² (0.996 R² .. 1.000 R²)
mean 268.4 ms (261.7 ms .. 273.5 ms)
std dev 8.063 ms (4.818 ms .. 9.177 ms)
variance introduced by outliers: 16% (moderately inflated)
```
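For completeness, here is a minimal standalone use of the stateful interface outside a benchmark harness, assuming random >= 1.2 (where newIOGenM and randomM live in System.Random.Stateful):

```haskell
import qualified Data.Vector.Unboxed as V
import System.Random.Stateful

-- Allocate a mutable IO-based generator from a pure seed, then fill a
-- vector by repeatedly drawing from it. No manual generator threading.
fillIO :: Int -> IO (V.Vector Double)
fillIO n = do
  gen <- newIOGenM (mkStdGen 2021)
  V.replicateM n (randomM gen)
```

Swapping newIOGenM for newAtomicGenM gives the thread-safe (and, per the numbers above, slower) variant with no other code changes.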
For the future, this sort of question is better suited for StackOverflow, where there will be plenty of people happy to answer questions like this one. Of course, if further investigation points in the direction of a potential problem in the library, then by all means please open an issue.
Bonus: If you are trying to beat numpy on performance, use massiv, which can create a random vector for you using all of your cores:
```haskell
import Data.Massiv.Array
import Data.Massiv.Array.Manifest.Vector (toVector)
....

randMassiv :: RandomGen g => Comp -> g -> Int -> (V.Vector Double, g)
randMassiv comp g n =
  case split g of
    (gl, gr) -> (toVector $ computeAs U $ randomArray gr split random comp (Sz1 n), gl)
```
```haskell
...
  , bgroup
      "randomArray"
      [ bgroup
          "Seq"
          [ bench "100000" $ nf (randMassiv Seq g) 100000
          , bench "1000000" $ nf (randMassiv Seq g) 1000000
          , bench "10000000" $ nf (randMassiv Seq g) 10000000
          ]
      , bgroup
          "Par"
          [ bench "100000" $ nf (randMassiv Par g) 100000
          , bench "1000000" $ nf (randMassiv Par g) 1000000
          , bench "10000000" $ nf (randMassiv Par g) 10000000
          ]
      ]
```
Here are the results for a sequential and a parallel run on a 16-core Ryzen:
```
benchmarking Massiv/randomArray/Seq/100000
time 742.3 μs (708.2 μs .. 779.1 μs)
0.985 R² (0.976 R² .. 0.993 R²)
mean 770.5 μs (740.6 μs .. 829.0 μs)
std dev 152.3 μs (72.76 μs .. 240.9 μs)
variance introduced by outliers: 93% (severely inflated)
benchmarking Massiv/randomArray/Seq/1000000
time 7.071 ms (6.818 ms .. 7.290 ms)
0.992 R² (0.987 R² .. 0.996 R²)
mean 7.001 ms (6.864 ms .. 7.169 ms)
std dev 449.4 μs (352.3 μs .. 631.3 μs)
variance introduced by outliers: 36% (moderately inflated)
benchmarking Massiv/randomArray/Seq/10000000
time 65.10 ms (61.49 ms .. 68.02 ms)
0.997 R² (0.996 R² .. 1.000 R²)
mean 65.52 ms (63.91 ms .. 71.14 ms)
std dev 4.673 ms (1.414 ms .. 7.851 ms)
variance introduced by outliers: 17% (moderately inflated)
benchmarking Massiv/randomArray/Par/100000
time 231.3 μs (202.9 μs .. 277.4 μs)
0.875 R² (0.786 R² .. 0.959 R²)
mean 249.4 μs (223.9 μs .. 281.5 μs)
std dev 83.07 μs (58.69 μs .. 113.4 μs)
variance introduced by outliers: 98% (severely inflated)
benchmarking Massiv/randomArray/Par/1000000
time 684.3 μs (645.9 μs .. 724.6 μs)
0.970 R² (0.957 R² .. 0.984 R²)
mean 778.8 μs (714.0 μs .. 937.9 μs)
std dev 323.1 μs (148.0 μs .. 623.6 μs)
variance introduced by outliers: 99% (severely inflated)
benchmarking Massiv/randomArray/Par/10000000
time 5.610 ms (5.423 ms .. 5.832 ms)
0.983 R² (0.970 R² .. 0.992 R²)
mean 5.969 ms (5.782 ms .. 6.229 ms)
std dev 664.1 μs (428.1 μs .. 1.126 ms)
variance introduced by outliers: 64% (severely inflated)
```
I am puzzled by the results from the following benchmarking code:
The output from my Mac (macOS 12.5.1, GHC 8.10.7) is as follows (some parts removed):
I am wondering why the IO version is about twice as slow as the ST version. Could this be an error in the way the benchmarking code was written? Incidentally, the timings from numpy in Python are about 40% less than the ST ones. Is there still room for performance improvement while staying in pure Haskell?