Performance compared to Rust randomkit

rohitjoshi commented 6 years ago

Some of the distributions are slower compared to Randomkit. Randomkit is a FFI wrapper around numpy's randomstat implementation.

e.g . See the below benchmark. Any thoughts?

test statrs::dirichlet_rust            ... bench:         872 ns/iter (+/- 302)
test randkit::tests::dirichlet_rust    ... bench:         448 ns/iter (+/- 63)

test statrs::sigma_rust                ... bench:         424 ns/iter (+/- 95)
test randkit::tests::sigma_rust        ... bench:         351 ns/iter (+/- 52)

statrs:

fn dirichlet(num_mixture: usize) -> Vec<f64> {
  let mut alpha = Vec::with_capacity(.num_mixture);
  for i in 0..num_mixture {
    alpha.push(1.0f64 / self.prior_conf.num_mixture as f64);
  }
  let mut d = Dirichlet::new(alpha.as_slice()).unwrap();
  let mut rng = rand::thread_rng();
  d.sample(&mut rng)
}

fn sigma(num_mixture: usize) -> Vec<f64> {
  let nu = 3.0f64;
  let shape = 0.5f64 * nu;
  let scale = 1.0f64 / (0.5f64 * nu * 1.0f64);
  let mut v = Vec::with_capacity(num_mixture);
  let mut g = Gamma::new(shape, scale).unwrap();
  let mut rng = rand::thread_rng();
  for i in 0..num_mixture {
    v.push(1.0f64 / g.sample(&mut rng));
  }
  return v;
}

randomkit:

fn dirichlet(num_mixture: usize) -> Vec<f64> {
  let mut alpha = Vec::with_capacity(num_mixture);
  for i in 0..num_mixture {
    alpha.push(1.0f64 / num_mixture as f64);
  }
  let mut d = Dirichlet::new(alpha).unwrap();
  let rng: Rng = Rng::new().unwrap();
  d.sample(&mut rng)
}

fn sigma(num_mixture: usize) -> Vec<f64> {
  let nu = 3.0f64;
  let shape = 0.5f64 * nu;
  let scale = 1.0f64 / (0.5f64 * nu * 1.0f64);
  let mut v = Vec::with_capacity(num_mixture);
  let mut g = Gamma::new(shape, scale).unwrap();
  let rng: Rng = Rng::new().unwrap();
  for i in 0..num_mixture {
    v.push(1.0f64 / g.sample(&mut rng));
  }
  return v;
}

bheklilr commented 6 years ago

I think we would have to dig down into these functions to find what the actual difference is coming from. In dirichlet, is it from the as_slice call from creating Dirichlet vs randomkit's version that doesn't require as_slice? Is the difference in d.sample(&mut rng) itself? I'm less worried by the difference in sigma since that's only 75ns difference, but since dirichlet is almost twice as slow with a large range (+/- 302) that seems like we would be able to make some improvements more easily.

boxtown commented 6 years ago

If I had to guess, it would be that the sampling function for the gamma distribution in NumPy is more optimized than ours. Either that or the fact that we are doing FP division in a loop on https://github.com/boxtown/statrs/blob/master/src/distribution/dirichlet.rs#L147 instead of multiplying by the inverse. I'll do some digging when I have the time but anyone is free to submit PRs for optimizations

rohitjoshi commented 6 years ago

Vector index is used to assign/update value. Maybe by using iterator would avoid the bound check.

https://github.com/boxtown/statrs/blob/master/src/distribution/dirichlet.rs#L143

rohitjoshi commented 6 years ago

Randomkit uses need!(alpha.iter().all(|a| *a > 0.0), "need all alpha > 0.0"); for alpha verification vs statrs uses is_valid_multinomial() : https://github.com/boxtown/statrs/blob/master/src/distribution/internal.rs#L4

boxtown commented 6 years ago

There is definitely an opportunity for optimization by replacing vector indexing with either iterator use or unsafe indexing, although it seems like Randomkit also uses vector indexing. With regards to the verification check however, I think I'd personally prefer the stronger guarantees (i.e. no NaNs) than any possible speed ups (which I'm assuming to be pretty nominal though I could be wrong). I'd be willing to reconsider though if there were strong evidence that the check itself is the root cause of the performance difference

rohitjoshi commented 6 years ago

I replaced index with iterator but don't see much difference. Also, alpha validation shows very little improvement so seems bottleneck is elsewhere.

boxtown commented 6 years ago

Another possible case is that the RNG used by Randomkit is less computationally intensive than the one from the Rust rand crate

vks commented 6 years ago

Indeed Randomkit uses the Mersenne Twister instead of an unpredictable crypto RNG like rand does by default. It is amazing that rand is still competitive! If you want to compare statrs and Randomkit, you should use the same RNG. I would suggest to redo the benchmark using mersenne_twister or by implementing rand::Rng for randomkit::Rng.

rohitjoshi commented 6 years ago

Thanks. I think numpy is primarily used for ML/stats and doesn't require crypto random.

With mt19937 and sfmt, I see significant improvement and outperforms numpy for sigma test case but still slower for Dirichlet.

test statrs::sigma_rust                      ... bench: 134,685,636 ns/iter (+/- 31,823,013)
test statrs::sigma_mt19937_rust              ... bench:  37,462,327 ns/iter (+/- 4,305,331)
test statrs::sigma_sfmt_rust                 ... bench:  35,474,413 ns/iter (+/- 9,619,130)

test randkit:::sigma_rust                    ... bench: 109,653,185 ns/iter (+/- 11,162,713)

test statrs::dirichlet_rust                  ... bench: 220,099,086 ns/iter (+/- 13,985,720)
test statrs::dirichlet_mt19937_rust          ... bench: 149,702,831 ns/iter (+/- 18,926,881)
test statrs::dirichlet_sfmt_rust             ... bench: 129,807,223 ns/iter (+/- 8,436,193)

test randkit::dirichlet_rust                 ... bench:  98,681,855 ns/iter (+/- 20,803,202)

dhardy commented 6 years ago

Suggestion: why not compare instead to an optimised C library, namely the GNU Scientific Library?

There are some optimisations in the GSL not in Statrs (specifically around the Gamma distribution, which I don't understand well enough to check correctness of).

statrs-dev / statrs

Performance compared to Rust randomkit #83