rust-random / rand

A Rust library for random number generation.
https://crates.io/crates/rand
Other
1.6k stars 423 forks source link

Add distribution plots to `rand_distr` documentation #1434

Open MichaelOwenDyer opened 2 months ago

MichaelOwenDyer commented 2 months ago

Summary

This PR adds some graphs to the documentation of different distributions in rand_distr to illustrate their behavior.

Motivation

Make the distributions in this crate easier to understand via visual aids.

Details

The diagrams will be hosted in https://github.com/rust-random/charts (see this PR) and links will be inserted into the documentation.

Closes https://github.com/rust-random/rand/issues/131.

Documentation Progress:

MichaelOwenDyer commented 2 months ago

Need to look into it more, but maybe whatever solution we decide on here could also be used for https://github.com/rust-random/rand/issues/278

dhardy commented 2 months ago

However, this also adds a transitive dependency to syn, which is not great.

Not an issue: you already feature-gated it.

The diagrams are generated with Python code, which is included in the PR for reference.

Good to have in the same repo, but it doesn't need to be in the generated crate package. Could you try investigating why this happens? We already have an explicit include rule in the Cargo.toml, which seems like it should be enough.

We should also talk about the best place to put the diagrams (currently there is one assets folder for rand and another for rand_distr, but this might not work).

We don't package any of these files except by embedding, so we don't need an assets dir and we don't need these separated by crate, but I prefer that the generated resources are next to the generating script. Having the assets in crate-specific dirs but not the script is especially weird. Maybe just move diagrams/src to assets/src, also splitting by crate?

Or, simpler: forget Bernoulli (doesn't really need it) and put all the rest under rand_distr/plots. (Also solves my nit below.)

Plots

Bernoulli

X-axis values should be false, true; not 0, 1.

Binomial

Can we add a second parameterisation, e.g. n=6, p=0.5?

Cauchy

:+1: (straight copy from WP, I see)

newpavlov commented 2 months ago

I think SVG would be a better fit (separate files, not embedded into HTML). You should be able to generate the same plots with Python without issues.

dhardy commented 2 months ago

@newpavlov you missed the discussions in #131 (specifically, how to make this work on docs.rs).

SVG appears larger in some cases, smaller in others. It does scale better.

newpavlov commented 2 months ago

Ah, I misread the discussion and thought it was about embedding SVG into HTML. I don't think the terrible diff objection is important, plus we probably should use compression either way (i.e. use SVGZ instead of plain SVG). Though there may be some issues with browser support of SVGZ files.

In RustCrypto we use links to images which reside in the RustCrypto/media repository, e.g. see docs for the ctr crate. It means that we rely on raw.githubusercontent.com instead of packaging the image together with crate, but since the images are purely for illustrative purposes, I think it's a reasonable risk to take.

MichaelOwenDyer commented 2 months ago

Good to have in the same repo, but it doesn't need to be in the generated crate package. Could you try investigating why this happens? We already have an explicit include rule in the Cargo.toml, which seems like it should be enough.

I think it may be because the folder was named src. I've moved it and renamed it to py, could you please check again?

forget Bernoulli (doesn't really need it) and put all the rest under rand_distr/plots

Got it. That also removes the need for the added dependency on rand, now it only remains in rand_distr.

Binomial Can we add a second parameterisation, e.g. n=6, p=0.5?

I went ahead and added n=10, p=0.6 because I think it looks better next to n=10, p=0.2. If you think having a demonstration of a different n is important, then I could experiment some more. Maybe something like n=20, p=0.5 would look better.

I think SVG would be a better fit (separate files, not embedded into HTML). You should be able to generate the same plots with Python without issues.

I prefer SVG as well. However, I just did the switch and the diff count exploded to +11,000. Wowza.

newpavlov commented 2 months ago

The Python script folder should probably be explicitly excluded from packaged crate (check if it gets included by running cargo package).

It also may be worth to consider the separate repo approach described above. It will be a simpler solution than relying on embed-doc-image and will result in a smaller package size. I will leave the final decision to @dhardy.

MichaelOwenDyer commented 2 months ago

The Python script folder should probably be explicitly excluded from packaged crate (check if it gets included by running cargo package).

Ah, thanks for the tip! According to this include and exclude should not be used at the same time. But I found the problem, had to change src/ to /src :)

It also may be worth to consider the separate repo approach described above. It will be a simpler solution than relying on embed-doc-image and will result in a smaller package size. I will leave the final decision to @dhardy.

Certainly. I'm not attached to embedding them if we find a better solution.

dhardy commented 2 months ago

I prefer SVG as well. However, I just did the switch and the diff count exploded to +11,000. Wowza.

Diffs normally ignore binary files; this is just because SVG is text based? If so, switching to SVGZ may avoid this?

had to change src/ to /src

Aha!

It also may be worth to consider the separate repo approach described above.

The only real issue is versioning. I see usage of commit hashes in the URLs that ctr uses; this or tags works.

There is also the minor issue that repo clones don't include all docs, but we already have that with the book.

The other issue is that anyone trying to make themselves a copy of the docs for offline usage likely won't have the plots, but that's not critical.

Considering these plots aren't directly tied to the Rust code in any way, a separate repo probably is the way to go.

MichaelOwenDyer commented 2 months ago

Diffs normally ignore binary files; this is just because SVG is text based? If so, switching to SVGZ may avoid this?

Yes, SVGZ is better in this regard. However embed-doc-image doesn't support it.

a separate repo probably is the way to go.

Okay then! Want to create it?

dhardy commented 2 months ago

Okay then! Want to create it?

https://github.com/rust-random/charts

MichaelOwenDyer commented 2 months ago

Can you authorize me to push in that repo, or should I make a fork?

dhardy commented 2 months ago

@MichaelOwenDyer you should already have admin privileges on that repo.

MichaelOwenDyer commented 2 months ago

Hm, strange, I am getting a 403 whenever I try to push.

Edit: Nevermind, I logged in using a GitHub token and then it worked.

vks commented 2 months ago

Diffs normally ignore binary files; this is just because SVG is text based?

We can tell Git via attributes to treat the files as binary if necessary.

MichaelOwenDyer commented 2 weeks ago

Hi @dhardy @newpavlov @vks

Sorry for the huge delay in getting this PR done, I've had a lot on my plate at uni. This is ready for review now, please take a look through the docs and see if you can spot anything that needs further improvement. Looking forward to finally getting this merged!

MichaelOwenDyer commented 2 weeks ago

@dhardy, could you take a look at the docs for the Geometric distribution and tell me if you think the plot is consistent? I need a sanity check. The docs describe "the number of failures before the first success in a series of Bernoulli trials", but the x-axis on the plot starts at 1, which I think represents the number of trials before the first success (also counting the success itself). I think the plot should start at 0 for consistency with the docs and the actual implementation, no?

MichaelOwenDyer commented 2 weeks ago

Also, is there a reason we use σ for the scale parameter in Gumbel instead of β which I see in the literature?

dhardy commented 2 weeks ago

@dhardy, could you take a look at the docs for the Geometric distribution and tell me if you think the plot is consistent?

The way Wikipedia puts it is that there are two definitions of the distribution. The code and the PMF count the number of failures, while your plot and description count until the first success.

dhardy commented 2 weeks ago

Also, is there a reason we use σ for the scale parameter in Gumbel instead of β which I see in the literature?

That's presumably an error in the documentation since the code just uses scale internally.

MichaelOwenDyer commented 2 weeks ago

The way Wikipedia puts it is that there are two definitions of the distribution.

Okay, yes I saw that too. I will adjust the x-axis of the plot to start at 0.

That's presumably an error in the documentation since the code just uses scale internally.

I'll fix it then 🤠

MichaelOwenDyer commented 2 weeks ago

Another thing I noticed: We are using a in the documentation and implementation for the Zeta distribution, but s is what I am seeing online. Any reason for this?

Also, should I make an effort to embed some more links to Wikipedia across the documentation? Some already exist.

dhardy commented 2 weeks ago

Another thing I noticed: We are using a in the documentation and implementation for the Zeta distribution, but s is what I am seeing online. Any reason for this?

@vks implemented this (#1136). @vks?

Also, should I make an effort to embed some more links to Wikipedia across the documentation? Some already exist.

WP is a useful resource; I guess this would be convenient for some people but not important.

vks commented 2 weeks ago

@MichaelOwenDyer Can you be more specific? The notation and naming are not very consistent across the literature. For example, what we call Zeta is sometimes called Zipf, or vice versa.

MichaelOwenDyer commented 2 weeks ago

@vks I'm referring to the struct at zipf::Zeta. zipf::Zipf uses s already, but zipf::Zeta uses a. I didn't see anything online which lists the Zeta parameter as a, see wikipedia: https://en.wikipedia.org/wiki/Zeta_distribution

vks commented 2 weeks ago

I guess the a comes from the reference, but I no longer have access to it to check. I think I did not want to change it in the implementation to avoid getting confused when implementing the algorithm from the paper.

I'm fine with changing it to s now, as you say it's more consistent with our Zipf docs, and the usual parameter name chosen for the Riemann zeta function.

MichaelOwenDyer commented 1 week ago

Okay. We may want to do that in a different PR since ZetaError::ATooSmall is pub, and I believe renaming that would be a breaking change.