MRG: working up to standard score page

matthew-brett commented 1 year ago

@stefanv - any thoughts on direction?

pxr687 commented 1 year ago

I like the example. One thing I might add, though, is a set of plots illustrating the paragraph below with some simple, idealized data (e.g. to visualise the concept before it is revisited later in the chapter):

Ranks and quantile positions give an idea whether the measure is high or low compared to the other values, but they do not immediately tell us whether the measure is exceptional or unusual. To do that, we may want to ask whether the measure falls outside the typical range of values — that is, how the measure compares to the distribution of values. One common way of doing this is to re-express the measures (values) as standard scores, where the standard score for a particular value tells you how far the value is from the center of the distribution, in terms of the typical spread of the distribution. Standard values are particularly useful to allow us to compare different types of measures on a standard scale. They translate the units of measurement into standard and comparable units. We will explain this more towards the end of the chapter.

So maybe a couple of graphs, just visually motivating why you would want a standard score, which might be something like:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

set_of_scores_1 = np.random.normal(100, 15, 10000)

set_of_scores_2 = np.random.normal(600, 100, 10000)

# show separate distributions
plt.figure()
plt.hist(set_of_scores_1, color = "blue")
plt.axvline(np.sort(set_of_scores_1)[-20], color = 'cyan', label = 'Score of interest \n(Distribution 1)')
plt.plot([np.percentile(set_of_scores_1, 2.5), np.percentile(set_of_scores_1, 97.5)], [0, 0],
         color = "silver", linewidth = 10, label = "95% spread of Distribution 1")
plt.xlabel("Some Variable")
plt.legend(bbox_to_anchor = (1,1))
plt.title("Distribution 1")
plt.show()

plt.figure()
plt.hist(set_of_scores_2, color = 'orange')
plt.axvline(np.sort(set_of_scores_2)[40], color = 'red', label = 'Score of interest \n(Distribution 2)')
plt.plot([np.percentile(set_of_scores_2, 2.5), np.percentile(set_of_scores_2, 97.5)], [0, 0],
         color = "gold", linewidth = 10, label = "95% spread of Distribution 2")
plt.xlabel("Some Other Variable")
plt.legend(bbox_to_anchor = (1,1))
plt.title("Distribution 2")
plt.show()

# show why they are hard to compare
plt.figure()
plt.hist(set_of_scores_1, color = "blue", label = "Distribution 1")
plt.hist(set_of_scores_2, color = "orange", label = "Distribution 2")
plt.xlabel("Both Variables")
plt.axvline(np.sort(set_of_scores_1)[-20], color = 'cyan', label = 'Score of interest \n(Distribution 1)')
plt.axvline(np.sort(set_of_scores_2)[40], color = 'red', label = 'Score of interest \n(Distribution 2)')
plt.plot([np.percentile(set_of_scores_1, 2.5), np.percentile(set_of_scores_1, 97.5)], [0, 0],
         color = "silver", linewidth = 10, label = "95% spread of Distribution 1")
plt.plot([np.percentile(set_of_scores_2, 2.5), np.percentile(set_of_scores_2, 97.5)], [0, 0],
         color = "gold", linewidth = 10, label = "95% spread of Distribution 2")
plt.legend(bbox_to_anchor = (1,1))
plt.title("Why scores are hard to compare across these distributions")
plt.show()

# show why standard scores are useful
set_of_scores_1_standard = stats.zscore(set_of_scores_1)
set_of_scores_2_standard = stats.zscore(set_of_scores_2)
plt.figure()
plt.subplot(1, 2, 1)
plt.hist(set_of_scores_1_standard, color = "blue", label = "Distribution 1")
plt.plot([np.percentile(set_of_scores_1_standard, 2.5), np.percentile(set_of_scores_1_standard, 97.5)], [0, 0],
         color = "silver", linewidth = 10, label = "95% spread of Distribution 1")
plt.axvline(np.sort(set_of_scores_1_standard)[-20], color = 'cyan', label = 'Score of interest \n(Distribution 1)')
plt.xlabel("Some Variable")

plt.subplot(1, 2, 2)
plt.hist(set_of_scores_2_standard, color = "orange", label = "Distribution 2")
plt.xlabel("Some Other Variable")
plt.axvline(np.sort(set_of_scores_2_standard)[40], color = 'red', label = 'Score of interest \n(Distribution 2)')
plt.plot([np.percentile(set_of_scores_2_standard, 2.5), np.percentile(set_of_scores_2_standard, 97.5)], [0, 0],
         color = "gold", linewidth = 10, label = "95% spread of Distribution 2")
# Title the whole figure rather than only the right-hand subplot
plt.suptitle("Why standard scores are useful for comparing across distributions")
plt.hist([], color = "blue", label = "Distribution 1")
plt.plot([], color = 'cyan', label = 'Score of interest \n(Distribution 1)')
plt.plot([],[], color = "silver", linewidth = 10,
         label = "95% spread of Distribution 1")
plt.legend(bbox_to_anchor = (1,1))
plt.show()

I'm thinking something with hidden code (maybe just pictures of the plots). But I think plots to explain the content of that paragraph might make it easier to digest?

The district income data is nice for introducing the mechanics of the standard score, but I think that paragraph quoted above is getting at the core use of standard scores. I think the rough sequence of graphs above might be satisfying before the topic is formally revisited later in the chapter? E.g. it would show why the standard scores are useful then later show how they are achieved?
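
For the "how they are achieved" part, here is a rough sketch of the calculation, assuming the later section uses the usual definition (subtract the mean, divide by the standard deviation). The helper name `standard_scores` and the seeded generator are only for illustration; they are not from the chapter. With the default `ddof=0`, this should match what `stats.zscore` returns in the plots above.

```python
import numpy as np

def standard_scores(values):
    """Re-express values as distances from the mean, in units of spread.

    Hypothetical helper for illustration; uses the population standard
    deviation (ddof=0), which is also the default for scipy.stats.zscore.
    """
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

# Seeded generator so the sketch is repeatable (an assumption; the
# plotting code above uses unseeded np.random.normal).
rng = np.random.default_rng(0)
scores_1 = rng.normal(100, 15, 10000)
scores_2 = rng.normal(600, 100, 10000)

z_1 = standard_scores(scores_1)
z_2 = standard_scores(scores_2)

# After standardizing, both sets have mean ~0 and standard deviation ~1,
# so a score from either set can be read on the same scale.
print("Distribution 1 mean, sd:", z_1.mean(), z_1.std())
print("Distribution 2 mean, sd:", z_2.mean(), z_2.std())

# The same "scores of interest" as in the plots: a high value from the
# first set and a low value from the second set.
print("Score of interest 1 (standard):", np.sort(z_1)[-20])
print("Score of interest 2 (standard):", np.sort(z_2)[40])
```

Standardizing does not change the order of the values, so the score of interest keeps its rank; only the units change.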

matthew-brett commented 1 year ago

@ben-herbst - this was the PR we were talking about in the call - about quantiles and standard scores.

ben-herbst commented 1 year ago

Received this email, the one with the invitation is probably still on its way.



pxr687 commented 1 year ago

Just putting this here as I think it might be a nice way to introduce what a quantile is (if that is needed) and the different methods of computing them:

import numpy as np

def get_25_50_75_100_percentile(arr):
    # Interpolation methods to compare
    methods = ('linear',  # The default
               'midpoint',
               'lower')

    print("Dataset : ", np.sort(arr))

    for method in methods:
        if method == "linear":
            print("\nMethod = ", method+" (numpy default)")
        else:
            print("\nMethod = ", method)
        print("25th percentile : ", np.quantile(arr, .25, method=method))
        print("50th percentile : ", np.quantile(arr, .50, method=method))
        print("75th percentile : ", np.quantile(arr, .75, method=method))
        print("100th percentile : ", np.quantile(arr, 1, method=method))

arr = np.arange(1, 101) # and then subsequently: np.random.poisson(10, 10).round(2)
get_25_50_75_100_percentile(arr)
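
To make the default method less of a black box, here is a minimal sketch of what `'linear'` does by hand: find the fractional position `q * (n - 1)` in the sorted data, then interpolate between the two neighbouring values. The function name `linear_quantile` is made up for this sketch; it is not a NumPy function.

```python
import numpy as np

def linear_quantile(values, q):
    """Quantile by linear interpolation between neighbouring sorted values.

    Hypothetical helper for illustration; intended to mirror
    np.quantile(values, q, method='linear') for 1D input and a single q.
    """
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    # Fractional (zero-based) position of the quantile in the sorted data
    pos = q * (len(sorted_vals) - 1)
    lower = int(np.floor(pos))
    upper = int(np.ceil(pos))
    # How far the position sits between the lower and upper neighbours
    frac = pos - lower
    return sorted_vals[lower] + frac * (sorted_vals[upper] - sorted_vals[lower])

arr = np.arange(1, 101)
for q in (0.25, 0.50, 0.75, 1):
    print(q, "by hand:", linear_quantile(arr, q),
          "np.quantile:", np.quantile(arr, q))
```

The `'lower'` method instead takes the value at the floor of that position, and `'midpoint'` averages the two neighbours when the position falls between them.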

matthew-brett commented 1 year ago

I worked a bit more on the quantile explanation - any thoughts?

matthew-brett commented 1 year ago

This one is closer to the final shape now, I think - any takers for more review?

ben-herbst commented 1 year ago

I'm looking at Chapter 3, "What is probability and what can we do with it?" (./what_is_probability.html). I should be able to put in a PR later today.

We also need to revise Chapter 28, "Correlation and Causation". It has a dated view of causality. If that is a priority, I'll be happy to tackle it next. Otherwise, I can take on anything of higher priority.



matthew-brett commented 1 year ago

Do you mind tweaking the first sentence: "Sometimes have a"?

I've rewritten the first paragraph slightly.

resampling-stats / resampling-with

MRG: working up to standard score page #134