oldoc63 / learningDS

Learning DS with Codecademy and Books

Probability Density Functions #426

Open oldoc63 opened 2 years ago

oldoc63 commented 2 years ago

Similar to how discrete random variables relate to probability mass functions, continuous random variables relate to probability density functions. They define the probability distributions of continuous random variables and span across all possible values that the given random variable can take on.

When graphed, a probability density function is a curve across all possible values the random variable can take on, and the total area under this curve adds up to 1.
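The "total area equals 1" property can be checked numerically. The sketch below is illustrative only: it assumes a standard normal distribution, Normal(0, 1), as an example (the notes have not picked a distribution at this point) and approximates the area with a simple Riemann sum.

```python
import numpy as np
from scipy import stats

# Assumed example distribution (not from the notes): Normal(0, 1)
mean, sd = 0.0, 1.0

# Evenly spaced grid covering 8 standard deviations on each side of the mean
x = np.linspace(mean - 8 * sd, mean + 8 * sd, 20001)
dx = x[1] - x[0]

# Riemann-sum approximation of the area under the density curve
area = np.sum(stats.norm.pdf(x, mean, sd)) * dx
print(area)  # very close to 1
```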

The following image shows a probability density function. The highlighted area represents the probability of observing a value within the highlighted range.

oldoc63 commented 2 years ago

With a probability density function, we cannot get a meaningful (nonzero) probability for a single point. This is because the area under the curve at a single point is always zero.

As the interval becomes smaller, the width of the area under the curve becomes smaller as well. When trying to evaluate the area under the curve at a specific point, the width of that area becomes 0, and therefore the probability equals 0.

We can calculate the area under the curve using the cumulative distribution function for the given probability distribution.
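To see the shrinking-interval argument in numbers, here is a small sketch. It assumes a standard normal distribution, Normal(0, 1), purely for illustration, and uses differences of the cumulative distribution function to get the area over narrower and narrower intervals centered on one point.

```python
from scipy import stats

# Assumed example distribution (not from the notes): Normal(0, 1)
center = 0.0
widths = [2.0, 1.0, 0.1, 0.001, 0.0]

probs = []
for w in widths:
    # Area under the curve between center - w/2 and center + w/2
    p = stats.norm.cdf(center + w / 2, 0, 1) - stats.norm.cdf(center - w / 2, 0, 1)
    probs.append(p)
    print(w, p)
```

The printed probabilities shrink with the interval width and reach 0 when the width is 0.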

For example, heights fall under a type of probability distribution called a normal distribution. The parameters for the normal distribution are the mean and the standard deviation, and we use the form Normal(mean, standard deviation) as shorthand.

We know that women’s heights have a mean of 167.64 cm with a standard deviation of 8 cm, which makes them fall under the Normal(167.64, 8) distribution.

Let’s say we want to know the probability that a randomly chosen woman is less than 158 cm tall. We can use the cumulative distribution function to calculate the area under the probability density function curve to the left of 158 to find that probability.

Image

oldoc63 commented 2 years ago

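We can calculate the area of the blue region in Python using the norm.cdf() method from the scipy.stats module. This method takes three values: x (the value of interest), loc (the mean of the distribution), and scale (the standard deviation of the distribution).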
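As a sketch, the call for the 158 cm example might look like this, with the arguments passed in the order x, loc (mean), scale (standard deviation). The variable name prob_158 is only illustrative.

```python
import scipy.stats as stats

# P(height < 158) for heights following Normal(167.64, 8)
prob_158 = stats.norm.cdf(158, 167.64, 8)
print(prob_158)  # roughly 0.11
```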

oldoc63 commented 2 years ago

Following the same Normal(167.64, 8) distribution, assign the variable prob the probability that a randomly chosen woman is less than 175 cm tall. You should use the stats.norm.cdf() method. Print prob.
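One possible solution to this exercise, assuming scipy.stats is imported under the name stats as the exercise text suggests:

```python
import scipy.stats as stats

# P(height < 175) for heights following Normal(167.64, 8)
prob = stats.norm.cdf(175, 167.64, 8)
print(prob)  # roughly 0.82
```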

oldoc63 commented 2 years ago

Nice YouTube: https://youtu.be/YXLVjCKVP7U

oldoc63 commented 2 years ago

Probability Density Functions and Cumulative Distribution Function

We can take the difference between two overlapping ranges to calculate the probability that a random selection will fall within a range of values for continuous distributions. This is essentially the same process as calculating the probability of a range of values for discrete distributions.

https://www.evernote.com/shard/s468/sh/e8393975-2a93-9f75-ce9b-ed2cab127776/64580a2c2418b8fcb5e02bb2b829fc48

oldoc63 commented 2 years ago

Let's say we wanted to calculate the probability of randomly observing a woman between 165 cm and 175 cm tall, assuming heights still follow the Normal(167.64, 8) distribution. We can calculate the probability of observing 175 cm or less and the probability of observing 165 cm or less. The difference between these two probabilities will be the probability of randomly observing a woman in this given range. This can be done in Python using the norm.cdf() method from the scipy.stats module. As mentioned before, this method takes three values: x (the value of interest), loc (the mean), and scale (the standard deviation).
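A minimal sketch of the range calculation, using the Normal(167.64, 8) distribution given earlier in these notes. The variable names are illustrative.

```python
import scipy.stats as stats

mean, sd = 167.64, 8

# P(height < 175) and P(height < 165)
prob_below_175 = stats.norm.cdf(175, mean, sd)
prob_below_165 = stats.norm.cdf(165, mean, sd)

# P(165 < height < 175) is the difference of the two areas
prob_range = prob_below_175 - prob_below_165
print(prob_range)  # roughly 0.45
```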

oldoc63 commented 2 years ago

We can also calculate the probability of randomly observing a value or greater by subtracting the probability of observing less than the given value from 1. This is possible because we know that the total area under the curve is 1, so the probability of observing something greater than a value is 1 minus the probability of observing something less than the given value.

Let's say we wanted to calculate the probability of observing a woman taller than 172 cm, assuming heights still follow the Normal(167.64, 8) distribution. We can think of this as the complement of observing a woman shorter than 172 cm.

https://www.evernote.com/shard/s468/sh/e8393975-2a93-9f75-ce9b-ed2cab127776/64580a2c2418b8fcb5e02bb2b829fc48

oldoc63 commented 2 years ago

We can use the following code to calculate the blue area by taking 1 minus the red area:
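The original snippet did not survive in these notes. A sketch of what it likely looked like, assuming the red area is P(height < 172) under Normal(167.64, 8) and using an illustrative variable name:

```python
import scipy.stats as stats

# Red area: P(height < 172); blue area: everything to the right of 172
prob_taller_172 = 1 - stats.norm.cdf(172, 167.64, 8)
print(prob_taller_172)  # roughly 0.29
```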

oldoc63 commented 2 years ago

The temperature in the Galapagos Islands follows a Normal distribution with a mean of 20 degrees Celsius and a standard deviation of 3 degrees.

Uncomment temp_prob_1 and set the variable to equal the probability that the temperature on a randomly selected day will be between 18 and 25 degrees Celsius using the norm.cdf() method.

Be sure to print temp_prob_1.
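One possible solution, assuming the Normal(20, 3) distribution stated above and scipy.stats imported as stats:

```python
import scipy.stats as stats

# P(18 < temperature < 25) = P(temperature < 25) - P(temperature < 18)
temp_prob_1 = stats.norm.cdf(25, 20, 3) - stats.norm.cdf(18, 20, 3)
print(temp_prob_1)  # roughly 0.70
```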

oldoc63 commented 2 years ago

Using the same information about the Galapagos Islands, uncomment temp_prob_2 and assign the variable to equal the probability that the temperature on a randomly selected day will be greater than 24 degrees Celsius.

Be sure to print temp_prob_2.
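One possible solution, again assuming Normal(20, 3) and using the 1 minus the cumulative probability approach described earlier:

```python
import scipy.stats as stats

# P(temperature > 24) = 1 - P(temperature < 24)
temp_prob_2 = 1 - stats.norm.cdf(24, 20, 3)
print(temp_prob_2)  # roughly 0.09
```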