Open solomonxie opened 6 years ago
It's also called Newcomb-Benford's Law, the Law of Anomalous Numbers, and the First-Digit Law.
It is an observation about the frequency distribution of leading digits in many real-life sets of numerical data.
The first digits of data entries in most real-world data sets are not uniformly distributed. The most common first digit is 1, followed by 2, and so on, with 9 being the least common first digit. This phenomenon is known as Benford's Law.
The leading digits in such a set thus have the following distribution:
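The original figure is not reproduced here. As a minimal sketch, assuming the usual statement of Benford's Law, P(d) = log10(1 + 1/d), the expected share of each leading digit can be computed like this (the function name is made up for illustration):

```python
import math

def benford_probability(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

# Print the expected share for every possible leading digit.
for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")
```

The nine shares add up to 100%, with digit 1 taking roughly 30% of them.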
Refer to Khan academy: Two-way tables Refer to Khan academy: Distributions in two-way tables Refer to Khan academy: Marginal distribution and conditional distribution
Refer to Mathbitsnotebook: Two-Way Frequency Tables
Two-way Table
A Two-way Table is a Joint distribution, in which the rows represent the categories of one variable and the columns represent the categories of another.
Marginal Distribution
A Marginal Distribution is simply an add-on to the joint distribution: a TOTAL row or column at the margins.
Conditional Distribution
A Conditional Distribution is the distribution of one variable (column) given a condition on another variable.
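A minimal sketch of the three ideas, using hypothetical counts (the "Field" row, the "Other" column and all four numbers are assumptions, picked so that the Pond-Maple cell gives the same percentages as the "Interpret the table" example in these notes):

```python
# Hypothetical joint distribution: rows are sites, columns are tree types.
counts = {
    "Pond":  {"Maple": 22, "Other": 15},
    "Field": {"Maple": 23, "Other": 20},
}

# Marginal distributions: the totals at the margins.
row_totals = {site: sum(cells.values()) for site, cells in counts.items()}
col_totals = {}
for cells in counts.values():
    for tree, n in cells.items():
        col_totals[tree] = col_totals.get(tree, 0) + n
grand_total = sum(row_totals.values())

# Conditional and joint shares for one cell.
cell = counts["Pond"]["Maple"]
row_pct = cell / row_totals["Pond"]      # share of the row total
col_pct = cell / col_totals["Maple"]     # share of the column total
total_pct = cell / grand_total           # share of all samples

print(f"Row %: {row_pct:.2%}, Column %: {col_pct:.2%}, Total %: {total_pct:.2%}")
```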
Trends in categorical data
Refer to Khan academy: Analyzing trends in categorical data Refer to Khan academy: Filling out frequency table for independent events
▶ Practice on Khan academy: Trends in categorical data
Interpret the table:
- Row %: shows what proportion the cell is of the Row Total. E.g., the cell Pond-Maple is 59.46% of all samples taken by the pond.
- Column %: shows what proportion the cell is of the Column Total. E.g., the cell Pond-Maple is 48.89% of all maple samples.
- Total %: shows what proportion the cell is of the Sample Total. E.g., the cell Pond-Maple is 27.5% of all samples.
Solve:
Solve:
P(makes 1st shot) = P(makes 2nd shot), and P(misses 1st shot) = P(misses 2nd shot), regardless of whether he makes or misses the 1st shot.
Solve:
These measures represent the centre of a distribution.
Refer to youtube: Mean, Median, and Mode: Measures of Central Tendency: Crash Course Statistics #3 Refer to wikipedia: Central tendency Refer to Khan lecture.
- Mean is just the average of all numbers listed.
- Median is the middle number in an ordered list (sort the values first and keep any duplicates). If there are two middle numbers, average them to get the median.
- Mode is the number that shows up most often in a list.
There are some common impacts:
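As a minimal sketch with made-up numbers, the three measures can be computed with Python's `statistics` module. It also shows one common effect: a single extreme value pulls the mean much more than the median.

```python
import statistics

scores = [2, 3, 3, 5, 7]           # made-up data
with_outlier = scores + [100]      # same data plus one extreme value

print(statistics.mean(scores), statistics.median(scores), statistics.mode(scores))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```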
Average
Average in statistics means something a bit different from just an arithmetic average.
Average: In stats, it means typical or middle, and can be represented in multiple ways:
- Arithmetic mean: Sum the numbers and divide by how many there are.
- Median: Sort the numbers and take the MIDDLE one.
- Mode: The number that repeats the most times in a dataset.
It's also called Box and whisker plots, or Five-number summary.
▶︎ Jump over to Khan academy for practice: Comparing data distributions
Refer to Khan academy: Reading box plots Refer to Khan academy: Interpreting box plots Refer to Maths is for fun: Quartiles
Quartiles are the values that divide a list of numbers into quarters:
or
The Interquartile Range (IQR) is the span from Q1 to Q3: IQR = Q3 - Q1.
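There are several conventions for quartiles. The sketch below assumes the "median of each half" convention (with an odd count, the median itself is left out of both halves); the function name and the data are made up for illustration.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, skip the middle value so it is in neither half.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

q1, q2, q3 = quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8])
print(q1, q2, q3, "IQR =", q3 - q1)
```

Other tools (spreadsheets, `statistics.quantiles`) may interpolate differently and give slightly different quartiles for the same data.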
Refer to Khan academy: Five-number summary
A Box and Whisker Plot can show all the important values.
Important values: the minimum, Q1, the median (Q2), Q3, and the maximum.
We can't read the mean value from a Box Plot. But from the position of Q2 (the Median) inside the box, we can guess the relationship between the Mean & Median:
- If Q2 is near the middle of the Interquartile box, it's probably a symmetric (e.g. Normal) Distribution, in which Median = Mean.
- If Q2 is near the left (lower) side of the Interquartile box, it's probably a Right Skewed Distribution, in which Median < Mean.
- If Q2 is near the right (upper) side of the Interquartile box, it's probably a Left Skewed Distribution, in which Median > Mean.
In the graph below, according to the Q2 position, we know that the distribution shape is skewed right.
In mathematics and statistics, deviation is a measure of difference between the observed value of a variable and some other value, often that variable's mean.
The standard deviation is sometimes loosely called Standard Variance; strictly, it is the square root of the variance.
Refer to Khan academy review: Calculating standard deviation step by step
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
(▲ where ∑ means "sum of", x is each value in the data set, μ (mu) is the mean of the data set, and N is the number of data points in the population.)
Steps:
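A minimal sketch of the step-by-step calculation, assuming the population formula described above (divide by N); the data values are made up for illustration:

```python
import math

def population_std(values):
    mean = sum(values) / len(values)                     # 1. find the mean
    squared_dists = [(x - mean) ** 2 for x in values]    # 2. squared distance of each value to the mean
    variance = sum(squared_dists) / len(values)          # 3. sum them, 4. divide by N
    return math.sqrt(variance)                           # 5. take the square root

print(population_std([6, 2, 3, 1]))
```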
Solve:
It's a Normal Distribution, in which the Standard Deviation is a good measure of the spread.
The Sample Variance, s², is used to calculate how varied a sample is, and it's useful for estimating the Population Variance.
Since the Sample Variance is an estimate, its formula is a bit different:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Why n-1?
Refer to Quora: Why is the formula of sample variance different from population variance?
"The sample variance is an estimator for the population variance. When applied to sample data, the population variance formula is a biased estimator of the population variance: it tends to UNDERESTIMATE the amount of variability."
To fix this underestimation, statisticians divide by n-1 instead of n, which relates to the idea of degrees of freedom (DF).
This equivalent formula is easier for calculating by hand:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$
Sample Standard Deviation
Solve:
The age of any gorilla in our sample is likely to be closer to the average of the 4 gorillas we looked at than to the average of all the gorillas in the zoo. Because of that, the squared deviations from the mean we calculated will probably underestimate the actual deviations from the population mean. To compensate for this underestimation, rather than simply averaging the squared deviations from the mean, we total them and divide by n-1.
The Mean absolute deviation (MAD) is the average of the absolute deviations.
The deviation is the distance from a value to the mean value.
It's used to describe how the values are laid out on the axis: are they close to each other or far apart?
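As a small sketch (function name and data are made up), the mean absolute deviation follows directly from that definition:

```python
def mean_absolute_deviation(values):
    mean = sum(values) / len(values)
    # Average the distances to the mean, ignoring their sign.
    return sum(abs(x - mean) for x in values) / len(values)

print(mean_absolute_deviation([2, 4, 6, 8]))
```

Here the mean is 5, the distances are 3, 1, 1 and 3, so the result is 8 / 4.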
It's easy but always confusing if you haven't yet totally understood it in the first place.
The very first thing to do when solving a probability problem is to CATEGORISE the problem and apply the matching formula.
The probability of an event A is often written as P(A), or more generally P(condition).
Common cases:
The formula Fav outcomes / Total outcomes only gives you the Theoretical probability.
But when you do an experiment, like flipping a coin 10,000 times, you may find that the experimental probability differs from the theoretical one (usually by less as the number of trials grows).
Example: Roll a die 100 times, how many times will you get a number greater than 3?
Answer: P(>3) × 100 = 3/6 × 100 = 50
So we expect it to happen about 50 times.
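To see theoretical and experimental results side by side, here is a minimal simulation sketch (the function name and the seed are arbitrary choices for illustration):

```python
import random

def count_greater_than_3(rolls: int, seed: int = 0) -> int:
    rng = random.Random(seed)      # seeded so the run is repeatable
    return sum(1 for _ in range(rolls) if rng.randint(1, 6) > 3)

expected = 3 / 6 * 100             # theoretical count for 100 rolls
observed = count_greater_than_3(100)
print(expected, observed)
```

The observed count will usually land near 50 but rarely exactly on it; with many more rolls the observed proportion gets closer to 3/6.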
To understand probability, we really need to differentiate independent events and dependent events.
Khan lecture: Compound probability of independent events.
Coin flips are INDEPENDENT events:
What happens in the first flip in no way affects what happens in the second flip.
And this is actually one thing that many people don't realise.
Some people think that if they get a bunch of heads in a row, then all of a sudden it becomes more likely to get tails on the next flip.
THAT IS NOT THE CASE.
Every flip is an independent event. What happened in the past in these flips does not affect the probabilities going forward.
A simple brute-force method: draw a table or a tree that shows every possible outcome, and pick out all the favourable results.
Refer to Wiki: Sample Space Refer to article: Sample Space Examples and The Counting Principle
The sample space of an experiment is all the possible outcomes for that experiment.
(Rolling Two dice)
(52 card deck)
The number of outcomes is also called the Size of the Sample Space.
To count it, simply MULTIPLY.
The Fundamental Counting Principle: If there are a ways for one event to happen, and b ways for a second event to happen, then there are a * b ways for both events to happen.
Sample problem: If shoes come in 6 styles with 3 possible colors, how many varieties of shoes are there? All you need to do is multiply: 6 • 3 = 18 possible varieties of shoes.
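A short sketch of both ideas: listing a sample space by brute force and checking the multiplication rule. The style and color labels are made up; only the counts (6 and 3) come from the sample problem.

```python
from itertools import product

# Sample space for rolling two dice: every (first, second) pair.
dice = list(product(range(1, 7), repeat=2))
print(len(dice))                                  # 6 * 6 outcomes

# Shoes: 6 styles times 3 colors (labels are hypothetical).
styles = [f"style{i}" for i in range(1, 7)]
colors = ["red", "green", "blue"]
print(len(list(product(styles, colors))))         # 6 * 3 varieties

# Pick out the favourable outcomes, e.g. the two dice sum to 7.
favourable = [roll for roll in dice if sum(roll) == 7]
print(len(favourable), "of", len(dice))
```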
First to notice that, it's ONE event.
Aside from probability, Permutations and Combinations are essential tools for statistics.
They answer the question: how many groups are there if we choose some items from a larger set?
▶︎ Back to previous note on: Intro to probability.
▶︎ Omni Permutation Calculator
▶︎ Omni Combination Calculator
Refer to article: Easy Permutations and Combinations Refer to article: Permutations And Combinations Simplified Refer to article: Combinations vs Permutations
HOW MANY groups do we get if we choose a number of things from the total things?
e.g., how many groups would there be if we choose 3 people from 9 people?
Permutations and combinations both count the total number of groups.
We've got TWO ways to count:
- Permutations: the order matters.
- Combinations: the order doesn't matter.
Combinations can be seen as FILTERED permutations, which filter out all the "duplicates", or "over-counted items".
e.g., we get different groups (Permutations) such as "123, 132, 231, 213, 312, 321"; once we filter out the over-counted items, the combination is just one: 123.
It's all the possible ways to arrange/order elements in a list.
$$P(n, k) = \frac{n!}{(n-k)!}$$
(Read as N pick K)
Notice: possibilities ≠ probabilities
e.g., what are the possibilities for arranging the three numbers 1, 2, 3?
It could be: 123, 132, 231, 213, 312, 321, so the answer is 6 possible ways.
To count that algebraically, it'd be 3*2*1, so the answer is 6 possible ways.
How do we do this?
There are 3 possible choices for the 1st position, leaving 2 leftovers. Then the 2nd position has 2 possible choices, leaving 1 leftover. So the 3rd position has 1 possible choice.
Thinking about it logically, we MULTIPLY them together to get ALL POSSIBLE WAYS: 3*2*1.
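The same count can be checked in Python, as a small sketch using the standard library (`math.perm` needs Python 3.8 or newer):

```python
import math
from itertools import permutations

# List every ordering of the three digits.
arrangements = ["".join(p) for p in permutations("123")]
print(arrangements)

print(math.factorial(3))   # 3 * 2 * 1
print(math.perm(9, 3))     # ordered ways to pick 3 people out of 9: 9 * 8 * 7
```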
A Combination is a collection of elements in which the order DOESN'T matter.
Starting from the permutations, we filter out the repeated groups by dividing by k! to get the combinations.
$$C(n, k) = \frac{n!}{k!\,(n-k)!}$$
(Read as N choose K)
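As a sketch of the "filtered permutations" idea, here is the earlier question of choosing 3 people from 9, counted three ways (`math.perm` and `math.comb` need Python 3.8 or newer):

```python
import math
from itertools import combinations

by_filtering = math.perm(9, 3) // math.factorial(3)   # permutations divided by k!
by_formula = math.comb(9, 3)                          # n choose k
by_listing = len(list(combinations(range(9), 3)))     # brute-force listing

print(by_filtering, by_formula, by_listing)
```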
Refer to wiki: Set Refer to Khan academy: Basic set operations
If B is a set and x is one of the objects of B, this is denoted x ∈ B, and is read as "x belongs to B", or "x is an element of B". If y is not a member of B then this is written as y ∉ B, and is read as "y does not belong to B".
If every member of set A is also a member of set B, then A is said to be a subset of B, written A ⊆ B (also pronounced A is contained in B). Equivalently, we can write B ⊇ A, read as B is a superset of A, B includes A, or B contains A.
The empty set is a subset of every set and every set is a subset of itself: ∅ ⊆ A and A ⊆ A.
Every set is a subset of the universal set: A ⊆ U.
Basic Set Operations
Examples:
Basic properties of intersections:
Examples:
Basic properties of unions: A ∪ B = B ∪ A. A ∪ (B ∪ C) = (A ∪ B) ∪ C. A ⊆ (A ∪ B). A ∪ A = A. A ∪ U = U. A ∪ ∅ = A. A ⊆ B if and only if A ∪ B = B.
Two sets can also be "subtracted".
The relative complement of B in A (also called the set-theoretic difference of A and B), denoted by A \ B (or A − B), is the set of all elements that are members of A but not members of B.
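Python's built-in `set` type mirrors these operations, so the definitions can be tried out directly. A small sketch with made-up sets:

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

print(A | B)          # union: members of A or B
print(A & B)          # intersection: members of both
print(A - B)          # relative complement of B in A (A \ B)
print({1, 2} <= A)    # subset test
print(set() <= A)     # the empty set is a subset of every set
```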
Examples:
U \ E = E′ = O.
Basic properties of complements:
Solve:
Refer to Khan academy: Creating a histogram
Instead of plotting individual dots, a Histogram puts the data into BUCKETs (intervals).
Instead of each bucket's absolute count, sometimes it's better to show each bucket's percentage of the total, which is what Relative Frequency gives us.
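A minimal sketch of bucketing and relative frequency; the ages and the bucket width of 10 are made up for illustration:

```python
from collections import Counter

ages = [3, 7, 12, 15, 18, 21, 24, 25, 33, 38]   # made-up data
bucket_width = 10

# Map each value to the start of its bucket, then count per bucket.
counts = Counter((age // bucket_width) * bucket_width for age in ages)

for start in sorted(counts):
    share = counts[start] / len(ages)            # relative frequency
    print(f"{start}-{start + bucket_width - 1}: count={counts[start]}, relative={share:.0%}")
```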
Refer to Khan academy review: Stem and leaf plots review
Both the Stem and Leaf columns represent digits (place values) of the numbers.
In the case below, the stem shows the tens place digit, and the leaf shows the ones place digit.
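A small sketch of how such a plot can be built for two-digit numbers (function name and data are made up):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem) and list their ones digits (leaves)."""
    plot = defaultdict(list)
    for value in sorted(values):
        plot[value // 10].append(value % 10)
    return dict(plot)

for stem, leaves in stem_and_leaf([12, 15, 21, 27, 27, 34]).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```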
Refer to Khan academy: Example: Describing a distribution
Refer to Khan academy: Classifying shapes of distributions
- Normal Distribution (Symmetric distribution)
- Left Skewed Distribution
- Right Skewed Distribution
- Uniform
- Bimodal Distribution
Refer to Crash course: Measures of Spread: Crash Course Statistics #4
- Range: (Highest value - Lowest value)
- IQR: (Q3-Q1)
- Standard Deviation: σ (sigma)
- Mean absolute deviation (MAD)
- Mean is just the average of all numbers listed.
- Median is the middle number in an ordered list (sort the values first and keep any duplicates). If there are two middle numbers, average them to get the median.
- Mode is the number that shows up most often in a list.
Refer to Khan academy: Judging outliers in a dataset
In statistics, an outlier is an observation point that is distant from other observations.
That being said, outliers in a graph are the MINORITY of the values.
Outliers are the values that fall outside the Fences, where the Upper fence and Lower fence are:
Upper fence = Q3 + 1.5 × IQR
Lower fence = Q1 − 1.5 × IQR
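A minimal sketch of fence-based outlier detection, assuming the common 1.5 × IQR rule. The quartiles here come from `statistics.quantiles` (Python 3.8+), whose default method may differ slightly from hand-calculated quartiles; the data is made up.

```python
import statistics

def find_outliers(values):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

print(find_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 50]))
```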
We've got different ways to describe the spread, centre and deviation, so we need a strategy for deciding which to use in different cases.
- For a roughly symmetric distribution without outliers: Mean as centre, Standard Deviation as spread.
- For a skewed distribution or one with outliers: Median as centre, IQR as spread.
Khan lecture: Shape for distributions.
Khan lecture 2: Clusters, gaps, peaks & outliers.
It's also called the Unbiased estimate of population variance.
Refer to Khan academy: Sample variance
For a large population, it's impossible to get all the data. So we take a number of samples and calculate their variance.
The formula for Sample Variance is a slight twist on the population variance: subtract 1 from the divisor, so that the variance comes out slightly bigger.
It seems like some voodoo, but it's reasonable. If we use the population variance formula for sample data, it tends to underestimate the true variance.
That's why for sample variance we make a small change to the previous formula.
Refer to Khan academy: Review and intuition why we divide by n-1 for the unbiased sample variance Refer to Khan academy: Why we divide by n-1 in variance Refer to Khan academy: Simulation showing bias in sample variance Refer to Khan academy simulation: Unbiased Estimate of Population Variance
Simulation for different variance formulas with true variance:
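The original simulation figure is not available here. As a rough stand-in, this sketch draws many small samples from a population whose true variance is 1 and averages both estimates; the sample size, trial count and seed are arbitrary assumptions.

```python
import random

def average_estimates(sample_size=5, trials=20000, seed=1):
    """Average the divide-by-n and divide-by-(n-1) estimates over many samples."""
    rng = random.Random(seed)
    biased_total = unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]   # true variance is 1
        mean = sum(sample) / sample_size
        squared = sum((x - mean) ** 2 for x in sample)
        biased_total += squared / sample_size            # population formula
        unbiased_total += squared / (sample_size - 1)    # sample formula
    return biased_total / trials, unbiased_total / trials

biased, unbiased = average_estimates()
print(f"divide by n: {biased:.3f}, divide by n-1: {unbiased:.3f}, true: 1.0")
```

With samples of 5, dividing by n averages out to roughly 4/5 of the true variance, while dividing by n-1 averages out close to it.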
Before you start, you probably need to know: explanations of percentiles are quite confusing, and they differ from teacher to teacher and from website to website, because there is NO standard definition of percentile.
Percentiles tell you what PERCENTAGE of the population has a value that's LOWER than yours.
▶︎ Jump over to have practice: Calculating percentiles
Refer to Khan academy: Calculating percentile Refer to youtube: Percentiles - Introductory Statistics Refer to youtube: Percentile Refer to textbook [PDF]: PERCENTILES AND PERCENTILE RANKS Refer to wikipedia: Percentile Refer to wikipedia: Percentile rank Refer to mathisfun: Percentiles Refer to pbarrett: percentiles (PDF) Refer to varsity tutors: percentiles
A percentile is the value BELOW which a given percentage of the observations fall. E.g., the 20th percentile is the value below which 20% of the observations may be found.
Percentiles run from the 1st to the 100th, where the 100th percentile means the largest value in the set. According to wiki, there COULD be decimal percentiles such as the 0.13th percentile or the 2.28th percentile.
For example, if your doctor tells you that your height is AT the 83rd percentile of the population, it means 83% of people are shorter than or equal to you:
Interquartile:
Deciles:
Deciles divide the data into 10 equal sections, which correspond to the 10th, 20th, 30th,...90th percentiles.
Percentile rank usually comes up when you're asked to find which percentile a given value is at.
i.e., Percentile ranks are commonly used to clarify the interpretation of scores on standardized tests.
e.g., you're asked what the percentile rank of the number 79 in a list is, and the answer might be "Its rank is 90, because it's at the 90th percentile."
Percentiles and Percentile Ranks are highly similar (and confusing) statistics.
- Percentiles are used to determine where to draw the line between observed values within the distribution. (e.g., a teacher wants to divide his class in half according to students' scores, and he needs to find out which score is AT the 50th percentile so that he can divide them.)
- Percentile rank is kind of the reverse process: it is used to determine where a particular score or value fits within a broader distribution. (e.g., a student receives a score of 75 out of 100 on an exam and wishes to determine which percentile he is at compared to the rest of the class.)
The process of calculating percentiles is actually manipulating the indexes of the number list. It's like calculating a pointer: finding the right pointer will lead you to the number, regardless of what number it is.
There are a few methods for calculating percentiles:
- Interquartile method: For 25th = Q1, 50th = Q2 (or Median), 75th = Q3.
- The nearest-rank method: The most often used method.
- The linear interpolation between closest ranks method
- The weighted percentile method
$$\text{Index} = \left\lceil \frac{P}{100} \times \text{Amount} \right\rceil$$
(Index is the position of the value at the given percentile, P is the percentile, Amount is the number of values in the list)
To cut down confusion, we use index instead of the Rank used in textbooks, which refers to the "ordinal rank", not the "percentile rank".
Take a list of 12 numbers, {a,b,c,d,e,f,g,h,i,j,k,l}.
The 80th percentile relates to 80% of the AMOUNT of the list: 80% × 12 = 9.6, where 9.6 is the index of the number in the list.
But the index must be a whole number, and by the definition of percentile the number must be equal to or above 80% of all values.
That means we round the index up from 9.6 to 10, which is the 10th number in the list.
So the 10th number in the list is AT the 80th percentile, regardless of what number it is.
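The index calculation above can be sketched as a small function. This assumes the nearest-rank method (round the index up) and an already sorted list; the function name is made up.

```python
import math

def nearest_rank_percentile(sorted_values, percentile):
    """Value at `percentile` (0 < percentile <= 100) by the nearest-rank method."""
    # 1-based index: percentage of the AMOUNT of values, rounded up.
    index = math.ceil(percentile * len(sorted_values) / 100)
    return sorted_values[index - 1]

letters = list("abcdefghijkl")
print(nearest_rank_percentile(letters, 80))   # index 9.6 rounds up to the 10th item
```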
Consider the ordered list {15, 20, 35, 40, 50}, which contains 5 data values. What are the 5th, 30th, 40th and 100th percentiles of this list using the nearest-rank method?
Refer to wiki: Worked examples of the nearest-rank method
Solve:
- The 5th percentile: 5% * 5 = 0.25, which means 5% of the five numbers are below the number whose index is 0.25. Round up to index 1, so the value is 15.
- The 30th percentile: 30% * 5 = 1.5, which means 30% of the five numbers are below the number whose index is 1.5. Round up to index 2, so the value is 20.
- The 40th percentile: 40% * 5 = 2, which means 40% of the five numbers are below the number whose index is 2, so the value is 20.
- The 100th percentile: is the LAST number in the list, which is 50.
We use the same formula from calculating percentiles:
Instead of inputting the percentile to get the index, we input the index and get the percentile rank.
Suppose the scores of a set of students in a math test are {20, 30, 15, 75}.
What is the percentile rank of the score 30?
Solve:
- Sort the list: {15, 20, 30, 75}.
- 30 is the third number, so its index is 3.
- Percentile rank = 3 / 4 × 100 = 75.
So the Percentile rank for the number 30 is 75, which means it's at the 75th percentile.
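The reverse calculation as a sketch, assuming the "less than or equal to" convention used in this example (other definitions of percentile rank exist); the function name is made up:

```python
def percentile_rank(values, score):
    """Percent of values that are less than or equal to `score`."""
    index = sum(1 for v in values if v <= score)   # position of the score in the sorted list
    return index / len(values) * 100

print(percentile_rank([20, 30, 15, 75], 30))
```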
Refer to Khan academy: Analyzing a cumulative relative frequency graph
Z stands for Standard Normal Distribution. It's fairly important in real life: Japan uses the Z-score on exams to estimate each student's study skills.
Z-score is the essential concept of Z-Statistics.
▶︎ Jump over to have practice: Comparing with z-scores
Refer to Wiki: Standard score Refer to Khan academy: Z-score introduction Refer to youtube: Why Do We Need z Scores Refer to youtube: Statistics 101: Understanding Z-scores Refer to Crash Course: Z-Scores and Percentiles: Crash Course Statistics #18 Refer to youtube: z-score Calculations & Percentiles in a Normal Distribution
Z-score is all about comparison: comparing different kinds of data sets.
In other words, a Z-score indicates How many standard deviations away (above or below) from the mean a given point is.
"Z-scores in general allow us to compare things that are NOT in the same scale, as long as they are NORMALLY distributed." - CrashCourse
For example, even though we know everyone's score, just by looking at those scores it's hard to know how good or bad someone is compared to anyone else in the dataset. E.g., if most of the students score above 90, can we say someone who scores 90 is good?
So the Z-score gives a solution for this: compare the score to the "average".
Z-score is especially good for comparing different types of data, e.g., comparing a 100-point exam & a 150-point exam, IELTS & TOEFL, apples & oranges, a baseball player & a football player....
All in all, Z-score is a process of Normalization, which "normalizes" different sets of data to the same standard so they can be compared.
Compares the various grading methods in a normal distribution:
By comparing each one's score with the mean, x - μ, we get a kind of deviation.
But at this point we still don't know whether each one's deviation is big or small.
We need a "standard" to compare each deviation against.
Just like the mean is the average of all scores, the standard deviation is (roughly) the average amount of deviation of all scores, which tells us whether each deviation is large or not.
So we want to compare each deviation with the Standard deviation: deviation ÷ 𝜎
And we get the whole picture:
Standard Score = (𝓍 - μ) / 𝜎
Assume the standard deviation is 𝜎 (sigma); the number in front of it just says how much it is scaled.
e.g., 2𝜎 means a doubled standard deviation, and 1.5𝜎 means 1.5 times the SD.
If your Z-score is 2, it means your score is two standard deviations (2𝜎) away from the mean.
There's some exam data of a class:
Here's their z-scores:
Solve:
(20-22)/5 = -0.4
(33-38)/12.5 = -0.4
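The two calculations above as a tiny sketch (the function name is made up; the numbers are the ones from the Solve lines):

```python
def z_score(value, mean, std_dev):
    """How many standard deviations `value` lies from `mean`."""
    return (value - mean) / std_dev

print(z_score(20, 22, 5))
print(z_score(33, 38, 12.5))
```

Both results are the same, so the two scores sit equally far below their own means even though the raw scales differ.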
This ONLY applies to the Normal Distribution.
Refer to Khan academy: Standard normal table for proportion below
If you know someone's z-score, you can easily get his percentile from the Z-table.
Vice versa, if you know his percentile, you can get his z-score as well.
How to use?
The 1st Column (the row labels) represents the ones and tenths digits of the z-score, and the 1st Row (the column headers) represents the hundredths digit of the z-score.
Take the given z-score and search over the rows & columns to get the corresponding intersection, which is the percentile.
e.g., someone's z-score is "0.57", and you want to know what percentile he's at, or what proportion is below his score.
Just go over to the z-table, first get to the row at 0.5, then find the column of 0.07, and the intersection will be his percentile, which is "0.7157" or "71.57%" in this case.
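Without a printed table, the same lookup can be sketched with `statistics.NormalDist` from the standard library (Python 3.8+), whose `cdf` gives the proportion below a value:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of a standard normal distribution below z = 0.57.
proportion_below = standard_normal.cdf(0.57)
print(f"{proportion_below:.4f}")
```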
Common values:
Explicit Z-table:
Solve:
- The z-score of student Faisal: (103.1-105)/10 = -0.19.
- From the Z-table we'll get the corresponding percentile rank: 0.4247.
Solve:
- The z-scores: (82-83.2)/8 = -0.15, (89.2-83.2)/8 = 0.75
- The corresponding percentiles: 0.4404 & 0.7734
- 0.7734 - 0.4404 = 0.333
Refer to Khan academy: Finding z-score for a percentile
Just go the other way around: look for the cell with the given percentile, then read off the corresponding row & column, which gives you the z-score.
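The reverse lookup can be sketched with `NormalDist.inv_cdf` (Python 3.8+). Note that it returns more decimals than a printed table, where the nearest cell typically gives 1.64 or 1.65 for the 95th percentile.

```python
from statistics import NormalDist

# z-score below which 95% of a standard normal distribution lies.
z = NormalDist().inv_cdf(0.95)
print(round(z, 3))
```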
Solve:
- The minimum percentile rank is at 95, which is 0.95 as a proportion.
- Look up the z-score according to the percentile: 1.65
- Apply the z-score formula: 1.65 = (x-66000)/21000
- x=100650, which is the minimum annual profit.
In statistics, the Population is the collection of all people, items, or objects that are required for a specific study.
It's also called the Population parameter.
The word parameter in Statistics means something different than in Mathematics.
It is the number that describes the population.
It is usually estimated from a statistic which is calculated from a randomly selected sample of the given population.
Common population parameters:
The word parameter often refers to the Population statistic, e.g., population mean, population SD.
The word statistic, although it generally refers to a fact about the data, often refers to the Sample statistic, e.g., sample mean, sample proportion.
(To be written...)
Refer to Khan academy: How parameters change as data is shifted and scaled
We see that:
Solve:
5/9 * (104-32) = 40
5/9 * 2 = 1.11
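A sketch of the rule behind this arithmetic: subtracting a constant shifts the centre but not the spread, while multiplying scales both. The temperature readings are made up; the °F to °C conversion mirrors the Solve above.

```python
import statistics

fahrenheit = [100, 102, 104, 106, 108]              # made-up readings
celsius = [5 / 9 * (f - 32) for f in fahrenheit]    # shift by 32, then scale by 5/9

print(statistics.mean(fahrenheit), statistics.pstdev(fahrenheit))
print(statistics.mean(celsius), statistics.pstdev(celsius))
```

The mean goes through the full conversion (shift and scale), while the standard deviation is only multiplied by 5/9.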
Sometimes histograms aren't good enough to visualize a large dataset. A Density Curve plot solves the problem, since the data can take on any value in a continuum instead of being thrown into buckets.
Axes (Tricky):
- X-axis represents the values of the data points.
- Y-axis represents the density, so that the area over an interval gives the proportion of data in it, up to 1 (or 100%) in total.
The entire AREA under the curve is 100%, which represents all the data points.
The percentage of data points in an interval is the AREA under the curve over that interval, NOT the height of a point.
For a Symmetric distribution, the Median is right at the middle, which is at the 50th percentile.
For a Skewed distribution, the Median is where [Left side area] = [Right side area] = 50%.
For a Symmetric distribution, the Mean is right at the middle:
For a Skewed distribution, the Mean is to the right or left of the Median:
What is the height of median?
Solve:
Solve:
100%
Solve:
Solve:
The area over [3,5] is a triangle.
1/2 * (5-3)*0.6 = 0.6 = 60%
This rule ONLY applies to the Normal Distribution.
It's also called the 68-95-99.7% rule, because for a normal distribution:
- about 68% of the data falls within 1 standard deviation of the mean,
- about 95% falls within 2 standard deviations,
- about 99.7% falls within 3 standard deviations.
Solve:
- The z-score of point 32.2 is (32.2-20.5)/3.9=3, which is 3 standard deviations away from the mean.
- From the empirical rule graph, we get the percentage more than 3𝜎 above the mean: 0.15%
Solve:
- From the z-score formula, the two points' z-scores are: -1 & 2
- -1𝜎 & 2𝜎 represent the 16th percentile & 97.5th percentile.
- 97.5% - 16% = 81.5%
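The rule's percentages are rounded. As a sketch, the more exact areas can be checked with `statistics.NormalDist` (Python 3.8+), so small differences from the hand calculation above are expected:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, SD 1

for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)          # area within k standard deviations
    print(f"within {k} SD: {within:.1%}")

between = z.cdf(2) - z.cdf(-1)             # area between -1 SD and +2 SD
print(f"between -1 SD and +2 SD: {between:.1%}")
```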
A scatterplot just plots many dots on the X-Y plane.
If you can fit a LINE through those points, it's a linear relationship. If not, then it's non-linear.
If the scatterplot has a linear relationship:
Bivariate is just a fancy way to say that every data point has two variables, x & y. E.g., the point (2,3) has x-position 2 and y-position 3; we can analyze the x-values of all data points, the y-values of all data points, and how the two relate.
Refer to Khan academy: Bivariate relationship linearity, strength and direction
The Correlation Coefficient is related to the SLOPE: it's kind of an adjusted measure that describes how well a line fits the data. It's also kind of like a "Unit SLOPE" of the estimated Regression Line (the slope when x and y are both measured in standard deviations).
Refer to youtube: The Correlation Coefficient - Explained in Three Steps Refer to Khan academy: Correlation coefficient intuition Refer to Khan academy: Calculating correlation coefficient r
The Correlation Coefficient is represented by the letter r.
The interval of r is -1 ≤ r ≤ 1:
- r=1 or r=-1 when the line fits ALL data points. The better the line fits the data points, the closer r is to 1 or -1.
- r=0 when there's NO correlation or linear relationship. The "worse" the line fits the data, the closer r is to 0.
Formula:
$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
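A minimal sketch of calculating r by hand, assuming the usual sample form (the average product of z-scores, using sample standard deviations and n - 1); the four points are made up:

```python
import statistics

def correlation(xs, ys):
    """Pearson's r as the average product of z-scores (divide by n - 1)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

print(round(correlation([1, 2, 2, 3], [1, 2, 3, 6]), 3))
```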
Least-square Regression is one way of calculating Linear Regression. Most regressions' calculations are done by computer, but we want to do that by hand to have better understanding.
What is Linear Regression? Trying to fit a line as closely as possible, and as many of points as possible, is called "Linear Regression".
Refer to Khan academy: Introduction to residuals and least-squares regression
Residuals are errors. More specifically, they are the differences between the actual value of the response variable and the value predicted by the least squares regression line.
At a certain X-position, the value of residual is the VERTICAL DISTANCE from the actual value to the Regression Line.
- If the residual is positive, the actual point is ABOVE the regression line.
- If the residual is negative, the actual point is BELOW the regression line.
The way we calculate the Regression Line with the Least Squares method is to MINIMIZE the sum of the squared residuals.
Solve:
Solve:
-47 + 2*40 = 33
residual = observed - expected = 29 - 33 = -4
▶ Practice at Khan academy: Calculating the equation of the least-squares line
Refer to Khan academy: Calculating the equation of a regression line
Formula of Regression line: ŷ = mx + b
- The slope: m = r × (Sy/Sx). The Correlation Coefficient r is kind of like the Unit Slope, which is between -1 and 1, so we apply the unit slope to the real case by multiplying r by the ratio of the Standard Deviations of y & x, which is Sy/Sx.
- The line must pass through the mean point (x̄, ȳ). At the mean of x, the predicted value equals the mean of y.
With the two pieces of information above, we can easily calculate the estimated Regression Line.
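Putting the two facts together as a sketch (slope from r and the standard deviations, intercept from the mean point); the function name and the four points are made up:

```python
import statistics

def least_squares_line(xs, ys):
    """Return (slope, intercept) from r, the SDs, and the mean point."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    n = len(xs)
    r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x                    # unit slope scaled by Sy/Sx
    intercept = mean_y - slope * mean_x      # forces the line through (mean_x, mean_y)
    return slope, intercept

slope, intercept = least_squares_line([1, 2, 2, 3], [1, 2, 3, 6])
residual = 6 - (slope * 3 + intercept)       # observed - predicted at the point (3, 6)
print(slope, intercept, residual)
```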
Solve:
Refer to Khan academy: Residual plots
Linear model Residual plot:
Non-linear model Residual plot:
R-squared (the coefficient of determination) is built from the Squared Residuals, i.e. the SE (Squared Error).
R squared is ALWAYS between 0 and 1, and the higher your R squared, the better.
Refer to Khan academy: R-squared or coefficient of determination
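R squared is the proportion of the variation of y that is explained by your linear model.
R-squared = Explained variation / Total variation
$$r^2 = 1 - \frac{SE_{line}}{SE_{\bar{y}}}$$
(SE_line is the Squared Error from the line, SE_ȳ is the Squared Error from the mean of y)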
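A small sketch of that ratio, assuming SE here means the sum of squared errors; the points and the line ŷ = 2.5x - 2 are made up for illustration:

```python
def r_squared(xs, ys, slope, intercept):
    """1 - (squared error from the line) / (squared error from the mean of y)."""
    mean_y = sum(ys) / len(ys)
    se_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    se_mean = sum((y - mean_y) ** 2 for y in ys)
    return 1 - se_line / se_mean

print(r_squared([1, 2, 2, 3], [1, 2, 3, 6], slope=2.5, intercept=-2))
```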
- The SE (Squared Error) from the line is small -> r² close to 1 -> The line is a good fit.
- The SE (Squared Error) from the line is large -> r² close to 0 -> The line is not a good fit.
Squaring is just a way to keep those residuals (differences from the regression line) positive.
And for our purpose, the choice between residuals and squared residuals matters less than the goal: in both cases we want to make the total error as small as possible (squaring simply penalizes large residuals more).
(To do...)
By adding them up we get the TOTAL ERROR, which is what we're going to minimize.
It's also called the Root Mean Square Deviation (RMSD), or Standard Deviation of the Residuals.
This method measures how well the Regression Line fits the data.
Refer to Khan academy: Standard deviation of residuals or root mean square deviation (RMSD)
Refer to Khan academy: Using least squares regression output Refer to Khan academy: Confidence interval for the slope of a regression line
Prerequisite:
For different purposes, we're to use different methods of study.
Refer to Khan academy: Types of statistical studies
Types of Statistical Studies:
- Sample Study: Sample a portion of a LARGE POPULATION and study it.
- Observational Study: WITHOUT affecting them, closely observe the whole (small) population.
- Experiments: RANDOMLY divide samples into a Control Group and a Treatment Group, and compare the 2 groups, of which one is AFFECTED and the other NOT AFFECTED.
The response variable is the focus of a question in a study or experiment.
An explanatory variable is one that explains changes in the response variable. It can be anything that might affect the response variable.
- Observational study: Measure or survey members of a sample WITHOUT trying to affect them.
- Controlled experiment: Apply some treatment to one of the groups, while the other group does not receive the treatment.
- Only a controlled experiment can suggest causation.
- Correlation is WEAKER than causation.
- An experiment needs a treatment group and a control group.
- A sample study needs only a part of the relevant members; an Observational study needs ALL members.
"Humans are famously bad at truly random." - Sal Khan
Refer to Khan academy: Techniques for generating a simple random sample Refer to Khan academy: Techniques for random sampling and avoiding bias
Methods of Random sampling: simple random sampling, stratified sampling, and cluster sampling.
Refer to Khan academy: Techniques for generating a simple random sample Refer to Wiki: Simple random sample
Solve:
Stratified sampling: Divide the population into a couple of groups, and take samples from EACH group.
Refer to Wiki: Stratified sampling
Cluster sampling: Divide the population into a couple of groups, and randomly take a few whole GROUPS as the sample.
Refer to Wiki: Cluster sampling
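The three methods side by side, as a rough sketch. The population (4 classes of 5 students), the sample sizes and the seed are all hypothetical.

```python
import random

rng = random.Random(42)   # seeded so the run is repeatable

# Hypothetical population: classes A-D with 5 students each.
classes = {c: [f"{c}{i}" for i in range(1, 6)] for c in "ABCD"}
everyone = [s for members in classes.values() for s in members]

# Simple random sample: any 4 students, every student equally likely.
simple = rng.sample(everyone, 4)

# Stratified sample: take 1 student from EACH class.
stratified = [s for members in classes.values() for s in rng.sample(members, 1)]

# Cluster sample: randomly pick 2 whole classes and take everyone in them.
chosen_classes = rng.sample(sorted(classes), 2)
cluster = [s for c in chosen_classes for s in classes[c]]

print(simple, stratified, cluster, sep="\n")
```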
Refer to Khan academy: Random sampling vs. random assignment (scope of inference)
It means that the sample was selected in such a way where each member and set of members has an equal chance of being in the sample.
▶︎ Jump over to Khan academy for practice: Bias in samples and surveys
Refer to Khan academy article: Identifying bias in samples and surveys
Response bias: It occurs when people systematically give wrong answers.
Non-response: It is when people chosen for the sample can't be contacted or refuse to answer.
Convenience sampling: The researcher chooses samples that are easiest to reach.
Undercoverage: It occurs when some members of the population are left out of the sampling frame.
Voluntary response sampling: The researcher gives an open invitation and people decide whether to be in the sample or not.
Wording bias: Misleading people with biased words or phrases.
WITHOUT affecting them, deeply observe whole (small) population. The key is to observe.
Refer to Khan academy: Worked example identifying observational study
An Observational study DOES NOT establish a CAUSAL RELATIONSHIP; it only tells you whether one variable is correlated with another or not.
RANDOMLY divide samples to a Control Group and a Treatment Group, and compare 2 groups of which one is AFFECTED and another one NOT AFFECTED.
Refer to Khan academy: Introduction to experiment design Refer to EUPATI: Clinical trial designs
The purpose is to establish a CAUSAL RELATIONSHIP, which might tell you that one event can cause another event, something an observational study can't tell.
The key is to divide the two groups randomly, so that you will know how much impact the treatment really makes.
Two groups:
There're a few keys to conduct a good experiment:
A placebo is a "fake medicine", typically a sugar pill.
In drug testing and medical research, it's a very common way to measure how much the patient's expectations alone affect the outcome.
For a drug experiment, we randomly separate people into two groups:
Blind Experiment: None of the observed people know which group they're in.
Double Blind Experiment: Not only the observed people, but also the conductors/administrators don't know which is which.
Triple Blind Experiment: Even the people who analyze the data don't know which group they're analyzing.
It's a great way to avoid BIAS.
Sometimes complete randomization leaves the groups uneven, which introduces bias into the experiment. E.g., one group may end up with more women than the other, or with more young people, and either imbalance can strongly affect the result.
To handle this, we can use improved designs for forming the groups:
Block Design
Cross Over Design
Matched Pairs Design: It is a special case of a randomized block design.
With a randomized block design, the experimenter divides subjects into subgroups called blocks, such that the variability within blocks is less than the variability between blocks. Then, subjects within each block are randomly assigned to treatment conditions. Compared to a completely randomized design, this design reduces variability within treatment conditions and potential confounding, producing a better estimate of treatment effects.
It simply means "switching groups": some time after the first experiment, a second one is run in which the people from the Control Group switch to the Treatment Group, and vice versa.
Note: the Khan academy video named "matched pairs design" actually describes a "Crossover Design". Refer to Khan academy: Crossover Design ~~(Matched pairs experiment design)~~
In the matched-pair design, participants are first matched in pairs according to certain characteristics. Then, each member of a pair is randomly assigned to one of the two different study subgroups. This allows comparison between similar study participants who undergo different study procedures.
"A very important idea, in science in general... Other people should be able to replicate and reinforce this experiment and hopefully get the consistent result" - Sal Khan
The experimental probability should get closer and closer to the theoretical probability as the number of trials grows.
Theoretical Probability: It's what's expected to happen based on the possible outcomes, assuming equally likely outcomes.
Experimental Probability: It's the result observed in an experiment or simulation repeated a large number of times.
The threshold: if the probability of an event is less than 5%, the event is conventionally called significant.
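The convergence of experimental probability toward theoretical probability can be sketched with a small simulation. This is only an illustration: the fair coin, the function name `experimental_probability`, and the fixed seed are assumptions, not part of the original notes.

```python
import random

def experimental_probability(trials, seed=0):
    """Fraction of heads observed in `trials` simulated fair-coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

# The theoretical probability of heads is 0.5; more trials should land closer to it.
for n in (10, 1_000, 100_000):
    print(n, experimental_probability(n))
```

With few trials the printed value can be far from 0.5, while with many trials it stays close.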
The probability that multiple independent events all occur is the product of their probabilities.
The A or B probability is the sum of both events' favourable outcomes minus the OVERLAP (common outcomes), so that nothing is counted twice. That is (A + B - C), where C is the overlap.
The formula is: P(A or B) = P(A) + P(B) - P(A and B).
Refer to Khan academy: Addition rule for probability
Solve:
Mutually exclusive events cannot happen at the same time.
Solve:
P(A or B) = (22 + 33 - 15)/50 = 40/50 = 0.8
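The addition rule in the solution above can be checked with exact fractions. A minimal sketch: the counts 22, 33, 15 and 50 come from the worked answer, while the function and parameter names are hypothetical because the original problem statement isn't shown here.

```python
from fractions import Fraction

def prob_a_or_b(count_a, count_b, count_both, total):
    """P(A or B) from counts: add both events, subtract the overlap once."""
    return Fraction(count_a + count_b - count_both, total)

p = prob_a_or_b(count_a=22, count_b=33, count_both=15, total=50)
print(p, float(p))
```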
The problem "What's the probability of flipping a coin and getting 3 heads in a row?" is a typical Compound probability of independent events problem.
The formula for the Compound probability of independent events is the same as the multiplication rule.
Example: Flipping a coin three times, what's the probability of getting a tail, a head and a tail, in that order?
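One way to convince yourself of the multiplication rule is to list every outcome and count. The sketch below assumes a fair coin and compares the enumerated answer for tail-head-tail with the product of the three individual probabilities.

```python
from fractions import Fraction
from itertools import product

# All equally likely sequences of three flips: HHH, HHT, ..., TTT
outcomes = list(product("HT", repeat=3))
favourable = [o for o in outcomes if o == ("T", "H", "T")]

p_enumerated = Fraction(len(favourable), len(outcomes))
p_multiplied = Fraction(1, 2) * Fraction(1, 2) * Fraction(1, 2)

print(p_enumerated, p_multiplied)
```

Both approaches give the same value, which is the point of the multiplication rule for independent events.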
Dependent probability means the probability of the second event changes because of what happened first.
Refer to Khan academy: Dependent probability introduction
Two events are INDEPENDENT to each other when:
Furthermore, with concept of Conditional Probability, two events are INDEPENDENT when:
Dependent Probability
& Independent Probability
▶ Practice on Khan academy: Dependent and independent events
Solve:
Solve:
It's NOT just that both events happened; it asks for the probability of one event GIVEN that another event has already happened. Because it's conditioned on an event that has happened, you divide by the probability of that event.
Probability of B given A
(B after A, or B in condition of A),
or Probability of A given B
:
Refer to youtube: Probability Part 2: Updating Your Beliefs with Bayes: Crash Course Statistics #14
Based on the concept of Sets. It's a lot more intuitive to understand with a Venn diagram. P(A | B) = P(A & B) / P(B). Circle B is known to have happened, so any occurrence of A must lie INSIDE circle B, which is P(A & B). Dividing by P(B) gives the proportion of B that A takes up.
Imagine there are many "parallel worlds", say A-world & B-world, which are the worlds where A & B occur. They're parallel and happening at the same time, yet there could be a chance of intersection: A occurs in B's world, or B occurs in A's world.
And the chance that one event occurs in "another world" is the Conditional probability.
In the context of the Anime Steins;Gate, the conditional probability is chance of Mayuri being killed in Alpha-worldline.
Steins;Gate Worldline:
It means "A after B", or "A after B has happened".
Instead of happening at the same time P(A and B)
, the probability won't be the same if one has already happened.
It shows how much of the event B, which has already happened, is covered by A and B.
Solve:
conditional probability
formula: P(A|B) = P(A and B) / P(B)
P(south Asia ∣ high) = P(south Asia and High) / P(high) = 7/188 ÷ 87/188 = 7/87
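The same calculation can be reproduced from the raw counts. A minimal sketch, using the counts from the solution above (7, 87 and 188); the variable names are hypothetical.

```python
from fractions import Fraction

total = 188               # all entries in the table
high = 87                 # entries in the "high" category
south_asia_and_high = 7   # entries that are both south Asia and high

p_high = Fraction(high, total)
p_both = Fraction(south_asia_and_high, total)

# P(A | B) = P(A and B) / P(B)
p_south_asia_given_high = p_both / p_high
print(p_south_asia_given_high)
```

Note that the total cancels out, so the answer is just the joint count divided by the count of the condition.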
Bayes' Theorem is a revolution in conditional probability.
It is not meant to be a one-off calculation, but a process of improvement: each new piece of evidence gives a little more confidence.
Refer to youtube: Bayes' Theorem - The Simplest Case Refer to youtube: The Bayesian Trap
The formula of Bayes' Theorem is just a slight extension of Conditional Probability.
▲ The probabilities of A given B and B given A have the same numerator, P(A and B). That means we can compute a conditional probability from its reversed one.
How does it make sense?
In real life, sometimes A given B
is easy to get, sometimes B given A
is easier to get.
So whenever A given B is difficult to compute directly, we can use the probability of B given A to compute it.
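Reversing a conditional probability can be sketched as follows, using the standard form P(A|B) = P(B|A) · P(A) / P(B). The medical-test numbers are made up purely for illustration and do not come from the notes or the linked videos.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) computed from the reversed conditional P(B | A)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 1% have the disease, the test catches 90% of them,
# and 9% of healthy people still test positive.
p_disease = 0.01
p_pos_given_disease = 0.90
p_pos_given_healthy = 0.09

# Total probability of a positive test, over both groups
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos)
print(round(p_disease_given_pos, 4))
```

Under these assumed numbers, a positive result still leaves the chance of disease below 10%, which is the "surprising false positives" effect.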
Refer to youtube: Bayes' Theorem - Example: A disjoint union
Refer to youtube: Bayes' Theorem Example: Surprising False Positives
Instead of analyzing a measured distribution with explicit data, we abstract those analysis methods to work with uncertain data. It's like abstracting arithmetic to algebra.
Refer to Khan academy: Random variables
▶︎ Jump over to Khan academy for practice: Constructing probability distributions
Random Variables
are just like the unknowns in algebra. Except it's slightly different in Statistics.
Remember that: Studying Random Variables
is just like studying Algebra
over Arithmetic.
More precisely,
Random variables
are neither random nor variables. (Try to google that)
Random Variables are denoted by capital letters: X, Y, Z
Solve:
Solve:
Refer to Khan academy: Discrete and continuous random variables
"Discrete" literally means "Distinct" or "Separate" values.
The most useful one for real life is the Discrete Distribution
, and we're gonna talk about it mostly.
Refer to Wiki: Probability Distribution
It takes each Random Variable's value as an input, to form a distribution.
E.g., the values of a Discrete Random Variable form a Discrete Distribution.
Discrete distribution
Uniform distribution
Bernoulli distribution
Normal distribution
Poisson distribution
In the case of a discrete random variable, the expected value or mean, denoted as E(X) or μx, is the long-run average outcome. To find the expected value, take each value, multiply it by its respective probability, and add up all the products:
E(X) = μx = ∑ x · P(x) (where the sum runs over all possible values of x)
Solve:
μx = (-10*0.81) + (40*0.18) + (90*0.01) = -8.1 + 7.2 + 0.9 = 0
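The "multiply each value by its probability and add" recipe can be sketched in a few lines. The values and probabilities are those from the solution above; the function name `expected_value` is hypothetical.

```python
def expected_value(distribution):
    """Sum of value * probability over a {value: probability} mapping."""
    return sum(value * prob for value, prob in distribution.items())

# Values and probabilities from the worked solution
payoffs = {-10: 0.81, 40: 0.18, 90: 0.01}

mu = expected_value(payoffs)
print(round(mu, 10))
```

It is worth checking first that the probabilities add up to 1, otherwise the mapping is not a valid distribution.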
Variance (σ²): σ² = ∑ (x - μx)² · P(x)
Standard Deviation (σ): σ = √σ²
Solve:
σ² = (1-1.75)^2*0.5 + (2-1.75)^2*0.25 + (3-1.75)^2*0.25 = 0.6875
σ = √0.6875 ≈ 0.83
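The variance and standard deviation above can be recomputed directly from the definition. A minimal sketch using the distribution implied by the solution (values 1, 2, 3 with probabilities 0.5, 0.25, 0.25); the variable names are hypothetical.

```python
import math

dist = {1: 0.5, 2: 0.25, 3: 0.25}  # {value: probability}

mu = sum(x * p for x, p in dist.items())
variance = sum((x - mu) ** 2 * p for x, p in dist.items())
sd = math.sqrt(variance)

print(mu, variance, round(sd, 2))
```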
Discrete distribution: it can only take distinct, countable values.
Continuous distribution: it can take any of infinitely many values in a range.
That's why a Discrete Distribution is drawn as a histogram, and a Continuous Distribution as a density curve.
Solve:
P(x>3) is the area of the triangle between 3 and 5.
P(x>3) = Area(3 to 5) = 1/2 * (5-3) * 0.6 = 0.6
We could use any Graphing Calculator or online calculator, and input the Mean, Standard Deviation, Lower bound, and Upper bound.
▶︎ Online Normal Distribution Calculator
.
E.g., we know the mean = 70 and SD = 6, and are asked to calculate the probability of a value greater than 61.
Entering these values into the calculator gives the answer.
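If no calculator is at hand, the same number can be obtained with Python's standard library. This is a sketch using `statistics.NormalDist` (available since Python 3.8) with the mean 70 and SD 6 from the example; it is one possible tool, not the calculator the notes link to.

```python
from statistics import NormalDist

dist = NormalDist(mu=70, sigma=6)

# P(X > 61) is the area to the right of 61: one minus the cumulative probability
p_greater = 1 - dist.cdf(61)
print(round(p_greater, 4))
```

Here 61 sits 1.5 standard deviations below the mean, so most of the area lies to its right.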
Solve:
To find P(X<1), we need the area under the curve where X<1.
The Z-score at X=1 is necessary for the calculation. Z-score = (X-μ)/σ = (1 - 1)/0.05 = 0
From the Z-score table, we know that the cumulative probability at Z = 0 is 0.5.
Some basic "algebraic" operations, like adding/multiplying a number, or combining different R.V.s
Adding or subtracting a constant to Random Variable X has these effects:
Scaling Random Variable X by a constant has these effects:
Solve:
μY = 10(μX) + 5 = 24.5, because the mean is affected by both shift & scale.
σY = 10(σX) = 8, because σ is only affected by scale.
Refer to wiki: Algebra of random variables Refer to article on Khan academy: Combining random variables
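The shift-and-scale solution above (Y = 10X + 5) can be sketched as a small helper. The inputs μX = 1.95 and σX = 0.8 are back-solved from the worked answers 24.5 and 8, so treat them as assumptions; the function name is hypothetical.

```python
def linear_transform(mu_x, sigma_x, scale, shift):
    """Mean and SD of Y = scale * X + shift."""
    mu_y = scale * mu_x + shift        # the mean is scaled, then shifted
    sigma_y = abs(scale) * sigma_x     # the SD is only scaled; a shift does not change spread
    return mu_y, sigma_y

mu_y, sigma_y = linear_transform(mu_x=1.95, sigma_x=0.8, scale=10, shift=5)
print(mu_y, sigma_y)
```

The `abs` matters when the scale is negative: flipping the sign of X changes the mean but not the spread.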
Important facts about combining variances:
We can find the combined standard deviation by taking the square root √ of the combined variances.
If both Random Variables are normally distributed, then the Difference of them will also be normally distributed.
Solve:
Remember: If both Random Variables are normally distributed, then the Difference of them will also be normally distributed.
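For independent Random Variables the variances add, even for a difference, and the combined SD is the square root of that sum. The sketch below uses made-up parameters (X with mean 50 and SD 3, Y with mean 45 and SD 4); they are not the numbers of the exercise in these notes.

```python
import math
from statistics import NormalDist

# Hypothetical independent, normally distributed variables
mu_x, sd_x = 50, 3
mu_y, sd_y = 45, 4

# D = X - Y: means subtract, variances ADD
mu_d = mu_x - mu_y
sd_d = math.sqrt(sd_x ** 2 + sd_y ** 2)

# D is also normal, so an interval probability is a difference of two cdf values
d = NormalDist(mu=mu_d, sigma=sd_d)
p_between = d.cdf(10) - d.cdf(0)

print(mu_d, sd_d, round(p_between, 4))
```

With these assumed values the interval 0 to 10 is exactly one standard deviation either side of the mean of D.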
Solve:
Let D be the new Random Variable, where D = X - Y
-10 < D < 10
0.57.
It's also known as the Expectation, Mathematical Expectation, EV, Average, Mean Value, Mean, or First Moment.
"In probability theory, the expected value of a random variable, intuitively, is the long-run average value of repetitions of the experiment it represents."
Solve:
-1.4 -0.3 +1.2 +2.1 = 1.6
Solve:
Solve:
▶︎ Practice at Khan academy: Making decisions with expected values
⦿ Mindset: It's better to count 1 by 1 rather than trying to apply formulas.
There're 2 ways to calculate the Expected Value
:
E(X) = ∑ Probability · value
E(X) = ∑ Relative Frequency · value
Relative Frequency means how often something happens divided by the total number of outcomes.
Depending on the case we're analyzing, we can choose either way to calculate the expected value.
All the Relative Frequencies add up to 1.
▶︎ Jump back to previous note on: Permutation & Combination
It also means the Relevant Outcomes
, which is calculation of "n choose k" combinations.
▶︎ Jump back to previous note on: Intro to Probability
It literally means the Total Outcomes
.
For a Flip Coin problem (Yes-No problem), the number of total outcomes is 2^trials, which means 2 * 2 * 2 ....
E.g., the total outcomes of "flipping a coin 5 times" is 2⁵ = 32.
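Total outcomes and relevant outcomes can both be counted by brute force, which fits the "count 1 by 1" mindset. A minimal sketch for 5 flips; the question "exactly 2 heads" is a hypothetical example chosen here, not one of the exercises.

```python
from fractions import Fraction
from itertools import product
from math import comb

flips = 5
outcomes = list(product("HT", repeat=flips))  # every possible sequence
total_outcomes = len(outcomes)                # should equal 2 ** flips

# Relevant outcomes: sequences with exactly 2 heads
relevant = sum(1 for seq in outcomes if seq.count("H") == 2)

probability = Fraction(relevant, total_outcomes)
print(total_outcomes, relevant, comb(flips, 2), probability)
```

The brute-force count of relevant outcomes matches the "n choose k" value from `math.comb`.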
▶︎ Practice at Khan academy: Expected value with calculated probabilities
Solve:
Flipping Coin problem
.
Solve:
Solve:
Solve:
Refer to Khan academy: Getting data from expected value
▶︎ Practice at Khan academy: Expected value with empirical probabilities
Solve:
Solve:
It is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q=1-p, that is, the probability distribution of any single experiment that asks a yes–no question; the question results in a boolean-valued outcome, a single bit of information whose value is success/yes/true/one with probability p and failure/no/false/zero with probability q.
Refer to wiki: Bernoulli Distribution Refer to Khan academy: Bernoulli distribution mean and variance formulas
All Bernoulli distributions are binomial distributions, but most binomial distributions are not Bernoulli distributions.
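The Bernoulli mean and variance follow directly from the general definitions applied to the two values 0 and 1. A sketch with an assumed p = 3/10, using exact fractions so the comparison with the closed forms p and p(1-p) is exact.

```python
from fractions import Fraction

p = Fraction(3, 10)   # hypothetical success probability
q = 1 - p

dist = {1: p, 0: q}   # value 1 with probability p, value 0 with probability q

mean = sum(x * prob for x, prob in dist.items())
variance = sum((x - mean) ** 2 * prob for x, prob in dist.items())

print(mean, variance)
```

The results equal p and p·q, which is the special case n = 1 of the binomial formulas.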
Binomial Distribution is one of the Discrete Distributions.
Binomial means "Two terms"; accordingly, a Binomial Random Variable is a Random Variable with TWO Parameters:
n: A fixed number of trials.
p: A fixed & constant probability of each trial being a success.
Refer to wiki: Binomial distribution Refer to article: Binomial Random Variables Refer to Khan academy: Binomial variables
The requirements for a random experiment to be a binomial experiment are:
n: There is a fixed total number of trials.
p: A fixed & constant probability of success for each trial.
Yes-no question: Each trial's outcome is either success or failure.
Independent: Each trial is independent of the others.
It can be simplified as: N, P, YES-NO, INDEPENDENT
To identify a binomial random variable, we also need to verify its independence. With a large population we can't examine every member, so we sample a smaller number of trials.
Sampling WITHOUT replacement is NOT really independent, because each item you take out affects the probabilities for the remaining items. The good thing is that if the population is large enough, the lack of replacement barely affects the result.
That's the reason for the 10% Rule: if the sample is less than 10% of the population, we can treat each trial as independent, because the sample is too small a portion to change the probabilities noticeably.
It means that the sample was selected in such a way that each member, and each set of members, has an equal chance of being in the sample.
Sampling without replacement means that every time you take out a sample, the total number decreases, which affects the probabilities for the remaining items. E.g., with 10 balls of various colors, if you take out a red ball, then the probability of getting another red ball from the remaining 9 balls decreases.
Sampling with replacement means that each time you take out a sample, you put it back.
10% Rule is a rule for assuming independence between trials.
If the sample is less than 10% of the population, we can assume each trial is independent, because the portion is too small to affect the whole.
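To see why the 10% Rule is reasonable, one can compare the exact probability for sampling without replacement (hypergeometric) with the binomial approximation. The population of 1000 with 300 "successes" and the two sample sizes are made-up numbers for illustration.

```python
from math import comb

def hypergeom_pmf(k, population, successes, draws):
    """Exact P(k successes) when drawing WITHOUT replacement."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))

def binom_pmf(k, n, p):
    """P(k successes) assuming independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

population, successes = 1000, 300
p = successes / population

# Sample of 10 (1% of the population): well within the 10% Rule
small_gap = abs(hypergeom_pmf(3, population, successes, 10) - binom_pmf(3, 10, p))

# Sample of 500 (50% of the population): far outside the 10% Rule
large_gap = abs(hypergeom_pmf(150, population, successes, 500) - binom_pmf(150, 500, p))

print(small_gap, large_gap)
```

The gap between the exact and the approximate answer is much smaller for the small sample, which is what the rule relies on.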
Refer to article on Khan academy: Binomial probability (basic) Refer to Khan academy: Generalizing k scores in n attempts
▶︎ Online Binomial Probability Calculator
In words:
P(X=r) = Combinations × P(yes)^r × P(no)^(n-r)
For the combinations, here's the formula: C(n, r) = n! / (r! · (n-r)!)
Or use the ▶︎ Online Combination Calculator.
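The binomial probability formula translates almost word for word into code. A minimal sketch using `math.comb` for the combinations; the example of 3 successes in 10 trials with p = 0.5 is hypothetical.

```python
from math import comb

def binomial_probability(r, n, p):
    """P(X = r): combinations * P(yes)^r * P(no)^(n - r)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

# Hypothetical example: exactly 3 heads in 10 fair-coin flips
prob = binomial_probability(r=3, n=10, p=0.5)
print(prob)

# Sanity check: the probabilities over all possible r add up to 1
total = sum(binomial_probability(r, 10, 0.5) for r in range(11))
print(total)
```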
Example:
Solve:
Binomial Probability Formula
, the answer is:
Expected Value = Mean = μx = np
Variance = σx² = np(1-p)
Standard Deviation = σx = √(np(1-p))
Solve:
Solve:
Mean = μx = np = 100 * 0.25 = 25
SD = σx = √(np(1-p)) = √(25*0.75) = 4.33
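The mean and SD in the solution above can be recomputed with a short helper. The values n = 100 and p = 0.25 come from the solution; the function name is hypothetical.

```python
import math

def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial random variable."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean, sd

mean, sd = binomial_mean_sd(n=100, p=0.25)
print(mean, round(sd, 2))
```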
Solve:
Solve:
P(X > 3) = P(4) + P(5), or P(X>3) = 1 - (P(0) + P(1) + P(2) + P(3)); we're gonna use the first one in this case.
Study Resources
Tools
Khan academy AP Statistics
Course Challenge
◆Machine Learning related topics