ossu / data-science

📊 Path to a free self-taught education in Data Science!

RFC: Overhaul Statistics #112

Open waciumawanjohi opened 11 months ago

waciumawanjohi commented 11 months ago

Summary

OSSU should undertake a search for a number of new courses in statistics.

Background

OSSU currently recommends 2 courses on statistics:

The first of these is no longer offered.

Guidelines

OSSU Data Science uses the report Curriculum Guidelines for Undergraduate Programs in Data Science as our guide for course recommendation.

Section 6 "Transitioning To A Data Science Major Using Typical Existing Courses" states:

...The courses shown in bold are the ten courses that cover the bare minimum of the basic skills needed for data science...

Subsection 6.3 "Courses in Statistics" states:

Content in the Introduction to Statistics course should follow the revised Guidelines for Assessment and Instruction in Statistics Education (GAISE) for college courses

  • Introduction to Statistics
  • Statistical Modeling/Regression
  • Machine Learning/Data Mining
  • Theory of Statistics (requires Probability Theory)

GAISE

For reference, the K-12 GAISE report uses a framework of 3 levels of sophistication with stats expected of K-12 students. This can be found on page 24.

The GAISE College Report includes goals, recommendations, and suggestions for topics that might be omitted.

Goals (summarized)

  1. Critique statistics-based results and conclusions.
  2. Recognize when statistics would be useful and carry out investigations using stats.
  3. Produce graphical displays and numerical summaries. Interpret them.
  4. Explain the role of variability in statistics.
  5. Explain the central role of randomness in designing studies and drawing conclusions.
  6. Use statistical models, including multivariable models.
  7. Understand and use hypothesis tests and interval estimation in a variety of settings.
  8. Interpret and draw conclusions from output of statistical software packages.
  9. Demonstrate an awareness of ethical issues associated with sound statistical practice.

Recommendations

These are largely recommendations for how statistics courses should be taught.

  1. Teach statistical thinking
  2. Focus on conceptual understanding
  3. Integrate real data with a context and a purpose
  4. Foster active learning
  5. Use technology to explore concepts and analyze data
  6. Use assessments to improve and evaluate student learning

Suggestions for Topics that Might be Omitted from Introductory Statistics Courses

Of note, the basic statistics section reads:

Histograms, pie charts, scatterplots, means, and medians are now taught in middle and high school and are a prominent part of the Common Core State Standards in Mathematics. Classes taught to adults continuing their education or to students with a different high school background may need to spend a bit more time on basic statistics. No matter the audience, instructors will want to be sure that students truly understand these concepts, but should not dwell on them more than is necessary. Instructors may want to briefly review them to be sure terminology and notation are consistent, but this should take little time.

Assertions

  • OSSU Data Science curriculum should not recommend a descriptive stats course. This is prerequisite material; OSSU's focus is requisite material for undergraduate learners.
  • OSSU should identify a suitable Introduction to Statistics course, replacing the two current recommendations.
  • After identifying the appropriate Introduction to Statistics course, OSSU should determine if a Statistical Modeling/Regression course is necessary. I would be unsurprised if a suitably rigorous intro course, paired with our existing ML courses, proves sufficient.
  • OSSU should identify an optional Theory of Statistics course.

Request for Comments

This RFC is asking specifically for comments on the assertions above. Are these the right steps? Are there other implications for OSSU's curriculum that are not identified?

There will be other RFCs for carrying out the individual steps (e.g. there will be a separate RFC for Identify an Introduction to Statistics course).

waciumawanjohi commented 11 months ago

A big thank you to @reallyyy for reporting that the Descriptive Statistics course was no longer available, prompting this investigation.

waciumawanjohi commented 11 months ago

For individuals that would like to get a head start on identifying a suitable Introduction to Statistics course, below is a list of resources that you may start with. Remember:

  1. The recommendation should be a separate RFC.
  2. The RFC should take a position. I.e. the submitter should examine the courses, consider their strengths and weaknesses and make a recommendation for which is best for OSSU. Characteristics to consider:
    • Lectures are preferred but not required. If lectures are not present, lecture notes or texts that are written in a fluent manner (e.g. not notes in sentence fragments) are preferred.
    • Courses with feedback are strongly preferred. Self grading is a form of feedback (e.g. HW sets that provide solutions).
    • Beginning with a rubric for what an introductory stats class should contain may simplify the analysis effort.

  • MIT OCW Statistics For Applications
  • OpenStax Introductory Statistics Textbook
  • Saylor.org Introduction to Statistics
  • Stanford/Coursera Introduction to Statistics
  • MIT/edX Fundamentals of Statistics
  • Carnegie Mellon/OLI Probability & Statistics
  • MIT OCW Introduction To Probability And Statistics
  • Numerous YouTube playlists

waciumawanjohi commented 11 months ago

For individuals that would like to get a head start on identifying a suitable Theory of Statistics course, below is a list of resources that you may start with. The notes about analysis in the comment above apply here as well.

  • University of Arizona Theory of Statistics: Includes lectures and assignments, no solutions.
  • Stanford Stat 300A – Theory of Statistics: Includes handouts, HW with solutions, exams with solutions, no lectures or lecture notes.
  • Berkeley Statistics 210A: Theoretical Statistics (Fall 2021): Lecture notes, HW without solutions. There is a Fall 2023 version underway.
  • University of Minnesota Statistics 5101 Theory of Statistics I: Course for students pursuing a BS (4101 is Theory of Stats I for students pursuing a BA). Course slides, HW and exams without solutions, links to past course pages.
  • MIT 9.520/6.860: Statistical Learning Theory and Applications: YouTube lectures. No HW or exams. Course page

bradleygrant commented 11 months ago

There's a conflation among the assertions put forward that descriptive statistics = "basic statistics" and therefore OSSU shouldn't spend the time on it because it's prerequisite material.

In short, no.

In long, noooooooooooooooooo.

Mean, median, and mode, stem & leaf, and scatterplots together represent the entirety of statistics encountered in high school. But this is Day 1 material in a university-level descriptive statistics course (though this is also encountered in probability, and therefore these courses are typically taught jointly as an introductory probability-and-statistics course).

After they spend roughly 60% of their time just cleaning their data, practicing data scientists spend roughly the next 20% of their time doing exploratory data analysis -- which leans heavily on descriptive statistics to characterize a dataset's distribution. The importance of mean, median and mode cannot be overstated -- but other values like variance, IQR, mean absolute deviation, central moments, kurtosis, scedasticity, Kolmogorov–Smirnov test scores, etc. identify key descriptive signatures of a distribution.
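To make a few of the measures named above concrete, here is a minimal sketch using only Python's standard `statistics` module. The `describe` helper and the sample data are hypothetical, made up for illustration; scedasticity and the Kolmogorov–Smirnov test are left out because they need more than the standard library.

```python
import statistics as st


def describe(values):
    """Return a dict of basic descriptive statistics for a list of numbers.

    Hypothetical helper for illustration only.
    """
    n = len(values)
    mean = st.fmean(values)
    # Quartiles via the module's default ("exclusive") method.
    q1, _, q3 = st.quantiles(values, n=4)
    # Second and fourth central moments (population form, divide by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": st.median(values),
        "mode": st.mode(values),
        "variance": st.variance(values),  # sample variance (divide by n - 1)
        "iqr": q3 - q1,
        "mean_abs_dev": sum(abs(x - mean) for x in values) / n,
        "excess_kurtosis": m4 / m2**2 - 3,  # 0 for a normal distribution
    }


# Made-up sample data.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that different tools use different conventions for quartiles and for sample versus population moments, so values such as the IQR or kurtosis may not match another package exactly.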

No, we need a descriptive statistics course.

The OSSU data science curriculum goes up through multivariate calculus. I propose as a benchmark course Georgia Tech's ISYE 6739 (co-listed as ISYE 4739 for undergraduates). This combination probability/statistics course builds on a multivariate calculus foundation at a level appropriate for motivated undergraduates without prior exposure to probability or statistics. This is a rigorous yet effective combined probability/statistics course that does a good job of covering the basics to a point sufficient for further study, even graduate study. Prof. Goldsman really hits the Goldilocks Zone here -- none too esoteric, none too powderpuff. This course includes everything you need to set up further study in data analytics or operations research.

waciumawanjohi commented 10 months ago

The importance of mean, median and mode cannot be overstated -- but other values like variance, IQR, mean absolute deviation, central moments, kurtosis, scedasticity, Kolmogorov–Smirnov test scores, etc. identify key descriptive signatures of a distribution.

To be clear, the descriptive stats course did not cover the advanced topics you list. But you are correct that I conflated all descriptive stats with basic stats.

Assertion: OSSU Data Science curriculum should not recommend a basic stats course. This is prerequisite material; OSSU's focus is requisite material for undergraduate learners.

waciumawanjohi commented 10 months ago

Candidate courses for an intro to stats RFC are now:

Smcgb commented 7 months ago

Edited for clarity

Hello Everyone,

I'd like to recommend two courses for candidates in our Data Science Statistics program:

  • Statistical Learning with Python by Stanford University on EdX
  • Statistical Learning by Stanford University on EdX

Both courses are based on the same content, differing only in the programming language used. They're aligned with a free book available at www.statlearning.com.

These courses offer an extensive introduction to statistical learning methods, crucial for anyone pursuing a career in data science. The authors are renowned figures in the data science community, and this book is frequently recommended on various Data Science, Machine Learning, and AI subreddits.

Why These Courses Are Beneficial:

  1. Relevance to Data Science: These courses emphasize statistical learning, an essential skill for data analysis and interpretation. They serve as an excellent bridge from basic programming and statistics to advanced model building.

  2. Curriculum Integration: They address gaps in the current curriculum with a focused approach to statistical learning techniques.

  3. Expert Instruction: Taught by leading experts, these courses are acclaimed for their clarity and depth. Larry Wasserman, a respected Professor in Statistics and Machine Learning, endorses the course book.

  4. Accessibility: Both courses are available for free on the EdX platform, and the book can be downloaded from the course website. Python labs can be found at this GitHub repository, and the course website provides direct files for both R and Python.

  5. Framework Flexibility: The program offers a choice of frameworks including PyTorch, TensorFlow, Keras, etc.

  6. Practical Application: The courses include hands-on exercises and real-world examples, ensuring practical understanding and application.

These courses are an invaluable resource for anyone aspiring to deeply understand and apply data science principles.

waciumawanjohi commented 7 months ago

@Smcgb The course describes itself as an "introductory-level course in supervised learning", so would follow an introduction to statistics.

Can you open a separate RFC to recommend the addition of this course to the curriculum? We'll leave the RFC open for 1 month for others to comment. The change looks like a positive one to me. After the one-month comment period we can include the course in the curriculum.

One optional edit that you can make to the RFC is to link to some of the recommendations for the book that you mention.

Thanks for looking for ways to improve the curriculum!