ossu / data-science

📊 Path to a free self-taught education in Data Science!

Request for Comments: Data Science Curriculum v2 #61

Closed waciumawanjohi closed 4 years ago

waciumawanjohi commented 4 years ago

Problem: The curriculum has not been maintained and does not represent best practice.

Duration: 2020-08-31

Background: OSSU recommends courses that would constitute an undergraduate major in Data Science. It is our responsibility to ensure that we follow best practice. To do so, we must bring the curriculum into alignment with external guidelines. A candidate set of guidelines has been identified and previously proposed.

In 2017, the Annual Review of Statistics and Its Application published the report "Curriculum guidelines for undergraduate programs in data science." The report was authored by “25 undergraduate faculty from a variety of institutions in the United States, primarily from the disciplines of mathematics, statistics, and computer science.” It had a goal of providing “structure for institutions planning for or revising a major in data science.”

The current state of OSSU Data Science is one of disrepair. The curriculum has had one change in three years: the deletion of a link to a broken application. Many links to courses that are no longer offered remain. A list of these can be found here. Prospective students have posted in the issues asking whether the Data Science curriculum is still maintained. The updated curriculum must ensure that all courses are available to students.

Proposal: OSSU Data Science should adopt “Curriculum guidelines for undergraduate programs in data science” (CGUPDS) as our guidelines. The curriculum should be updated to match. The exact changes can be reviewed in this pull request.

spamegg1 commented 4 years ago

It might be helpful to have a direct link to the guidelines so it's easier to read and comment: https://www.amstat.org/asa/files/pdfs/EDU-DataScienceGuidelines.pdf

Very short and nice read, just 16 pages.

Looks like you already made all the necessary course changes in the pull request. All the links are alive and the courses cover the Key Competencies on Page 6, and the Six Main Subject Areas & Outline on Page 9 very well. Really excellent work!

The topic progression path in the pull request looks somewhat different than the possible path in Figure 1, Page 12 of the guidelines. I think it's fine, but maybe you could explain a little bit for those that are curious?

One question to which I do not see an immediate answer is, where would the "Capstone Experience" and "Course in an outside discipline" mentioned in the Outline on Page 9 come from? Are they contained in some of the courses in the curriculum?

By the way I don't know much about data science at all, and I don't have a horse in this race. Just trying to be helpful.

mattjperez commented 4 years ago

For suggested changes at https://github.com/ossu/data-science/pull/60#discussion_r468003943 regarding Algorithms Part 1 and 2

These courses are taught entirely in Java and have a steep learning curve from the start. Those coming straight from Python/R/Julia will have a hard time adjusting to both the course materials and the programming syntax. I suggest an optional course on Java as a prerequisite, specifically Java Programming I and II by the University of Helsinki, which gives college credit to residents of Finland.

waciumawanjohi commented 4 years ago

I certainly think adding resources in Extras for teaching Java would be appropriate, as would adding a note in the main curriculum that those resources are available.

The University of Helsinki courses are high quality and I have no objection to listing them as a resource.

One other option to keep in mind is Computer Science: Programming with Purpose. One point in favor of this alternative is that it is taught by the same instructor as the Algorithms courses. It could be used instead of Introduction to Computer Science and Programming Using Python and Introduction to Computational Thinking and Data Science, or in addition to them.

mattjperez commented 4 years ago

Yeah, that's not an easy choice. The MITx pair go well together as a series just like Sedgewick's series.

On one hand, MITx uses Python, which is what most people will be programming with. Intro to Computational Thinking might not be as rigorous as the alternative, but it covers a range of topics implemented in Python (distributions, Monte Carlo, etc.). This would be helpful for the DS student, as it gives more practice in a language they'll definitely encounter. It also reinforces topics covered in probability and statistics.

On the other hand, Sedgewick's courses will give you a very thorough understanding of algorithms specifically, and the textbooks are available online, with updates and resources. Learning Java will also be good for anyone who will work at larger companies and be exposed to these types of codebases, so this route would be good for something you 'might' encounter.

Personally, I think the Sedgewick combination would be best in the CS curriculum, mainly because it's more aligned with CS than DS. In my opinion, these courses aren't as necessary for machine/deep learning. They would be if you were programming the libraries themselves, but that's why I think they're more relevant for CS.

I would definitely suggest having them in the DS curriculum as an extra, though.

staycul commented 4 years ago

> For suggested changes at #60 (comment) regarding Algorithms Part 1 and 2
>
> These courses are completely in Java and have a steep learning curve from the start. Those coming straight from python/r/julia will have a hard time adjusting to both the course materials and programming syntax. Suggest an optional course on Java as pre-requisite, specifically Java Programming I and II by University of Helsinki. Gives college credit for Finland residents.

Head First Java might be an excellent, beginner-friendly option.

jromani-ds commented 4 years ago

Regarding the Algorithms section.

The OSSU route for CS suggests the Algorithms specialization from Stanford on Coursera: https://www.coursera.org/specializations/algorithms

The DS major suggests Algorithms I & II from Princeton on Coursera: https://www.coursera.org/learn/algorithms-part1

Would there be value in using the same set of courses to cover algorithms between both programs?

waciumawanjohi commented 4 years ago

> Would there be value in using the same set of courses to cover algorithms between both programs?

Yes, there would be. While Discord channels for the Data Science individual courses have not been added yet, they will be in the future. If Data Science and Computer Science students are in the same course, they can be in the same discussion rooms, increasing critical mass for productive peer learning.

The natural next question is: Why does the proposal include a different algorithms course for Data Science?

Essentially, computer scientists need to know more about complexity and computability than data scientists do. Some CS2013 requirements are:

These match up with the 3rd and 4th Stanford algorithms courses, which teach:

The CGUPDS, by contrast, requires:

This is a decent fit for Princeton's Algorithms which teaches:

I think that curricular fit here is an overriding concern, but I'm interested to hear the opposing case.

jromani-ds commented 4 years ago

@waciumawanjohi

> I think that curricular fit here is an overriding concern, but I'm interested to hear the opposing case.

I agree with your concern. The default proposed curriculum should cover the material in CGUPDS and not try to cover an inordinate amount of additional material.

I think another possibility would be to list courses that appear in both curricula as alternatives, where appropriate.

For example, the DS curriculum could list the Stanford Algorithms Specialization as an alternative that also fulfills the DS algorithms requirement, with the caveat that the Stanford specialization covers more material and requires a larger time commitment.

This may help capture the benefit you mentioned:

> If Data Science and Computer Science students are in the same course, they can be in the same discussion rooms, increasing critical mass for productive peer learning.

If an acceptable alternative is present in the CS program, then listing it would seem to facilitate this goal.

EWCunha commented 4 years ago

Have you guys seen The Open Source Data Science Masters website, the GitHub of Siraj Raval (a data science YouTuber), and Data Science From Scratch? Maybe they have good guidelines and course options for the new DS Curriculum... I don't know... just giving suggestions...

bradleygrant commented 4 years ago

I just found this RFC on Friday 8/28, and I haven't yet had a chance to deep-dive it, but in the spirit of commenting before the close date of the RFC, I have a few thoughts:

  1. My biggest criticism of the program as it's currently presented is we basically say, "So you want to learn Data Science, do ya? Great -- go take four semesters of math first!" What a disappointment. Calculus doesn't come easily for most, and to suggest you have to be competent in mathematics as a prerequisite is somewhat discouraging to those starting from zero.

    • More importantly, it's also unnecessary. One could explore the concepts of data science -- especially classification problems -- with little more than an understanding of how Euclidean distance works. (SVM, k-means, k-nearest neighbors have some fancy math behind the scenes, but understanding how they work at a naive level is no more complicated than calculating distance between two points. And you get some really cool results early on.)

    • An exploration of the concepts of data science can also get people motivated and willing to take on four to six semesters worth of math courses via independent study.

    • Doing this mirrors what we do in our computer science curriculum: give new entrants a taste of the good stuff, up front. The core of the program is the "How To Design Programs" series, but there's a reason we don't dump new entrants there first.

    • Key Recommendation: Find a suitable early-entry data analysis course to offer as a parallel offering to LAFF. MIT's 6.00.2x might fill that role.

  2. The OSSU curricula (especially for computer science) strive to be platform-agnostic, and tend to be more interested in academic rigor than in practical skills. But in my personal view, data science is at its heart a skill-based discipline, with some sciencey aspects involved to justify assumptions with a repeatable approach. The output of a data science endeavor is inherently based on value -- a "goodness" of model fit, a suitability for a practical purpose.

    • With this in mind, should we have some thought towards practical skills classes early in the curriculum? Even if they're not in the format of full semester classes, perhaps "lab-based" short courses in day-to-day workflow tasks like repository management, data scraping/munging/cleanup/tidying, "practical" R, an exploration of Wickham's Tidyverse methods, etc.

    • This will give students something else to work on during the skill-building phase, while they're trying to climb Mount Mathematics.

    • Key Recommendation: Provide skill-based workflow classes to augment the early program experience. An intro to programming in R course, followed by a treatment of Wickham's R For Data Science book might fill that role.
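
To illustrate the claim in point 1 that distance-based classifiers need little more than Euclidean distance, here is a minimal sketch of a naive k-nearest-neighbors classifier in plain Python. The function names (`euclidean_distance`, `knn_predict`) and the toy data are made up for this example and are not taken from any course in the curriculum; a real course would more likely use a library implementation.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank every training example by its distance to the query point.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    # Count the labels of the k closest examples and return the most common.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (2.0, 2.0)))  # nearest neighbors are all "small"
print(knn_predict(points, labels, (6.0, 7.0)))  # nearest neighbors are all "large"
```

This version sorts the whole training set for every query, which is fine for a first exposure but is not how production libraries do it.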

waciumawanjohi commented 4 years ago

Your first point is well taken. And I suspect easily addressed. The curriculum has essentially two parallel tracks: Topic Progression Graph

As stated in the draft:

> Order of the classes
> Some courses can be taken in parallel, while others must be taken sequentially.
> All of the courses within a topic should be taken in the order listed in the curriculum.
> The graph below demonstrates how topics should be ordered.

It sounds like your first point could be addressed by simply listing the computer science courses first. The very course that you mention, MIT's 6.00.2x, is already part of the introduction to computer science group.

On your second point, there are also practical tools and methods courses added:

Data Science Tools & Methods

I'm certainly open to suggestions for changing these courses for other ones, or for adjusting their place in the curriculum.

bradleygrant commented 4 years ago

@waciumawanjohi Thank you for your responses here. I should clarify, I was looking at the current V1 curriculum while developing my comments above. I'll take a closer look to see how these ideas are proposed to be implemented in the V2 and comment further.

waciumawanjohi commented 4 years ago

Great! It sounds like we're thinking in similar directions.

EWCunha commented 4 years ago

Can I start my first course from the V2 Curriculum or should I wait a little longer?

waciumawanjohi commented 4 years ago

@EWCunha I would recommend any student that's starting now use V2.

bradleygrant commented 4 years ago

I have reviewed the proposed CGUPDS curriculum guidelines and the candidate V2. Overall, I like the thrust and structure of the new program and I have no exceptions or recommendations for substantive changes at this time. These are good selections, and seem to fit the curriculum guidelines well.

A couple of comments:

waciumawanjohi commented 4 years ago

Close of the Comment Period

Findings: The proposal does not prepare students to work in Python or Java. The proposal neither addresses a capstone experience nor recommends work in other disciplines.

Response: We should absolutely give students an approachable onramp for learning the languages of instruction. For Python, it makes sense to use Py4E, which is also the intro to programming class for the CS curriculum. For Java, the U Helsinki course is high quality and free. It has long been part of the CS curriculum's extras pages. The textbook Head First Java was mentioned, but it is not a free text.

CGUPDS makes mention of needing a course in another discipline, but gives this recommendation not even a paragraph of support. It strikes me as similar to a recommendation for a balanced liberal arts education. And while I highly value such an education, that's different from the goal of OSSU. OSSU supports the study of particular domains and leaves the rounding out of other domains as an exercise for the learner.* As such, no work in other disciplines is contained in this revision.

OSSU should recommend how students can undertake a capstone experience. I don't have an answer for this question at the moment. This is left undone. I hope that contributors can propose and discuss options, either in the Issues here or in the OSSU Discord.

Conclusion: The proposed changes will be merged in with the addition of intro courses for programming in Python and Java.

waciumawanjohi commented 4 years ago

...In going to add Py4E to V2, I noticed that it is already in the curriculum. oof.