uc-python / advanced-python-datasci

Advanced Python for Data Science Workshop
https://uc-python.github.io/advanced-python-datasci/
MIT License

Outline revised curriculum #16

Closed bradleyboehmke closed 3 years ago

bradleyboehmke commented 3 years ago

We have decided to modify the proposed outline and curriculum to focus on end-to-end [machine learning / data science] project development, with 50% of the time spent on the machine learning aspects and 50% on the technology and programming that go with it.

We envision the workshop exposing people to the machine learning workflow along with the technology topics required to implement that workflow in a typical project. Most importantly, the students will implement these workflows themselves, so they walk away from the workshop with a fully built-out project.

This issue serves as a place for us to propose and finalize what this new curriculum/agenda should look like.

eswan18 commented 3 years ago

Along with the schedule, something we should also decide is what data set we want to work with. We use the planes data for the first two classes, and it's already pretty clean, so I don't think that's a good candidate.

One site I discovered recently has a catalog of ML datasets, so it might be worth looking over: http://archive.ics.uci.edu/ml/datasets.php

bradleyboehmke commented 3 years ago

The UCI repo has been a classic for some time. A couple other considerations:

I was also thinking about trying to find a continuous realtime data feed. The end-to-end project could be creating a program that runs daily, pulls the new data, preprocesses it, checks for significant changes in the data distribution, and then runs the model.
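To make the daily idea a bit more concrete, here is a minimal sketch of one pass through that loop. It assumes a single numeric feature and uses a two-sample Kolmogorov-Smirnov statistic as the drift check, which is just one possible choice; nothing here has been decided. The names `ks_statistic`, `has_drifted`, and `daily_run`, the callables passed in, and the 0.2 threshold are all hypothetical placeholders.

```python
from bisect import bisect_right


def ks_statistic(reference, new):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs.

    Assumes both samples are non-empty sequences of numbers.
    """
    ref_sorted = sorted(reference)
    new_sorted = sorted(new)
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to compare them at those values.
    points = ref_sorted + new_sorted
    return max(
        abs(
            bisect_right(ref_sorted, x) / len(ref_sorted)
            - bisect_right(new_sorted, x) / len(new_sorted)
        )
        for x in points
    )


def has_drifted(reference, new, threshold=0.2):
    """Flag drift when the KS statistic exceeds a hand-picked threshold.

    The default of 0.2 is an arbitrary illustration, not a recommendation.
    """
    return ks_statistic(reference, new) > threshold


def daily_run(reference, fetch_new_data, preprocess, run_model):
    """One daily pass: fetch, preprocess, check for drift, then run the model.

    fetch_new_data, preprocess, and run_model are hypothetical callables
    supplied by the project.
    """
    new = preprocess(fetch_new_data())
    drifted = has_drifted(reference, new)
    predictions = run_model(new)
    return drifted, predictions
```

In a real project the drift flag would probably trigger a warning or a retraining step rather than just being returned, and the scheduling itself (cron, Task Scheduler, or a cloud job) is a separate question.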

eswan18 commented 3 years ago

I guess my opinion really depends on how much you already have prepared around the Ames dataset. Will that make it significantly easier to make the sections on modeling? If so, for simplicity that seems like the best option – though it's really up to you, because I think the time savings would mainly be in preparing your sections.

If not Ames, the Complete Journey seems like a great option. I was totally unaware of it somehow, but we know the schema inside and out, and it's already clean – so we don't need to spend a lot of time on wrangling, which was already covered thoroughly in the first two classes.

While I like the realtime idea, I think it presents a lot of additional complexity. Students wouldn't get to see the "daily" nature of it much during the course itself, and running something daily on a laptop that could be off or sleeping is tricky (plus complexity around scheduling on different OSes). The cloud would be a good option for that but is an enormous can of worms.

In short: I think we should use either Ames or the Complete Journey; the former if it gives us a significant headstart on content prep, otherwise the latter, and I'd defer to your judgment on that.

bradleyboehmke commented 3 years ago

Brandon and I used the Ames data extensively in our book and in our ML with R workshop. I have already translated some, but not all, of this content to Python and scikit-learn. Having this content lets us understand what to expect when translating the remaining parts from R to Python.

That said, the Complete Journey content would be fun considering it is newer, but I honestly don't know what to expect as far as predictive performance, so it would take some playing with to make sure I have my arms around the data and how well models work with it. On the other hand, it would be truly original content that could open the door for other opportunities, including bringing this workshop in-house 🤷‍♂️.

eswan18 commented 3 years ago

I think ultimately it's your call – whichever we choose won't make a big difference in the technical aspects (Git, project structure, coding style). If you want to tackle the 84.51° data, I'm absolutely game, but if you would prefer to take advantage of the Ames material you already have, then that makes perfect sense to me. You can think it over if you want and let me know.

bradleyboehmke commented 3 years ago

Proposed agenda & division of labor:

Ethan

Brad

@eswan18, I made a few slight revisions from your email

Agenda

eswan18 commented 3 years ago

I'm going to close this @bradleyboehmke – but if you think there's more to do yet, feel free to reopen.