pointOfive / STA130_F23

Python/JupyterHub implementation of this UofT classic

Homework and Tutorial 8 #19

Closed mistryrohan closed 10 months ago

mistryrohan commented 1 year ago

Adding in the old (unfinished) version of homework 8 for the sake of opening a new PR. Will be adding all the files when updates are made to them.

mistryrohan commented 12 months ago

Some concerns/comments:

  1. Tutorial 8 will need a tutorial assignment question, since it was previously the presentation on ethics, which has been moved to earlier weeks (I'm happy to make a draft if you'd like; it seems pretty fun).
  2. Tutorial 8 just needs more content overall: a lot of it is already covered in the homework, and there isn't much to adapt from previous years' material.
  3. Good thing for now is that we can just focus on editing the homework and finishing that so we know what new things to add to the tutorial instead
  4. I think the student version should be pushed after the tester is perfect, so it's just one change of removing the solutions and testing them on Markus (more efficient than last time).
  5. I'm not a fan of the dance question in the old pset, BUT I do like the data (since I contributed to it in that year lol) and the model-making questions, so I can turn that into what was done in the videos, where predictions (y_hat) were plotted against actual y values. It makes a good transition into the R^2 tutorial discussion (and perhaps the RMSE discussion too).
  6. Noticed that in your version of the lecture there were more images and things related to RMSE. I think that would be perfect for tutorial, but I need your opinion on that first.
  7. Would you be open to adding a very small intro to splitting data into train-test sets in my tutorial? Then we can use the same dance data and relate it to the homework. It becomes a thing where "you used plots in the homework, but here is a much better way we will be learning about soon".
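
For reference, a minimal sketch of the idea in item 5 (comparing predictions y_hat with actual y values, then moving to R^2 and RMSE). It assumes synthetic data rather than the dance data, uses only NumPy, and all variable names are hypothetical; it is not the homework's actual code.

```python
import numpy as np

# Synthetic stand-in data (an assumption; this is NOT the dance data).
rng = np.random.default_rng(130)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)

# Fit simple linear regression by ordinary least squares.
X = np.column_stack([np.ones(n), x])  # intercept column plus the predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta  # fitted values: the quantity to plot against y

# R^2 from its sum-of-squares definition ...
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_from_ss = 1 - ss_res / ss_tot

# ... and as the squared correlation between y and y_hat.
r2_from_cor = np.corrcoef(y, y_hat)[0, 1] ** 2

# RMSE: the typical size of a prediction error, in the units of y.
rmse = np.sqrt(np.mean((y - y_hat) ** 2))

print(f"R^2 (sum of squares):     {r2_from_ss:.4f}")
print(f"R^2 (squared cor(y,y_hat)): {r2_from_cor:.4f}")
print(f"RMSE:                     {rmse:.4f}")
```

For a least-squares fit that includes an intercept, the two R^2 values agree up to floating-point error. A scatter plot of `y_hat` against `y` (for example with matplotlib's `plt.scatter`) would give the visual version of the same comparison.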

mistryrohan commented 12 months ago

Added STA130_F21_songrecommendations.csv to the data folder.

pointOfive commented 12 months ago

Lots of detailed comments from me orienting myself to things here and responding to your comments, as usual; but, just work through them one by one as you've done (well) in the past :)

  1. Homework/STA130_HW_7_tester.ipynb file is updated with the message "31,459 additions, 369 deletions not shown because the diff is too large. Please use a local Git client to view these changes." Before I follow those instructions, can you, roughly speaking, describe your intention with the updates you've made to the last tester file? [You might just catch me up on the process we were doing if I've lost track of that -- I wasn't necessarily expecting any additional changes on the old tester file, but perhaps I forgot and should have been?]
  2. I'd like to see your draft/proposal for a Tutorial 8 Assignment. I'd expect I'd like your suggestion here.
    • I will soon see how the homework looks; but, generally, I'd like a tutorial "lesson" to perhaps start with R^2=cor(y,y-hat)^2 as you've suggested, then move on to RMSE (as my lectures have done) and adopt the code that got "skipped" in the week 7 tutorial that visually shows RMSE for simple linear regression, and then extend it to multiple linear regression, and then move into the notions of overfitting and training/test analysis, where we can show that R^2 can be made to be 1 (just as RMSE can be made to be 0) in a training data set, but that it is the test data set where we can observe generalization.
    • As mentioned, Cole's week 9 (and 10) are less concerned with questions of overfitting and generalization; so, this is content that I want us to pursue in week 8 to the best of our ability given the time/space limitations we'll have there.
    • You might consider reviewing my other comments and our other exchanges again as well to make sure that my intentions above are lining up consistently with our progress and progression of development so far
    • I am going to share the course project with you (soon)... it essentially amounts to model building variable selection exploration with a specific orientation towards examining and interpreting specific interactions... honestly, it might be worth trying to introduce everything above with the course project data exactly...
  3. Do you have any draft of tutorial 8 at this point yet? It's not in the commits; but, no worries, as I think my comments above suggest that having a clean slate for tutorial 8 is perfect: we have plenty of material that we can put into the tutorial as far as I'm concerned.
  4. Agreed -- let's see what the hw finalizes as; and, then, what the tutorial should be will likely become pretty clear
  5. Agreed -- good plan
  6. I like your orientation and idea here. I've tended to really like the proposals that you all have made. What happens when I get content ideas from everyone is that I think "this is good, but it needs to be fleshed out and the content thickened so the reading is comprehensive and standalone and/or the tutorial is fully scripted and content-full, so the TAs don't need to improvise or come up with anything themselves"; although, I'm happy for the TAs to "decide" how to use the tutorial material themselves... i.e., what they emphasize and dwell on and what they say "this is here, have a look closer if you want".
  7. Agreed -- yes, exactly, that's always been my feeling about RMSE since I decided there wasn't room for what got made for Tutorial 7. That was because I decided starting with correlation and spending time on that in tutorial was important; but, it then meant that there wasn't time to go into the RMSE in week 7 tutorial. Especially since I decided I wanted to focus on interpretation material (and the indicator variables) and such
  8. Yes, it has also become my plan/expectation that we need to motivate and introduce train/test in week 8 tutorial, because week 9 material seems to mostly just assume it and doesn't dwell tremendously on motivating it. This was some of my discussion in our previous exchanges (probably both in previous PR conversations, as well as perhaps slack DMs).
    • I do not expect to introduce train/test in lecture, as I'll more be motivating model building, interactions, etc.
    • It's good we've already introduced the idea of interactions in the week 7 HW, and indicators a little bit there as well; and, indicators a little bit more extensively in the week 7 tutorial; so, I should be able to lecture on these things comfortably; and, it's just a "this is how you code this up". I will likely also discuss R^2 and make some related comments about generalizability; but,
    • I do think tutorial is the right place to go a little deeper on this; so,
    • we are indeed looking to do a lot in tutorial 8... [leave model building/interpretability stuff up to me in lecture, and] focus on overfitting/generalization ideas for the tutorial
    • In some sense it feels like "shouldn't this belong in decision trees / classification / (machine learning)?" But, actually, honestly, I think it's more of a "deep" / "understanding" type of concept that is relevant for multiple linear regression; and, so, tutorial indeed is the location where I'd like this kind of content presentation to go
    • Cole's week 9 tutorial starts to dive into the ethics of FP/FN decision making; and, the week 9 homework goes into other things, such as "feature importances", which, I think, is why the emphasis on train/test generalization hasn't really been the main point of week 9 materials, and why it has ended up feeling like it should belong in week 8.
    • Also, it's okay/probably best if train/test generalization is not a part of the week 8 homework, actually... I think using the homework to have students practice just doing multiple linear regression, and model building, and maybe interpreting predictions is best... so hopefully that's what I'll be seeing as I next take a first look at the proposed homework!
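
The tutorial arc described in item 2 (training R^2 driven to 1 and RMSE to 0, with the test set showing whether the model generalizes) could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the "overfit" model is built by adding pure-noise predictors until there are as many coefficients as training rows, and the helper names (`make_data`, `r2`, `rmse`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)

def make_data(n, n_noise=18):
    """Simulate y from ONE real predictor, plus columns of pure noise."""
    x1 = rng.normal(size=n)
    noise = rng.normal(size=(n, n_noise))
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, noise])  # intercept, x1, noise columns
    return X, y

def r2(y, y_hat):
    # Measured against the mean of the y passed in; can be negative on test data.
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

X_train, y_train = make_data(20)   # small training set
X_test, y_test = make_data(200)    # fresh data from the same process

results = {}
# "simple" uses intercept + x1; "overfit" uses all 20 columns for 20 training rows.
for name, k in [("simple", 2), ("overfit", 20)]:
    beta, *_ = np.linalg.lstsq(X_train[:, :k], y_train, rcond=None)
    results[name] = {
        "train_r2": r2(y_train, X_train[:, :k] @ beta),
        "train_rmse": rmse(y_train, X_train[:, :k] @ beta),
        "test_r2": r2(y_test, X_test[:, :k] @ beta),
        "test_rmse": rmse(y_test, X_test[:, :k] @ beta),
    }

for name, metrics in results.items():
    print(name, {key: round(float(value), 3) for key, value in metrics.items()})
```

With 20 coefficients and 20 training rows, least squares can pass through every training point, so the training metrics look perfect regardless of whether the extra predictors mean anything. The test rows are the only place where that can be detected.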

pointOfive commented 12 months ago

One more comment to add to those above: model building with p-values is the usual statistical approach; so, this needs to be presented/contrasted with the RMSE train/test idea as well.

pointOfive commented 12 months ago

hw8_tester

I will move over to work on Matthew's PR and the course project so that I can share with you what that's going to be. That I think will inform your thinking about if we can include and use some of that in the homework and/or tutorial; and, anyway, how we could orient our materials here to help the students prepare and be ready for what's asked of them for the final project.

I do quite like how this homework is shaping up... I imagine you'll next move into interactions...

mistryrohan commented 11 months ago

Thought it would be a good mental exercise to get a draft of the tutorial assignment question before going to bed. Here it is (also am still working on the tutorial slides):

As a first-year student exploring the vast amounts of opportunities university has to offer, you decide to join the basketball team (a friendly reminder to get involved in extracurriculars and events!). The coaches get to know you more and find out that you are studying statistics. Since the team is currently training for a provincial competition, the coaches have been collecting significant amounts of data and want to analyze the key factors influencing the team's performance. The coaches have a breadth of numerical data on shots, rebounds, assists, player experience, and player sleep. Also, they have categorical data on pre-game routines, off-court practice, health history, and player nutrition. They believe the more complicated model will allow them to fix all sorts of small issues in their team to help them perform at their best.

You explain how you have learned about multiple linear regression and techniques for creating a reliable model. The coaches only know a little about simple linear regression and are interested in learning your process for creating and selecting an appropriate model. Your task is to show an overview of this process, including the practical implications of your potential findings and what the coaches can do to support their players. You should write down some hypothetical equations, explain any transformations needed in the data, and explain the differences between simple and multiple linear regression. Do not be afraid to use technical statistical terms, but be sure to explain their meaning in simple and understandable ways that would help a non-statistical audience make sense of what you're talking about.

pointOfive commented 11 months ago

I'm pausing here to comment that this whole sequence is outstanding. This is exactly the way I want these homework assignments to go... this really helps guide the students through the use and concepts of things here... just really fantastic

pointOfive commented 11 months ago

continuing

I'm liking where this all seems to be going, but/and, I have a couple of comments on what I'm hoping/expecting to see:

Do we/Can we add some model assumption checks?
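
As one possible starting point, here is a hedged sketch of numerical stand-ins for the usual assumption checks (residuals-versus-fitted plot, histogram or Q-Q plot of residuals). It assumes simulated data and hypothetical variable names, and the summaries are crude; the plots themselves would normally be what students look at.

```python
import numpy as np

# Simulated data (an assumption) that satisfies the linear model assumptions.
rng = np.random.default_rng(82)
n = 500
x1 = rng.uniform(0, 10, size=n)
x2 = rng.normal(size=n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 1, size=n)

# Fit multiple linear regression and compute residuals.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# Constant variance: residual spread should be similar for low and high fitted values.
order = np.argsort(fitted)
low_half = residuals[order[: n // 2]]
high_half = residuals[order[n // 2 :]]
spread_ratio = high_half.std(ddof=1) / low_half.std(ddof=1)  # near 1 if constant

# Normality: skewness and excess kurtosis should both be near 0.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3

# Linearity: residuals should show no leftover curvature in a predictor.
curvature_cor = np.corrcoef(residuals, (x1 - x1.mean()) ** 2)[0, 1]

print(f"spread ratio (high/low fitted): {spread_ratio:.3f}")
print(f"skewness: {skewness:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
print(f"cor(residuals, centered x1 squared): {curvature_cor:.3f}")
```

Because these data were simulated to satisfy the assumptions, the summaries should look unremarkable. Rerunning with, say, noise whose spread grows with `x1` would show how the spread ratio moves away from 1.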

Continuing...

pointOfive commented 11 months ago

Can we add a small little segment that discusses that an observation is a row which can obviously be multivariate and have many measurements?

... this is something that could be introduced straightaway with data frames (but I don't think I thought to do this); but, I don't think it's necessarily that relevant at that point in time; whereas, it becomes relevant in linear regression; and/but, I think it's okay if we wait until multiple linear regression as opposed to simple linear regression to introduce this idea...
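
A small sketch of what such a segment might show, assuming made-up values and hypothetical coefficients; the column names simply echo the variables listed in the draft tutorial question earlier in this thread.

```python
import numpy as np

# Made-up values for illustration only. Each ROW is one observation;
# each COLUMN is one measurement taken on that observation.
columns = ["shots", "rebounds", "assists", "sleep_hours"]
data = np.array([
    [12, 5, 3, 7.5],
    [8, 9, 1, 6.0],
    [15, 2, 6, 8.0],
    [10, 7, 4, 5.5],
])
n_observations, n_measurements = data.shape

# A single observation is multivariate: one row, several measurements.
first = dict(zip(columns, data[0]))
print(f"{n_observations} observations, {n_measurements} measurements each")
print("first observation:", first)

# Hypothetical coefficients, chosen only to show the arithmetic.
intercept = 2.0
coefs = np.array([1.1, 0.4, 0.8, 0.5])

# Multiple linear regression turns each row into ONE prediction:
# intercept + sum of (coefficient * measurement) across that row.
prediction_first = intercept + data[0] @ coefs
predictions_all = intercept + data @ coefs
print("prediction for first observation:", prediction_first)
print("predictions for all observations:", predictions_all)
```

In simple linear regression each row contributes a single x value, so the row-as-observation idea is easy to overlook; with several predictors the row structure is what the prediction formula operates on.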