pedal-edu / pedal

A collection of tools to analyze students' Python source code
https://pedal-edu.github.io/pedal/
MIT License

Reconsider gently/explain's score mechanism #122

Open · acbart opened 7 months ago

acbart commented 7 months ago

If you attach a score to an explain or gently, their negative valence means that the score is added when the feedback is NOT triggered. However, these two feedback functions are typically only generated when they are meant to trigger. In theory, folks should be using the activate field to make sure the data is attached regardless. But I think we need to consider carefully whether all of that is too confusing for users. It is also possible that we just need to make sure people understand when to use compliment and give_partial.
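
To make the current behavior concrete, here is a small sketch (assuming the usual `from pedal import *` setup and the `score` string format used later in this thread):

```python
from pedal import *

# Counterintuitively, because gently() has negative valence, this
# score is applied only when the feedback does NOT trigger, i.e.
# when the student avoided the mistake.
gently("Avoid using eval() here.", score="-10%")
```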

lukesg08 commented 7 months ago

I think the intuitive assumption is that if feedback with negative valence is triggered, it would subtract its score (or add its negative score) when triggered, since negative valence means something is incorrect. Quoting from the documentation (emphasis mine):

> The valence of this feedback. Positive feedback is meant to be encouraging, while negative feedback is meant to be discouraging. Neutral feedback is meant to be informative. This has serious impact on how the feedback's score is interpreted. **Negative valence will have the score subtracted from the total, while positive valence will have the score added to the total.**

The activate field seems like a weird extra step. We might also need to elaborate in the documentation on how this field SHOULD be used (it's not clear to me how, either).

acbart commented 7 months ago

I'll admit the activate mechanism does feel a little contrived and unintuitive. The concept behind it is that you should always create a feedback object for a given check, regardless of whether it's supposed to activate. This flips instructor control logic away from imperative if/elif/else chains and towards more declarative assertions. From the research side, we get clearer records of what feedback was checked, not just what was activated (which I think is a more appropriate data model):

```python
# Imperative style: the feedback object only exists when it triggers.
if student_code_has_error():
    gently("Feedback message")
# Declarative style: the feedback object always exists; activate decides triggering.
gently("Feedback message", activate=student_code_has_error())
```

It's a different mental model than most of what we've done, and I think folks will struggle with it. But the docs should do a better job of explaining it (and I should write up some clear examples where it ends up paying off). At this point, it's a hypothetical best practice, not a definite one...

But that aside, the more I think about it and look at the code, the more I agree with your overall point. I believe I flipped the intention halfway through designing this particular subsystem (partially described in #54). This has led to an inconsistency: the logic in the code is contrary to the documentation you've highlighted. Currently, if you attach a score to a gently/explain and it is activated, the points are not applied, because negative feedback objects only apply scores when they are NOT triggered (suggesting the mistake did not occur). I don't think I'd noticed because most of the time I'm either checking correctness or explicitly giving partial credit, rather than subtracting points.

Obviously, we need to make the documentation align with what the system does. But what should the system do? Here are some different cases, to help think through what folks expect:

```python
# Negative feedback with a negative score
gently("", score="-10%")
# Negative feedback with a positive score
gently("", score="+10%")
# Positive feedback with a negative score
give_partial("-10%")
# Positive feedback with a positive score
give_partial("+10%")
```

At times like these, I wish I could pull a Stefik and put these in front of a bunch of feedback authors, to ask them what they think the outcomes of these different situations would be. Or, even better, collect some actual cases of what they are trying to write. I'm familiar with the model of "positive feedback gives points" and "incorrect feedback subtracts points". Are there other models? What do people expect when I write something like `ensure_ast("For", score="+10%")` or `prevent_ast("While", score="-10%")`? My mind goes in circles when I think about them.

I'd be interested in your thoughts on how you (and others) expect this feature to work. I suspect that I should just take whatever everyone else's intuition says and make that what the system does.

acbart commented 7 months ago

... Would having penalize and reward parameters, instead of a single score parameter, help this problem at all?
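
Something like this, perhaps (hypothetical parameters, not current Pedal API):

```python
# Hypothetical: the parameter name carries the intent, so authors
# never have to reason about valence/sign interactions.
gently("Avoid while loops here.", penalize="10%")
compliment("Nice use of a for loop!", reward="10%")
```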

lukesg08 commented 7 months ago

Maybe, but if we think in terms of a traditional rubric, our point-oriented system resembles either an analytic rubric or a single-point/checklist rubric. If you break down an analytic rubric, you can arrive at a potentially long single-point rubric. Taking the discussion from there, we are basically asking a yes-or-no question: do they meet a particular criterion? From that perspective, it's really just about meeting a specific criterion; how much weight we put on a criterion determines how many "points" that item is worth. Whether we add or subtract points is simply a matter of preference and impression. At the end of the day, we are just checking the checkbox and associating a point value with that checkbox.

Value is used as a potential parameter in the compliment documentation (https://pedal-edu.github.io/pedal/teachers/tools/core.html), which kind of alludes to that idea: feedback simply has a valence, and triggering the feedback modifies the overall score appropriately based on that valence. We might also think of valence as a global setting to ensure consistency... but I think that just brings additional complications.
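
As a rough illustration of that checklist framing (purely illustrative, not Pedal code):

```python
# A single-point/checklist rubric: each criterion is a checkbox with
# a weight, and the score is the weighted sum of the checked boxes.
criteria = [
    ("uses a for loop", True, 0.10),
    ("avoids banned functions", True, 0.30),
    ("passes the unit tests", False, 0.60),
]
score = sum(weight for _, met, weight in criteria if met)  # 0.40
```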

acbart commented 7 months ago

Debugging logic with the unit_test function today, I broke some tests with the new scoring code I had been trying out. I was adding points for passed test cases and subtracting points for failed test cases, so passing 3/5 tests meant you got a 20% instead of a 60%, because it was "60% - 40%". I'm not really awake enough right now to figure this out, but I understand there are situations where we want to add points and situations where we want to remove points. We probably shouldn't make the same feedback objects do both at the same time. I think this is very much the same problem this article tries to explain: https://community.canvaslms.com/t5/Higher-Ed-Canvas-Users/Understanding-Multiple-Answers-Questions/ba-p/263903
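
Worked out, the arithmetic looks like this:

```python
passed, failed = 3, 2
per_test = 0.20  # each of the 5 tests worth 20%

intended = passed * per_test                    # 0.60 (3/5 tests)
actual = passed * per_test - failed * per_test  # 0.20 ("60% - 40%")
```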

lukesg08 commented 7 months ago

I thought that was the point of the valence flag? Negative valence means that activating this feedback is bad, and thus points should be deducted; positive valence means this feedback is good, and points should be added. I think unit tests would probably fall under negative valence: if they "fail", they activate their feedback, which deducts points, since failing unit tests means there should be corrective feedback.

As far as combining negative and positive feedback to contribute to the score, that might be a little more wonky. My gut instinct is that all the negative-valence feedback would deduct from a sum total, and all positive-valence feedback would add to a sum total. E.g., if 60% of the points were unit tests and 40% were compliments, students would actually start off with "60%", and points would be deducted for every failing unit test that has a point value. Then, with compliments, each compliment that has a point value would add up on top of that.
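
A quick sketch of that model (illustrative numbers only):

```python
# Students start with the full unit-test pool; failing tests deduct
# from it. Compliments with point values add to the remaining 40%.
test_pool, per_failed_test = 0.60, 0.12
compliments_awarded = [0.10, 0.10]

failed_tests = 2
score = (test_pool - failed_tests * per_failed_test) + sum(compliments_awarded)
# 0.60 - 0.24 + 0.20 = 0.56
```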