slatex / sTeX

A semantic Extension of TeX/LaTeX

Answer Class Annotation Specification #398

Open lambdaTotoro opened 10 months ago

lambdaTotoro commented 10 months ago

During today's Systems meeting it became clear that we need to better pin down exactly how and what needs to be annotated in learning objects (especially quiz questions) for sensible learner model updates.

Here's what I think:

In the following, I'll only consider single / multiple choice questions, but with not too much extra work, this can be extended to clozes and numerical inputs. IMHO, an answer class annotation should contain the following information for quality LM updates:

Note that one "answer class" does not generally correspond to one of the quiz question choices. That might often be the case — especially for single choice questions — but we also want the possibility of multiple student inputs mapping to the same answer class (e.g. "both answers are wrong because of wrong order of operations") or even one input triggering multiple answer classes (e.g. "contains errors in arithmetic with fractions and confuses mean and median").

In my mind, this has been quite stable for a while but of course I am open to feedback of any kind. Also compare #391 where the implementation side is being discussed.

kohlhase commented 8 months ago

Here are my 2c on the problem.

I think this is (at least for the moment) over-engineered and too complex for the current use case of simple MCQs and SCQs. I agree that we will (in the long run) need the ability to specify answer classes for any combination of choices in MCQs, but for the moment we do not have the experience or the examples to support this. For the moment, let's make do with the data we have in the problems.

What I currently see in the quizzes is the following data per problem:

So the following update strategy suggests itself:

That leaves us to define what a "boost" might be. Here is my initial idea:

A boost O = <c,d,w> with w \in [-1,1] updates <c,d,v> to the new value v' = ((k-1)v + w)/k, where k is the number of previous updates. With this formula, each successive boost has a smaller influence.

We can think of this as the default behavior that more specific answer patterns can override if we specify them. At least this is something I would start experimenting with: only if we have a default can we decide whether the hassle of specifying a different behavior is worth it.
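
For concreteness, a minimal executable sketch of this default boost (with k the number of previous updates to the <c,d> entry, as discussed below; none of this is existing ALeA code):

def apply_boost(v, w, k):
    """Default boost: update value v with weight w in [-1, 1].

    v' = ((k - 1) * v + w) / k, where k counts previous updates,
    so each successive boost has a smaller influence.
    """
    assert -1.0 <= w <= 1.0 and k >= 1
    return ((k - 1) * v + w) / k

# Example: three boosts in a row show the shrinking influence.
v, k = 0.0, 1
for w in (1.0, -1.0, 0.5):
    v = apply_boost(v, w, k)
    print(k, round(v, 3))  # 1 1.0 / 2 0.0 / 3 0.167
    k += 1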

lambdaTotoro commented 8 months ago

I hear you on the front of "this is over-engineered for SCQs and MCQs" and I'm happy to go with something simpler for now, but I do think that we will eventually want something very close to that, at the latest for questions in guided tours.

The "correctness coefficient" you introduced I also think could be useful. However, I'm a bit apprehensive about logging a million and one "boosts" that we'll have to sift through (or count, at the very least) again and again for every future update. I have an alternative in mind, though. We already brainstormed about this once during a jour fixe: we could just introduce another dimension for a database entry, something like a "certainty" value ∈ [0,1] that would start very low and grow with the amount of information we have about a learner's knowledge. That could take the place of the k in your equations.

We said yesterday that we're happy to treat "prerequisites" as "dependencies of objectives" (at least for now), so we'd only ripple-update the objectives with, say, a hardcoded value multiplied with the CC (multiplied with some function of the certainty value). From there, we can start doing fancier things later.
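
One possible reading of this, as a rough sketch only: the base step, the growth of the certainty value, and all names are my assumptions, not an agreed design.

BASE_STEP = 0.1  # hypothetical hardcoded update magnitude

def update_objective(value, certainty, cc):
    """Ripple-update one objective entry <concept, dimension>.

    value     -- current learner-model value, in [0, 1]
    certainty -- how much evidence we already have about this learner, in [0, 1]
    cc        -- correctness coefficient of the given answer, in [-1, 1]
    """
    # The less certain we are, the larger the step; high certainty damps updates.
    step = BASE_STEP * cc * (1.0 - certainty)
    new_value = min(1.0, max(0.0, value + step))
    # Each observation adds a bit of certainty, replacing the explicit count k.
    new_certainty = min(1.0, certainty + 0.05)
    return new_value, new_certainty

print(update_objective(0.5, 0.2, 1.0))  # roughly (0.58, 0.25)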

So, for any given question, my script (to be run by hand for now) would need:

Hundecode commented 8 months ago

lambdaTotoro: "I think we want to be able to specify answer classes that say 'student understands concept 1 well enough but clearly doesn't understand concept 2'."

That is exactly what we do NOT want. Answer classes are based on purely objectively observable criteria. Whether a learner understands something cannot be objectively observed and therefore may not be an answer class.

Jazzpirate commented 8 months ago

then let's call them something else. What was it? "Feedback class"? ;) The question is whether we want to be able to specify feedback classes that reflect probable causes of given answers. We could say "No, that's not observable" - fair, but then we can't update the user model accordingly. Or we do exactly that, in which case it's likely that we often/occasionally misdiagnose an answer and update the model wrongly.

Both options have advantages and disadvantages.

Hundecode commented 8 months ago

Here are a few more thoughts on my part:

How the learner model should be triggered, based on an answer class, has no place in the answer class itself, in my opinion. Answer classes only give information about the state of an answer. How and when the learner model is updated should not be part of the answer class concept, and it should also, to a certain extent, be something the task creator / educator can influence.

But it also does not belong in the task itself, in my opinion, because you can use a certain task in different contexts. That's why I would possibly distinguish between task and learning object. The task is the problem itself: an initial state that must be transformed into a target state. The learning object, on the other hand, contains further information, e.g. which task it uses and how, and for which answer classes, the learner model is updated. Of course, the latter step can also be partially automated, e.g. for each concept C occurring in the task there is a boost to <C, remember> if the task was processed correctly.
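
A minimal sketch of that task / learning-object split as data, including the default rule just mentioned (the dataclass names and fields are purely illustrative, not an existing schema):

from dataclasses import dataclass, field

@dataclass
class Task:
    """The problem itself: an initial state to be transformed into a target state."""
    statement: str
    choices: list
    concepts: list        # concepts occurring in the task
    answer_classes: dict  # choice -> set of triggered answer classes

@dataclass
class LearningObject:
    """A task placed in a context, plus the update policy for that context."""
    task: Task
    objectives: list                              # [(concept, dimension)]
    updates: dict = field(default_factory=dict)   # answer class -> [(concept, dimension, weight)]

    def default_updates(self):
        # Default rule: for each concept C occurring in the task,
        # boost <C, remember> if the task was processed correctly.
        return [(c, "remember", 1.0) for c in self.task.concepts]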

I absolutely agree with Michael that the boosts need to depend on the learner model. The same task with the same answer can trigger different boosts for different learner models.

kohlhase commented 8 months ago

The "correctness coefficient" you introduced I also think could be useful. However, I'm a bit apprehensive about logging a million and one "boosts" that we'll have to sift through (or count, at the very least) again and again for every future update.

I am apprehensive about this as well, but we somehow have to take the "interaction history" into account. I realized that for my formula, we only need the number k of previous updates. That may be a reasonable "sureness" coefficient.

Hundecode commented 8 months ago

The question is whether we want to be able to specify feedback classes that reflect probable causes of given answers. We could say "No, that's not observable" - fair, but then we can't update the user model accordingly. Or we do exactly that, in which case it's likely that we often/occasionally misdiagnose an answer and update the model wrongly.

The educator specifies what the objective is for each learning object. So e.g. <Array, Understand>.

We don't need specific answer classes to examine that. If the task has been solved to some extent, then we can assume some understanding, because the objective has been met.

I think the dimensions Remember and Apply can be handled easily. If an array occurs as a concept in the answer and the answer was correct, then it can be said with some degree of certainty that the learner applied this concept successfully.

Dimension "Remember" analogously, if the concept occurred in the task description or solution.

kohlhase commented 8 months ago

I think the dimensions Remember and Apply can be handled easily.

But in many situations, apply is the most important one to be tested by problems.

Jazzpirate commented 8 months ago

The point of this issue is to solve the problem of updating the learner model based on answers given by users on problems. The terminology is not the problem. If you prefer, call them "feedback classes" instead. The distinction makes sense, don't get me wrong, but it's not relevant for the problem at hand.

If we want to distinguish between answer classes and feedback classes in the tex sources, we might have a more pressing problem at hand. If we reach the state where a problem is three lines of tex code, but all the boilerplate and annotations add 10x that because we have to specify both answer classes and feedback classes, and connections between them, and individual updates for every objective/prerequisite × feedback class, then the effort of annotating all of that will explode in comparison to the utility. So there is also a point to be made in favor of being imprecise but short vs. being precise but elaborate.

The learning object, on the other hand, contains further information, e.g. which task and how the learner model is triggered, at which answer class.

For the purposes of the problem at hand, a learning object is a tex fragment. All information that needs to be provided for purposes of user model feedback currently has to go into the tex sources. We do not have the infrastructure or a model for having "multiple learning objects for one task", so pragmatically, they're equivalent. Again, though, we might want to keep in mind that, in the long term, we may want to take "context" into account for user model updates. But for now, we don't have the prerequisites to do that, so let's not let that keep us from getting somewhere.

Hundecode commented 8 months ago

But in many situations, apply is the most important one to be tested by problems.

If so, then these should be annotated as Objectives in the Learning Object.

Case 1: the answer is correct. Then the learner model can be boosted.

Case 2: the answer is not correct. Then it depends on the answer classes. For example, was only one edge case forgotten? Then there should be a (smaller) boost.

Case 3: the answer is completely wrong, or certain Common Errors were made that indicate someone is not applying a concept correctly --> no boost or negative boost.

etc.
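
As a toy illustration of just these three cases (the numbers are placeholders; the real weights would have to come from the annotations or a future specification):

def boost_weight(case):
    """Map the three cases above to an (invented) boost weight in [-1, 1]."""
    return {
        "correct": 1.0,              # Case 1: full boost
        "edge_case_forgotten": 0.3,  # Case 2: smaller boost
        "common_error": -0.5,        # Case 3: no boost or a negative boost
    }.get(case, 0.0)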

kohlhase commented 8 months ago

Yes, I agree with almost all you say. BUT the topic here is S/MCQs and how to deal with them in quiz questions. There, it is interesting to have a stop-gap solution that can easily be realized with the data/annotations we currently have.

Jazzpirate commented 8 months ago

But in many situations, apply is the most important one to be tested by problems.

I somewhat doubt that is the case, simply because the kinds of problems that really can be evaluated automatically to trigger automatic user model updates are likely not the kinds of problems that can test apply well :D

I would guess that we will be constrained to MCQ/SCQ/fill-in-the-blank problems for quite a while, and they're probably more on the remember/understand side of things.

Hundecode commented 8 months ago

A good multiple/single choice task does not include random "wrong" answers in addition to the correct answers, but the wrong answers are didactically cleverly chosen to trigger certain Common Errors.

Therefore, each wrong answer (or even a combination of wrong ones, in MC) is an answer class.

That means the algorithm for MC/SC questions is very simple:

If the answer(s) is completely correct --> update the learner model based on the annotated objectives and information in the learner model (e.g. how many times this task, or a task with identical objectives and concepts, has been done before).

If the answer was incorrect --> update the learner model based on the answer classes. The information for this is annotated in the MC task, i.e. which boost happens for which answer class.

Jazzpirate commented 8 months ago

That means the algorithm for MC/SC questions is very simple

What you stated is not an algorithm, and it leaves exactly the questions open that we need to answer:

Update the learner model based on the annotated objectives and information in the learner model (e.g. how many times this task, or a task with identical objectives and concepts has been done before)

^ how exactly? The "e.g." is doing a lot of work here. How exactly is the update "based on" the objectives and the learner model?

The information for this is annotated in the MC task, i.e. which boost happens for which answer class.

And now I need a precise specification of how this information is annotated, to such an extent that it can serve as an adequate, well-defined and complete input to the algorithm.

Hundecode commented 8 months ago

Let's work through an example:

Task: Consider the following equation and indicate for which x \in \realNumbers it holds: 2 + 3*4 + 3^0 = 2x

a) 10 b) 7 c) 7,5 d) 10,5

Let's assume the objective is <Apply, Something>.

The answer classes are:
AC1 = {correct solution}
AC2 = {multiplication/division before addition/subtraction not considered}
AC3 = {^0 is calculated as 0 and not as 1}

Therefore:
a) maps to {AC2, AC3}
b) maps to {AC3}
c) maps to {AC1}
d) maps to {AC2}

If AC1 is the case, we boost the learner model with <Apply, Something>.

The other ACs trigger different things, e.g. AC3 --> decrease <remember, power law>.
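
Encoding this example as data could look roughly like this (a hypothetical sketch: the boost weights and the AC2 update are invented for illustration; only the AC1 boost and the AC3 --> decrease <remember, power law> come from the text above):

# The correct evaluation: 2 + 3*4 + 3^0 = 2 + 12 + 1 = 15, so x = 7.5.
ANSWER_CLASSES = {
    "AC1": "correct solution",
    "AC2": "multiplication/division before addition/subtraction not considered",
    "AC3": "^0 is calculated as 0 and not as 1",
}

CHOICE_TO_CLASSES = {
    "a) 10":   {"AC2", "AC3"},  # (2+3)*4 + 0 = 20   -> x = 10
    "b) 7":    {"AC3"},         # 2 + 12 + 0 = 14    -> x = 7
    "c) 7,5":  {"AC1"},         # 2 + 12 + 1 = 15    -> x = 7.5
    "d) 10,5": {"AC2"},         # (2+3)*4 + 1 = 21   -> x = 10.5
}

# answer class -> [(concept, dimension, boost weight)]; weights are invented
CLASS_TO_UPDATES = {
    "AC1": [("Something", "apply", 1.0)],
    "AC2": [("order of operations", "remember", -0.5)],
    "AC3": [("power law", "remember", -0.5)],
}

def updates_for(choice):
    """Collect all learner-model updates triggered by a selected choice."""
    return [u for ac in sorted(CHOICE_TO_CLASSES[choice])
            for u in CLASS_TO_UPDATES[ac]]

print(updates_for("a) 10"))
# [('order of operations', 'remember', -0.5), ('power law', 'remember', -0.5)]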

Jazzpirate commented 8 months ago

If AC1 is the case, we boost the learner model with <Apply, Something>. The other ACs trigger different things, e.g. AC3 --> decrease <remember, power law>.

"different things", "boost with", "decrease" are undefined (and exactly the topic of this issue). The information that "e.g. AC3 --> decrease <remember, power law>" is not present in your answer classes, and needs an annotation schema (which is the topic of this issue).

Hundecode commented 8 months ago

During today's Systems meeting it became clear that we need to better pin down exactly how and what needs to be annotated in learning objects (especially quiz questions) for sensible learner model updates.

I focused on the "what" needs to be annotated

Let's use this concrete example as a working example. I will also think about the "how" until Monday

Jazzpirate commented 8 months ago

We still don't have the "what" - my assumption is that if we know what exactly we need in terms of annotations, we get the "how" quite naturally for free.

multiplication/division before addition/subtraction not considered, ^0 is calculated as 0 and not as 1 <- your answer classes diagnose causes that are not observable, so if I understood you correctly, these are not answer classes? Can we guarantee that a student answering "10" did not just guess a nice number? Or did we decide that we want to diagnose in favor of better user model updates? (I'm fine with either, but I'm surprised that you seem to have changed your mind)

Also, note that if the "how" contains "we associate each possible answer with an answer class, and each answer class with precise user model updates for arbitrarily many prerequisites and objectives", then this can easily explode, in particular for multiple choice questions - which does not mean that it's not what we should do / allow for, but in that case we should also think about reasonable defaults or easily specifiable abbreviations, so that we don't demand way too much of authors

Hundecode commented 8 months ago

multiplication/division before addition/subtraction not considered, ^0 is calculated as 0 and not as 1 <- your answer classes diagnose causes that are not observable, so if I understood you correctly, these are not answer classes? Can we guarantee that a student answering "10" did not just guess a nice number? Or did we decide that we want to diagnose in favor of better user model updates? (I'm fine with either, but I'm surprised that you seem to have changed your mind)

the fact that a term a+bc is evaluated to (a+b)c is observable. What is not observable is whether someone has "understood" or "remembered" this rule or not. That's why I don't like MC questions: there is no visible way of calculating, the answers are already given, and the probability that someone just guesses is much higher. However, I agree with you that the combination of both cases is somewhat critical; in a way, I assumed that both errors happened at the same time.

And yes I agree. We will never know if someone just guessed. Just as we can never tell if someone has understood a concept or simply found the solution on the Internet and copied it in. We can only have some evidence that it is the case.

Jazzpirate commented 8 months ago

the fact that a term a+bc is evaluated to (a+b)c is observable.

No it's not. What is observable is either "10", "7", "7,5" or "10,5", and only that.

We will never know if someone just guessed

Therefore, how a student arrived at their answer is not observable. Therefore, you are guessing a student's mental state and diagnosing likely errors. Which, again, is fine, but that means that answer classes heuristically determine the likely causes of answers. If we agree on that, there's a chance we're updating the user model wrongly, but there's more potential to update it more precisely as well.

Hundecode commented 8 months ago

the fact that a term a+bc is evaluated to (a+b)c is observable.

No it's not. What is observable is either "10", "7", "7,5" or "10,5", and only that.

We will never know if someone just guessed

Therefore, how a student arrived at their answer is not observable. Therefore, you are guessing a student's mental state and diagnosing likely errors. Which, again, is fine, but that means that answer classes heuristically determine the likely causes of answers. If we agree on that, there's a chance we're updating the user model wrongly, but there's more potential to update it more precisely as well.

Then it means that in MC questions nothing is observable at all, except that a certain (combination of) answers was clicked. Not even that ... it could be that the learner slipped, fell on the keyboard and accidentally selected answer c)

Jazzpirate commented 8 months ago

Then it means that in MC questions nothing is observable at all, except that a certain (combination of) answers was clicked.

Yes, exactly. An answer class is precisely a (combination of) choice(s), as you correctly pointed out earlier; we discussed this often and agreed on it about half a year ago.

it could be that the learner slipped, fell on the keyboard and accidantly selected answer c)

In which case they still answered c), so that doesn't matter :)

lambdaTotoro commented 8 months ago

Hundecode: That is exactly what we do NOT want. Answer classes are based on purely objectively observable criteria. Whether a learner understands something cannot be objectively observed and therefore may not be an answer class.

We will not get around guessing at some mental states of learners. If we rule that out, the only updates we could possibly do are increasing the values for the objectives and prerequisites (because those are the only things independent of exactly what went wrong or how a question was solved).
But I think this would lock us out of the big advantage of this classification: the educators can tell (with a sufficient™ degree of accuracy) what exactly was going on in the heads of students from the answers they gave and, by annotating this, can make this knowledge available to ALeA.

Forgive me for going out of the MC/SC paradigm for a second, but if the task is "Write a python function isEven() that takes one positive integer and returns one boolean indicating whether or not the given integer is divisible by 2 without remainder.", then one correct solution is the following:

def isEven(n):
    # n ^ 1 flips the lowest bit, so it equals n + 1 exactly when n is even
    return (n ^ 1) == (n + 1)

But another correct solution is this one:

# mutual recursion: isEven and isOdd call each other, decrementing n each step
def isEven(n):
    return True if n == 0 else isOdd(n-1)

def isOdd(n):
    return False if n == 0 else isEven(n-1)

If we now restrict ourselves to only updating on what is directly observable (i.e. whether the solution is correct or not), then we can only update the goals/prerequisites of understanding parity or whatever. But I think we definitely also want to be able to annotate that the student with the first answer knows about bitwise operators and the student with the second answer knows about mutual recursion. Leaving that out would hurt both the quality of the learner model and the effectiveness of the system, imho.

That's why I think you should be able to specify per class what should be updated and by how much (on whatever scale we pick). Something like the < concept × cognitive dimension × direction × severity? > tuples I mentioned above.
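
As a data sketch only (the scale and field names are placeholders, not a decided format), such per-class update tuples could look like this, using the two Python solutions above as motivating feedback classes:

from typing import NamedTuple

class Update(NamedTuple):
    concept: str
    dimension: str    # e.g. "remember", "understand", "apply"
    direction: int    # +1 = increase, -1 = decrease
    severity: float   # magnitude on whatever scale we pick, e.g. [0, 1]

# feedback/answer class -> learner-model updates (all values invented)
FEEDBACK_CLASS_UPDATES = {
    "solution_correct":       [Update("parity", "understand", +1, 0.8)],
    "uses_bitwise_operators": [Update("bitwise operators", "apply", +1, 0.6)],
    "uses_mutual_recursion":  [Update("mutual recursion", "apply", +1, 0.6)],
}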

Hundecode commented 8 months ago

We will not get around guessing at some mental states of learners.

Then we are doing something fundamentally wrong. Even a good teacher does not guess mental states; they cleverly set follow-up tasks or draw on information about the "learner model". From this information, they get evidence about what caused the error.

If we rule that out, the only update we could possibly do are increasing the values for the objectives and prerequisites (because those are the only things independent of exactly what went wrong or how a question was solved).

No, we can do much more! But we do not guess mental states or let the educator annotate them. Instead, we use information from the Learner Model in combination with the answer classes to update the Learner Model in a meaningful way. This is exactly what an educator does.

the educators can tell (with a sufficient™ degree of accuracy) what exactly was going on in the heads of students from the answers they gave and, by annotating this, can make this knowledge available to ALeA.

No, they can only do that if they have information about the learner, i.e. a learner model. If they just see an answer from an unfamiliar learner, then they can't infer any information about whether it's sloppy work or a misconception or whatever.

Hundecode commented 8 months ago

But I think we definitely also want to be able to annotate that the student with the first answer knows about bitwise operators and the student with the second answer knows about mutual recursion. Leaving that out would hurt but the quality of the learner model and the effectiveness of the system, imho.

A very similar example is in the Answer-Class paper, because that is exactly what we want to do. We want to distinguish not only right/wrong but also, in the case of correct answers, which concepts were used, and update the user model accordingly. That's why answer classes do not only address "wrong" answers. There is also an answer class "the answer uses recursion", etc.

Jazzpirate commented 8 months ago

@Hundecode Then we need a theory of how answer classes AND a learner model together entail a specific learner model update, an algorithm that performs that update, and a way to specify the required information (which now consists of objectives + prerequisites + answer classes + learner models) in the form of annotations that are feasible for an author of a quiz to actually write down. And even then, a learner model will never give you a reliable way to estimate a learner's mental state, only a less inaccurate one.

If we can develop such a schema that is realistically usable in practice, I certainly won't stand in the way. In the absence of that, and in the short term, we need to focus on what we can actually do ;) And one thing we can do is 1. use answer classes directly and update models only poorly, based on observable evidence alone, or 2. use "feedback classes" (or whatever we called them) interchangeably with answer classes and heuristically guess mental states to provide better user model updates, at the cost of occasionally updating too harshly or entirely wrongly. (Actually, we can probably never exclude the latter, but well.)

Jazzpirate commented 8 months ago

I should also emphasize again that there's a difference to be made between:

  1. What we can do in the long run, once we have hashed out a good theory of how to do things and have the data and experience to do something fancy, and
  2. what we can do right now with the quiz questions we do have, which are being actively done every Tuesday.

We can easily do something simple for now and move on to something fancier later, when we know what works or doesn't work and how to do it better.

lambdaTotoro commented 8 months ago

Hundecode: Even a good teacher does not guess mental states; they cleverly set follow-up tasks or draw on information about the "learner model". From this information, they get evidence about what caused the error.

Of course they do. Interpreting evidence present in answers given by students is "guessing" mental states (since you cannot observe them, you have to extrapolate), previous learner model state or no previous learner model state.

Hundecode: Instead, we use information from the Learner Model in combination with the answer classes to update the Learner Model in a meaningful way.

Okay, but what exact information do we use and how exactly do we use it? At the end of this, there needs to be a concrete algorithm and that can't rely on words like "meaningful", only on "this parameter we decided on", "that parameter we pulled from the learner model" or "those parameters the author annotated".