jupyter / nbgrader

A system for assigning and grading notebooks
https://nbgrader.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

[Discussion] Autograding #1399

Open tmetzl opened 3 years ago

tmetzl commented 3 years ago

I wanted to discuss how autograding works right now and how we could improve it. I am willing to help implement this. Please let me know what you think about it and if you have similar issues.

Partial Credit

Current behavior: Right now, partial credit works by scanning the output of the test cell for an execute result that can be cast to a floating point number; that number is then taken as the partial credit.

Problem: Students and graders are interested not only in how many points are assigned, but also in why a test case failed. For example, if we simply want to test the value of a variable, the possible outcomes include Variable not defined, Relative error is x%, Wrong type of variable, got type X, expected type Y, etc.

Right now we can either print these error messages or return partial credit, but not both.

Solution approach: Instead of checking for an execute result and casting it to a float, we could use delimiters that tell the autograder where in the output the partial credit can be found. For example:

Test Cell Output:

-----------------------------------------------
Test for variable x failed!
Expected type is Number, got type String

Test for variable y passed!
-----------------------------------------------
### BEGIN GRADE
3.0
### END GRADE

This way we could report both the reasons for the grade and the grade itself.
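
On the autograder side, extracting such a grade could be a simple matter of parsing the delimited block out of the output. A minimal sketch, assuming the delimiters proposed above (the function name and regex are illustrative, not an existing nbgrader API):

import re

# Illustrative sketch: pull the partial credit out of a test cell's output
# using the proposed delimiters. Returns None when no grade block is found,
# so the autograder could fall back to the current float-casting behavior.
GRADE_RE = re.compile(r"### BEGIN GRADE\s*(.*?)\s*### END GRADE", re.DOTALL)

def extract_grade(output):
    match = GRADE_RE.search(output)
    if match is None:
        return None
    return float(match.group(1))

# The example output above would yield 3.0
assert extract_grade("...\n### BEGIN GRADE\n3.0\n### END GRADE\n") == 3.0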

Automatic Comments

Current Behavior: The only automatic comment that nbgrader gives is "No response", shown when a student did not answer the question at all.

Problem: Once you give an assignment multiple times, you start to identify common mistakes students make; the comment block in the example below shows two that we identified in our math class at our university.

We could give those comments as the output of the test cell, but students are often confused by the outputs of test cells. It would be nice to automatically add these comments to the solution cell. The biggest problem for me is that there is no connection in the database between a test cell and the corresponding solution cell, so there is no standard way of accessing the solution cell that is being tested by a certain test cell.

Solution Approach: Store a reference from each test cell to the solution cell it tests, and use delimiters in the output to identify where the comments are. For example:

Test Cell Output:

-----------------------------------------------
Test for variable x failed!
Expected type is Number, got Tuple

Test for variable y passed!
-----------------------------------------------
### BEGIN COMMENTS
- Use a decimal point for floating point numbers
- Your function does not return a value but prints it
### END COMMENTS
### BEGIN GRADE
3.0
### END GRADE
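
Extracting the comment block could work the same way as extracting the grade; the parsed comments would then be attached to the referenced solution cell. Again a sketch based on the proposed delimiters, not existing behavior:

import re

# Illustrative sketch: extract the proposed comment block from a test cell's
# output so the comments can be attached to the referenced solution cell.
COMMENTS_RE = re.compile(r"### BEGIN COMMENTS\s*(.*?)\s*### END COMMENTS", re.DOTALL)

def extract_comments(output):
    match = COMMENTS_RE.search(output)
    if match is None:
        return []
    # One comment per "- " bullet line
    return [line.lstrip("- ").strip() for line in match.group(1).splitlines()]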

Autograding of Markdown Cells

Current Behavior: No autograding of markdown cells is supported.

Problem: You cannot set a markdown cell to be autograded, and in the test cell you cannot get a reference to the solution cell.

Current hacky solution: Create a markdown cell worth 0 points with grade_id answer1. Create a test cell worth the number of points you want to give, with the following content:

import nbformat

solution_cell_grade_id = 'answer1'
# 'thisnotebook.ipynb' is the filename of the current assignment notebook
nb = nbformat.read('thisnotebook.ipynb', as_version=4)

# Find the solution cell by its nbgrader grade_id
source = None
for cell in nb.cells:
    if cell.metadata.get('nbgrader', {}).get('grade_id') == solution_cell_grade_id:
        source = cell.source
        break

# The content of the solution cell is now in the variable source
# and can be used as the input to NLP based grading functions
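
To make the idea concrete, the test cell could then grade the extracted source; here simple keyword matching stands in for a real NLP model. Everything below (the keywords, the point value) is made up for illustration and continues the test cell above:

# Hypothetical continuation of the test cell: award credit for each required
# keyword found in the student's markdown answer.
required_keywords = ['monotone', 'bounded', 'convergent']
max_points = 3.0

if source is not None:
    found = sum(1 for kw in required_keywords if kw in source.lower())
    print('### BEGIN GRADE')
    print(max_points * found / len(required_keywords))
    print('### END GRADE')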

Solution Approach: Maybe keep a reference to the solution cell in the database and, while executing the notebook, automatically fill some variable with the content of that cell in case it is a markdown cell.

Changing Autograder Tests after Grading

Current Behavior: While grading submissions, you might grade some of them and then realize your test cases could have been better. Currently you cannot regenerate the release version of the notebook and update the tests if there are already grades for that assignment in the database. Your only options are to delete all the grades you have or to fiddle with the database itself to update the tests.

Problem: You might have already graded some manually graded questions and don't want to lose the grades for those.

Solution Approach: Have a way to update a single test cell and rerun autograding just for that test cell, ignoring the grades of all other cells in the database.
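
For completeness, "fiddling with the database" today amounts to something like the sketch below, which clears the autograder's score for a single grade cell so only that cell needs re-grading. All identifiers are placeholders, and this reflects my understanding of the Gradebook database API, not a supported workflow:

from nbgrader.api import Gradebook, MissingEntry

# Manual workaround sketch: reset the auto score of one grade cell for every
# student, leaving all manual grades and other cells untouched.
with Gradebook('sqlite:///gradebook.db') as gb:
    for student in gb.students:
        try:
            grade = gb.find_grade('test_x', 'notebook1', 'assignment1', student.id)
        except MissingEntry:
            continue  # this student has no grade entry for the cell
        grade.auto_score = None
    gb.db.commit()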

myedibleenso commented 3 years ago

@tmetzl, these seem like very useful directions! I know you opened this issue a while back, but I'd like to better understand your proposal.

-----------------------------------------------
Test for variable x failed!
Expected type is Number, got Tuple

Test for variable y passed!
-----------------------------------------------

Does this mean that either "Test for variable x failed!" or "Expected type is Number, got Tuple" errors have to be seen together with a successful "Test for variable y passed!" in order to earn 3 points partial credit?

Are you proposing being able to define multiple possible partial assignments? Continuing with that idea, if two or more possible assignments were to match, would the first match in the order they're listed take precedence?

tmetzl commented 3 years ago

@myedibleenso, I hope I understand your question correctly. I am not sure what you mean by "Are you proposing being able to define multiple possible partial assignments?"

So it might be easiest to take an example from an actual assignment I have given.

Say we have a question about a chi2 test from a statistics class (worth 10 points), where students are expected to calculate a chi2 value and a p-value for some given data. Then the test does the following:

1a) Check if chi2 value is defined
1b) Check if chi2 is of type Number
1c) Check if chi2 value is correct
2a) Check if p-value is defined
2b) Check if p-value is of type Number
2c) Check if p-value is correct

Now I give 5 points for a correct chi2 value and 5 points for a correct p-value. Even if one group of tests (1 or 2) fails, the other group is still executed. In order for the chi2 part to pass, all of tests 1a)-1c) have to be successful.

Of course this could be extended by weighting tests differently (say 7 points for chi2 and 3 points for the p-value) or by having nested tests with partial credit, where a student could also score 2/4 points for the p-value if it is only partially correct.
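
A test cell implementing this structure might look like the following sketch. It uses the delimiters proposed earlier; the expected values, tolerance, and point split are placeholders:

# Sketch of a test cell for the chi2 question: each variable is checked
# independently (defined -> type -> value) and earns its points separately.
def check(name, expected, namespace, points, tol=1e-3):
    if name not in namespace:                      # a) defined?
        return 0.0, f"Test for variable {name} failed! {name} is not defined"
    value = namespace[name]
    if not isinstance(value, (int, float)):        # b) correct type?
        return 0.0, (f"Test for variable {name} failed! "
                     f"Expected type is Number, got type {type(value).__name__}")
    if abs(value - expected) > tol:                # c) correct value?
        return 0.0, f"Test for variable {name} failed! Value is wrong"
    return points, f"Test for variable {name} passed!"

results = [
    check('chi2', 7.815, globals(), points=5.0),     # tests 1a)-1c)
    check('p_value', 0.05, globals(), points=5.0),   # tests 2a)-2c)
]
for _, message in results:
    print(message)
print('### BEGIN GRADE')
print(sum(points for points, _ in results))
print('### END GRADE')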