ioi-germany/cms

Feedback and Unit Tests #2

Closed: t-lenz closed this issue 9 years ago

t-lenz commented 9 years ago
fagu commented 9 years ago

I think we should display the time and memory used (not just whether they are in range) in the unit test tables.

fagu commented 9 years ago

When I go to the list of submissions for a task in the admin interface and expand one of the status entries, I get the following error: [Cannot get score type - see logs]

t-lenz commented 9 years ago

Regarding time and memory: when only presenting the time and memory used, you lose the information whether the limits are really missed or whether the outcome is ambiguous (i.e. close to the limit). Also, in most cases these numbers are not what you're really interested in. So I would suggest still using the symbols, but adding a tooltip to each of these cells that presents the actual numbers.
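
For illustration, such a cell could be rendered roughly like this (a minimal sketch; the helper name and CSS classes are made up and not the actual template code):

```python
# Hypothetical rendering helper, not actual CMS template code: keep the
# verdict symbol visible in the cell and put the measured numbers into a
# title attribute, which browsers show as a tooltip on hover.
def render_cell(verdict, time_used, time_limit, mem_used, mem_limit):
    symbols = {"ok": u"\u2713", "ambiguous": u"\u2248", "fail": u"\u2717"}
    tooltip = u"time: {0:.3f}s / {1:.3f}s; memory: {2:.1f} / {3:.1f} MiB".format(
        time_used, time_limit, mem_used / 2.0 ** 20, mem_limit / 2.0 ** 20)
    return u'<td class="{0}" title="{1}">{2}</td>'.format(
        verdict, tooltip, symbols[verdict])
```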

t-lenz commented 9 years ago

Also, a "visual error": when you expect a group to fail due to, say, "time", and the solution solves all of its testcases, all these cells are colored green (because each individual case is perfectly okay). Maybe we should set their background to a special color (e.g. yellow), or set the whole column to red iff the group fails.
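
A sketch of the coloring rule I have in mind (purely illustrative; the names are made up):

```python
# Illustrative only: keep the per-case colors, but override the group's
# background when a group that is expected to fail is solved completely.
def group_background(expected_to_fail, case_verdicts):
    if expected_to_fail and all(v == "ok" for v in case_verdicts):
        return "yellow"  # unit test fails although every cell is green
    return None          # no override; the per-case colors apply
```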

t-lenz commented 9 years ago

This should fix most of the issues mentioned above. However, we still need to discuss the info parameter.

fagu commented 9 years ago

I can't get it to run anymore. When importing, it complains that TaskConfig doesn't have a property named "unique_name". I guess we should add that.

Edit: The other problem I described here was due to an old contest in my database.
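
For reference, the missing property could look roughly like this (a guess on my side; in particular, the fallback to the plain task name is an assumption, not necessarily how the importer expects it to behave):

```python
# Hypothetical sketch of the missing TaskConfig property; the fallback
# to the plain task name is an assumption.
class TaskConfig(object):
    def __init__(self, name, **kwargs):
        self.name = name
        self._unique_name = kwargs.get("unique_name")

    @property
    def unique_name(self):
        # Assumed behavior: fall back to the short task name when no
        # explicitly unique identifier was configured.
        return self._unique_name or self.name
```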

fagu commented 9 years ago

Also, I get exceptions of the following form when test submissions are scored after the public test cases but before the private test cases have been evaluated:

2014/09/07 17:32:18 - ERROR [ScoringService,0] Unexpected error when scoring submission 53 on dataset 5.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cms-1.2.0pre-py2.7.egg/cms/service/ScoringService.py", line 109, in _scorer_loop
    self._score(submission_id, dataset_id)
  File "/usr/lib/python2.7/site-packages/cms-1.2.0pre-py2.7.egg/cms/service/ScoringService.py", line 171, in _score
    score_type.compute_score(submission_result, True)
  File "/usr/lib/python2.7/site-packages/cms-1.2.0pre-py2.7.egg/cms/grading/scoretypes/SubtaskGroup.py", line 89, in compute_score
    return self.unit_test_compute_score(submission_result, public)
  File "/usr/lib/python2.7/site-packages/cms-1.2.0pre-py2.7.egg/cms/grading/scoretypes/SubtaskGroup.py", line 370, in unit_test_compute_score
    evaluations[idx])
KeyError: u'0004'

The reason here is that 0004 is a private test case for which no evaluation exists yet.

I don't think it makes sense to validate unit tests for public test cases with higher priority (whatever that would mean...), so we shouldn't do anything before all test cases have been evaluated.
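
Concretely, I would guard the scoring roughly like this (sketch only: fetch_submission_result is hypothetical shorthand for the existing database lookup, and evaluated() is, as far as I can tell, the SubmissionResult check that every testcase already has an evaluation):

```python
# Sketch of the proposed guard in ScoringService._score; the fetch
# helper is hypothetical shorthand for the existing database lookup.
def _score(self, submission_id, dataset_id):
    submission_result = fetch_submission_result(submission_id, dataset_id)
    if not submission_result.evaluated():
        # Some (private) testcases are still pending, so postpone the
        # scoring instead of running into a KeyError on a missing
        # evaluation like u'0004'.
        return
    score_type = get_score_type(dataset=submission_result.dataset)
    score_type.compute_score(submission_result, True)
```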

By the way, what would you think about scoring test submissions both in the normal way (user_compute_score) and in the unit test way (unit_test_compute_score) and saving both to the database?

t-lenz commented 9 years ago

This should fix the issues you mentioned.

t-lenz commented 9 years ago

> By the way, what would you think about scoring test submissions both in the normal way (user_compute_score) and in the unit test way (unit_test_compute_score) and saving both to the database?

To be honest, I don't see your point here. Can you think of an application?

Also, this could get rather nasty to implement, since at the moment submission_id is a primary key for SubmissionResult, i.e. we can't have two results for the same submission in the database. Basically we could

The first option would be rather slow, whereas the second would not save anything to the database.

fagu commented 9 years ago

Saving both to the database seems cleaner to me. For example, you compute public_score and score both in user_compute_score and in unit_test_compute_score, which is slightly redundant. I also find it a bit worrying that public_score_details, score_details, ranking_details and max_scores have a different meaning depending on whether we look at a unit test or not, which means we have to make sure that other parts of the code (that we didn't write ourselves, and that might change in the future) don't rely on those values.

I wanted to propose just adding a column unit_test_score_details to the SubmissionResult table (which already has score, public_score, score_details, public_score_details, ...). We would then call unit_test_compute_score (in addition to user_compute_score) from here: https://github.com/ioi-germany/cms/blob/feedback_and_tests/cms/service/ScoringService.py#L176

It shouldn't require much work to do this (and I could actually do it if you want). What do you think about this?
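
Concretely, the change could look roughly like this (not a tested patch; the SQLAlchemy details are an assumption modeled on the existing *_details columns):

```python
from sqlalchemy import Column, String

# New column next to the existing score_details / public_score_details
# columns of SubmissionResult; Base is the declarative base from cms.db.
class SubmissionResult(Base):
    # ... existing columns: score, public_score, score_details, ...
    unit_test_score_details = Column(String, nullable=True)

# ... and in ScoringService._score, compute and store both scorings:
score, details, public_score, public_details, ranking_details = \
    score_type.compute_score(submission_result)
submission_result.score = score
submission_result.score_details = details
submission_result.public_score = public_score
submission_result.public_score_details = public_details
submission_result.unit_test_score_details = \
    score_type.unit_test_compute_score(submission_result)
```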

t-lenz commented 9 years ago

That sounds good to me. Go ahead!

fagu commented 9 years ago

I've created the unit_test_score_details field in the database. If you made any commits before pulling, you can instead git fetch and then git rebase so that we don't get those nasty merge commits.

fagu commented 9 years ago

And I'm still in favor of passing the submission_info parameter to compute_unit_test_score instead of to the ScoreType constructor. Basically, the reason is again that the way we currently do this breaks somewhat deep assumptions about the behavior of ScoreType, and thus potentially parts of the existing code and of future code in the Italian upstream version. I would also prefer to put the "unit test" case of max_scores into a separate function (as it is, the max_scores function does not behave as its documentation string says).
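
To make this concrete, the interface I have in mind would look roughly like this (a sketch with bodies elided; the names follow our discussion):

```python
# Sketch only: submission_info is passed per call, so the ScoreType
# constructor keeps its usual contract and the regular scoring paths
# stay untouched.
class SubtaskGroup(ScoreType):
    def compute_unit_test_score(self, submission_result, submission_info):
        # submission_info carries the expected verdicts of this test
        # submission; user_compute_score never sees it.
        raise NotImplementedError("sketch only")

    def unit_test_max_scores(self):
        # Kept separate from max_scores(), whose docstring describes
        # only the regular behavior.
        raise NotImplementedError("sketch only")
```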

t-lenz commented 9 years ago

I'm finally convinced. Unfortunately, I won't be able to work on the CMS for at least the next three weeks.

t-lenz commented 9 years ago

This should finally fix this issue.

t-lenz commented 9 years ago

This should fix the issues mentioned above.

fagu commented 9 years ago

Would you mind adding the actual time and memory used to the command line tables as well (analogous to the tooltips in the HTML tables)?

t-lenz commented 9 years ago

Regarding time and memory in cmsGerMake: honestly, I can't think of a good presentation for this: additionally showing them on a line by themselves could be quite opaque. Furthermore, you normally run cmsGerMake on your own machine, which renders at least the time somewhat useless.

fagu commented 9 years ago

But isn't just displaying ✓, ≈ or ✗ even more useless on your own machine, since it not only suffers from the difference between the machines but also from extreme rounding (to only three different values)? What would be so bad about simply displaying the numbers next to the symbols (maybe the symbols to the right of the numbers) and making the table a little wider? Not the most beautiful thing, but worth it in my opinion.

By the way, could we arrange for fewer line breaks in the verdict columns? I find those a bit distracting since they basically double the tables' heights.
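
For example, a cell could be formatted roughly like this (a sketch; widths and layout are arbitrary):

```python
# Hypothetical formatting helper for the cmsGerMake tables: measured
# value first, verdict symbol to its right, padded to a fixed width so
# the rows stay aligned.
def format_cell(value, unit, symbol, width=12):
    return u"{0:.3f} {1} {2}".format(value, unit, symbol).rjust(width)

# e.g. format_cell(0.742, "s", u"\u2248") == u"   0.742 s \u2248"
```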