askmarilyn: Can she (as the Judge) tell me how much I drank?

thorehusfeldt commented 5 years ago

The infrastructure with interactive problems is still a bit of a mystery to me. (It’s hard to simulate the user experience from the command line.)

I would love Marilyn to end the interaction with “You ended up with 659 beers. Well done.” (for AC) or “You ended up with only 502 beers. Too bad.” (for WA). This would be particularly useful for solvers who implement the wrong strategy. I know there is some way for Kattis to show hints, but I’m not sure how to write judge messages that end up in the right place.

simonlindholm commented 5 years ago

The way to do it is to read a directory name from argv[3], then create $dir/teammessage.txt with the output message (this is vaguely documented at https://www.problemarchive.org/wiki/index.php/Output_validator ). I think Kattis then shows you the message for the testcase you fail on, if you do, but not for earlier accepted testcases.
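A minimal sketch of that mechanism, for illustration only. It assumes the validator is invoked with the feedback directory as its third argument, as described above; the helper name `write_team_message` is hypothetical and is not part of problemtools or validate.h.

```python
import os


def write_team_message(feedback_dir, message):
    """Write `message` to teammessage.txt inside the feedback directory.

    `feedback_dir` is assumed to be the directory name the validator
    receives as argv[3]. Returns the path of the file that was written.
    """
    path = os.path.join(feedback_dir, "teammessage.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(message + "\n")
    return path


# Hypothetical use inside a Python validator:
#   import sys
#   write_team_message(sys.argv[3], "You ended up with 659 beers. Well done.")
```

The validator would still have to signal the verdict itself; this only covers the message file.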

I agree that it's nice to tell the user whether their strategy is bad or if they are just outputting things in the wrong format.

thorehusfeldt commented 5 years ago

Turns out there is an author_feedback function in validate.h that talks to that directory for me. I’ve tried to cook something up in feat/marilynfeedback but it feels extremely fragile. I have no way of telling whether this actually works.

Well… I can of course run one-round interactions by hand (and that works and puts teammessage.txt in the right directory), but what is needed is to run all the input files against all the submissions and look at their feedback. Do I have to write that script myself? (I can, and would consider it a good exercise. But I’d prefer to learn problemtools.)

simonlindholm commented 5 years ago

I don't know a good way to test it; I would be inclined to take your commit as is and then test it on Kattis when the problem has been installed (which could happen far in advance of the contest if you email Kattis people and tell them that this requires early testing). If you feel the need for manual testing, then yeah, you probably do need to write that script yourself, or hack problemtools to leave temporary files around -- there's no built-in functionality for it AFAIK.

thorehusfeldt commented 5 years ago

I’ll write my own script. ’twill be good fun. Where do I put it without confusing problemtools?

simonlindholm commented 5 years ago

Basically anywhere that isn't input_format_validators/, output_validators/, graders/, submissions/<subdir>/ or data/<subdir>/. In particular it's fine just to put it at toplevel.

thorehusfeldt commented 5 years ago

OK, I’m going this route:

# verifyproblem askmarilyn -l info
Loading problem askmarilyn
[...]
Checking submissions
INFO : Check AC submission sl.cpp (C++)
[...]
INFO : Running on test case group data/secret
INFO : Test file result: AC [message: Congratulations! You got 653 drinks., CPU: 0.01s @ test case secret/1-0]
[...]

I find this extremely useful for development. Does it make sense to polish it and make it a pull request or am I wasting everybody’s time with something like that?

thorehusfeldt commented 5 years ago

It now summarises over test data and reports the last message for each submission, just like the rest of the verdicts. Seems to work great and makes the problem so much more accessible.

   AC submission sl.cpp (C++) OK: AC [message: Congratulations! You got 650 drinks., CPU: 0.10s @ test case secret/2-2]
   AC submission thore.py (Python 3) OK: AC [message: Congratulations! You got 649 drinks., CPU: 0.16s @ test case secret/3-1]
   Slowest AC runtime: 0.160, setting timelim to 1 secs, safety margin to 2 secs
   WA submission always_door_A.py (Python 3) OK: WA [message: 330 drinks in 1000 rounds. Too bad., test case: test case secret/1-0, CPU: 0.04s @ test case secret/1-0]
   WA submission break_protocol.py (Python 2 w/PyPy) OK: WA [message: Your guess must be a valid door name, such as A., test case: test case secret/1-0, CPU: 0.02s @ test case secret/1-0]
   WA submission first_a.py (Python 3) OK: WA [message: 502 drinks in 1000 rounds. Too bad., test case: test case secret/1-1, CPU: 0.14s @ test case secret/1-0]
   WA submission first_b.py (Python 3) OK: WA [message: 532 drinks in 1000 rounds. Too bad., test case: test case secret/1-2, CPU: 0.13s @ test case secret/1-2]
   WA submission first_c.py (Python 3) OK: WA [message: 506 drinks in 1000 rounds. Too bad., test case: test case secret/1-3, CPU: 0.14s @ test case secret/1-3]
   WA submission ignore_positive_hint.py (Python 3) OK: WA [message: 0 drinks in 1000 rounds. Too bad., test case: test case secret/1-0, CPU: 0.14s @ test case secret/1-0]
   WA submission plays-1001-rounds.py (Python 3) OK: WA [message: You won't stop talking!, test case: test case secret/1-0, CPU: 0.05s @ test case secret/1-0]
   WA submission plays-999-rounds.py (Python 3) OK: WA [message: You must begin round 1000 by guessing a door., test case: test case secret/1-0, CPU: 0.13s @ test case secret/1-0]
   WA submission plays-forever.py (Python 3) OK: WA [message: You won't stop talking!, test case: test case secret/1-0, CPU: 0.05s @ test case secret/1-0]
   WA submission random_door.py (Python 3) OK: WA [message: 342 drinks in 1000 rounds. Too bad., test case: test case secret/1-0, CPU: 0.12s @ test case secret/1-0]
   WA submission silent-death.py (Python 2 w/PyPy) OK: WA [message: You must begin round 1 by guessing a door., test case: test case secret/1-0, CPU: 0.02s @ test case secret/1-0]
   WA submission spam.py (Python 3) OK: WA [message: Your guess must be a valid door name, such as A., test case: test case secret/1-0, CPU: 0.01s @ test case secret/1-0]
   TLE submission forever-silent.py (Python 2 w/PyPy) OK: TLE [test case: test case secret/1-0, CPU: 4.00s @ test case secret/1-0]
askmarilyn tested: 0 errors, 0 warnings

thorehusfeldt commented 5 years ago

Hehe. The stubborn WA player ignore_positive_hint, who sternly refuses to take a positive hint from Marilyn, does maximally badly against the super-friendly Marilyn who always shows him where the beer is. “No beers for you.”

simonlindholm commented 5 years ago

Seems reasonable to submit a problemtools PR for! Have it read judgemessage.txt as well though and prefer that if it exists; teammessage.txt is rarer and generally gives more vague errors.
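Roughly what I mean by the preference order, as a sketch only. The function name `read_feedback` is made up for illustration, and this is not the code in verifyproblem or in any PR; it just assumes both files live in the same feedback directory.

```python
import os


def read_feedback(feedback_dir):
    """Return the validator's message from `feedback_dir`, or None.

    judgemessage.txt is preferred when it exists; teammessage.txt is
    used only as a fallback.
    """
    for name in ("judgemessage.txt", "teammessage.txt"):
        path = os.path.join(feedback_dir, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                return f.read().strip()
    return None  # the validator left no message at all
```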

thorehusfeldt commented 5 years ago

Solved by patching verifyproblem and updated feedback accordingly in 526f6ca629414be72e264fa567426aebdaae0fb5 . Patch submitted as PR at https://github.com/Kattis/problemtools/pull/140

thorehusfeldt / will-code-for-drinks-F2019
