**Open** — clevengr opened this issue 3 years ago
An approach similar to what's described in this issue was apparently tried at this year's International Olympiad in Informatics (IOI). Here's a rough description of what they did:
First, they set an execution time limit for each test case ("run") at 150% of the problem-specified Time Limit (so if the problem TL was, say, 10 seconds, each run was allowed to execute for 15 seconds). The value 150% was chosen empirically, based on testing they did. I was told they would have been comfortable going as low as 130%, but they felt anything lower risked missing boundary cases. I didn't inquire further as to exactly what testing produced the 130% and 150% values -- but clearly this is a tunable parameter.
If a run continued executing for the entire execution time (that is, was still running when 150% of the problem-specified Time Limit was reached), they declared that it was extremely unlikely that platform variances were what caused that, and they marked the run "TLE".
For runs which exceeded the problem-specified Time Limit but finished within 150% of it, they declared it "possible" that the run could have completed within the problem-specified Time Limit (that is, at or under 100%), but that platform variances (other programs getting swapped in/out, cache effects, etc.) prevented it from doing so. In this case they re-executed the run a specified number of times (they used 5, but again that's a tunable parameter) on the same machine. If any one of the (five) additional runs completed within the problem-specified Time Limit (that is, within 100%), they accepted the run as having completed in time. (Of course, they then forwarded it for other judging criteria, such as Wrong Answer.)
So basically their strategy was "if it got a TLE, but it did stop and did so within 150% of the Time Limit, give it several more opportunities to make it under the time limit". While this doesn't guarantee the elimination of platform variances as a factor, if you run it enough times it's highly likely to do so. (The question of course is, how many times is "enough times"? They felt that 5 times with 150% margin was sufficient...)
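The IOI-style policy described above could be sketched roughly as follows. This is only an illustration, not PC2 code: `judge()` and the `rerun` supplier are hypothetical stand-ins for PC2's actual execution machinery, and the 150%/5-retry constants are the tunable parameters the description mentions.

```java
import java.util.function.Supplier;

public class RetryPolicy {
    static final double HARD_LIMIT_FACTOR = 1.5; // tunable: 150% of the problem Time Limit
    static final int MAX_RETRIES = 5;            // tunable: number of re-executions

    enum Verdict { TLE, WITHIN_LIMIT }

    /**
     * firstRunMillis: elapsed time of the initial execution.
     * rerun: hypothetical hook that re-executes the submission and returns elapsed millis.
     */
    static Verdict judge(long timeLimitMillis, long firstRunMillis, Supplier<Long> rerun) {
        if (firstRunMillis <= timeLimitMillis) {
            return Verdict.WITHIN_LIMIT;                 // no TLE at all
        }
        if (firstRunMillis >= (long) (timeLimitMillis * HARD_LIMIT_FACTOR)) {
            return Verdict.TLE;                          // still running at 150%: definite TLE
        }
        // Between 100% and 150%: platform variance is plausible, so retry.
        for (int i = 0; i < MAX_RETRIES; i++) {
            if (rerun.get() <= timeLimitMillis) {
                return Verdict.WITHIN_LIMIT;             // any one success counts
            }
        }
        return Verdict.TLE;                              // all retries also exceeded the limit
    }
}
```

For example, with a 10-second limit, a 12-second first run would trigger up to five reruns, while a 15-second first run would be marked TLE immediately.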
Feature Description: With the increasing use of PC2 in a "virtual contest" environment (that is, where PC2 modules run "in the cloud" instead of on hardware provided "on-site"), there is an increased potential for inaccurate timing during execution of submissions. For example, two consecutive submissions may end up executing on AutoJudges which run on virtual machines with different, unknown characteristics -- one of which runs faster than the other due to cloud operation overhead, for example.
This issue is not likely to be significant for most submissions -- but the one place it could matter is with submissions that fail due to "Time Limit Exceeded" (a program might have failed on one VM whereas it would have succeeded on another VM, or even on the same VM if run at a different point in time when the cloud resources weren't as loaded). In particular, the thing which "kills" a submission execution when it hits the problem time limit is an instance of the PC2 class ExecuteTimer; there is potentially a lot of variability both in the JVM overhead for invoking the ExecuteTimer's actionPerformed() method and in the execution of the code within actionPerformed() itself.
One solution to this problem could be to provide support for multiple executions of submissions which fail due to "TLE". For example, if a submission receives a TLE, it could be executed (say) five more times, and the actual elapsed time for each of the executions could be compared. If the variance between these times is small (say, within 10%?) then it's likely that all the executions received "equally fair" amounts of time, and therefore the submission really did hit a "time limit exceeded". If on the other hand there is wide variance in the actual run times, it's likely that some of the executions suffered more from system overhead than did others -- meaning the submission might have been able to execute within the time limit on a less-loaded machine.
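The variance check proposed above might be sketched as below. This is a hypothetical illustration, not an existing PC2 API: the class and method names are invented, and the 10% tolerance is the tunable value the paragraph suggests.

```java
public class VarianceCheck {
    static final double TOLERANCE = 0.10; // tunable: allowed spread relative to the fastest run

    /**
     * Given the elapsed times of the re-executions, returns true if they all lie
     * within TOLERANCE of the fastest one -- i.e. the executions received roughly
     * "equally fair" amounts of time, so the TLE verdict is probably genuine.
     */
    static boolean withinTolerance(long[] elapsedMillis) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long t : elapsedMillis) {
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        // Spread between slowest and fastest run, compared against the tolerance.
        return (max - min) <= min * TOLERANCE;
    }
}
```

With a 10% tolerance, elapsed times of {10.0s, 10.2s, 10.5s} would be treated as a genuine TLE, while {10.0s, 12.0s, 10.2s} would indicate platform variance.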
Do you have any specific suggestions for how your feature would be implemented in PC^2?
It remains an item for discussion what to do if the variance is NOT within the Tolerance.