Shadow loses submissions due to SocketTimeoutException

clevengr commented 3 years ago

Describe the issue:
The PC2 Shadow occasionally loses submissions which come from the remote CCS, which in turn sometimes causes failures in matching the scoreboards. (The reason it only sometimes causes scoreboard failures is that a lost submission for a problem might not be a submission that affects scoring.)

It appears that the loss of submissions may be related to the shadow throwing an exception when attempting to pull the FILES associated with the submission from the remote CCS. Specifically, I ran a shadow against a rerun of the NAC21 contest and, after detailed analysis, determined that the shadow received a notice from Kattis that there existed a submission with id 7519100, but no such submission was entered into PC2. The relevant log messages are given below; the complete log is attached below.

(Note: submission 7519100 is just one example; there were also other submissions lost in my test.)

...
210819 125843.604|INFO|RemoteEventFeedMonitorThread|run|Got event string: {"type": "submissions", "id": "3d6956ed-8356-40e1-876d-1c0a0daaf43a", "op": "create", "data": {"id": "7519100", "time": "2021-08-14T15:50:45.454+00", "team_id": "336519", "problem_id": "nac21.thepaladin", "language_id": "cpp", "contest_time": "0:50:45.454", "files": [{"href": "/clics-api/contests/nac21/submissions/7519100/files", "mime": "application/zip"}]}}

210819 125843.604|INFO|RemoteEventFeedMonitorThread|run|Found event: submissions:{"type": "submissions", "id": "3d6956ed-8356-40e1-876d-1c0a0daaf43a", "op": "create", "data": {"id": "7519100", "time": "2021-08-14T15:50:45.454+00", "team_id": "336519", "problem_id": "nac21.thepaladin", "language_id": "cpp", "contest_time": "0:50:45.454", "files": [{"href": "/clics-api/contests/nac21/submissions/7519100/files", "mime": "application/zip"}]}}

210819 125843.604|INFO|RemoteEventFeedMonitorThread|run|Processing submission event: {"type": "submissions", "id": "3d6956ed-8356-40e1-876d-1c0a0daaf43a", "op": "create", "data": {"id": "7519100", "time": "2021-08-14T15:50:45.454+00", "team_id": "336519", "problem_id": "nac21.thepaladin", "language_id": "cpp", "contest_time": "0:50:45.454", "files": [{"href": "/clics-api/contests/nac21/submissions/7519100/files", "mime": "application/zip"}]}}

210819 125843.604|INFO|RemoteEventFeedMonitorThread|run|Found run 7519100 from team 336519: event= {"type": "submissions", "id": "3d6956ed-8356-40e1-876d-1c0a0daaf43a", "op": "create", "data": {"id": "7519100", "time": "2021-08-14T15:50:45.454+00", "team_id": "336519", "problem_id": "nac21.thepaladin", "language_id": "cpp", "contest_time": "0:50:45.454", "files": [{"href": "/clics-api/contests/nac21/submissions/7519100/files", "mime": "application/zip"}]}}

210819 125843.604|INFO|RemoteEventFeedMonitorThread|run|Fetching files from remote system using id 7519100
...
210819 125853.620|WARNING|RemoteEventFeedMonitorThread|run|Exception parsing event: java.lang.RuntimeException: java.net.SocketTimeoutException: connect timed out: event = {"type": "submissions", "id": "3d6956ed-8356-40e1-876d-1c0a0daaf43a", "op": "create", "data": {"id": "7519100", "time": "2021-08-14T15:50:45.454+00", "team_id": "336519", "problem_id": "nac21.thepaladin", "language_id": "cpp", "contest_time": "0:50:45.454", "files": [{"href": "/clics-api/contests/nac21/submissions/7519100/files", "mime": "application/zip"}]}}
...
210819 125853.764|INFO|RemoteEventFeedMonitorThread|run|Found event: judgements:{"type": "judgements", "id": "07da0c4e-cc30-4f13-9393-f70263007b19", "op": "update", "data": {"id": "9896172", "end_time": "2021-08-14T15:50:46.917+00", "start_time": "2021-08-14T15:50:39.600+00", "max_run_time": 0.179, "submission_id": "7519099", "end_contest_time": "0:50:46.917", "judgement_type_id": "AC", "start_contest_time": "0:50:39.600"}}

210819 125853.764|INFO|RemoteEventFeedMonitorThread|run|Found judgement event

210819 125853.780|INFO|RemoteEventFeedMonitorThread|run|Got event string: {"type": "judgements", "id": "b286fadd-2ba4-4ccb-9115-658bb72e8001", "op": "create", "data": {"id": "9896173", "end_time": null, "start_time": "2021-08-14T15:50:47.178+00", "max_run_time": null, "submission_id": "7519100", "end_contest_time": null, "judgement_type_id": null, "start_contest_time": "0:50:47.178"}}

210819 125853.780|INFO|RemoteEventFeedMonitorThread|run|Found event: judgements:{"type": "judgements", "id": "b286fadd-2ba4-4ccb-9115-658bb72e8001", "op": "create", "data": {"id": "9896173", "end_time": null, "start_time": "2021-08-14T15:50:47.178+00", "max_run_time": null, "submission_id": "7519100", "end_contest_time": null, "judgement_type_id": null, "start_contest_time": "0:50:47.178"}}

210819 125853.780|INFO|RemoteEventFeedMonitorThread|run|Found judgement event
...
210819 125853.860|INFO|RemoteEventFeedMonitorThread|run|Got event string: {"type": "judgements", "id": "ed116abb-f23d-495f-a153-8e51006db5a2", "op": "update", "data": {"id": "9896173", "end_time": "2021-08-14T15:50:49.245+00", "start_time": "2021-08-14T15:50:47.178+00", "max_run_time": 0.003, "submission_id": "7519100", "end_contest_time": "0:50:49.245", "judgement_type_id": "WA", "start_contest_time": "0:50:47.178"}}
210819 125853.860|INFO|RemoteEventFeedMonitorThread|run|Found event: judgements:{"type": "judgements", "id": "ed116abb-f23d-495f-a153-8e51006db5a2", "op": "update", "data": {"id": "9896173", "end_time": "2021-08-14T15:50:49.245+00", "start_time": "2021-08-14T15:50:47.178+00", "max_run_time": 0.003, "submission_id": "7519100", "end_contest_time": "0:50:49.245", "judgement_type_id": "WA", "start_contest_time": "0:50:47.178"}}

210819 125853.860|INFO|RemoteEventFeedMonitorThread|run|Found judgement event
...
210819 125853.876|INFO|RemoteEventFeedMonitorThread|run|Got event string: {"type": "commentary", "id": "2df3a479-49c8-4c7a-906a-791ef08c7eb6", "op": "create", "data": {"id": "227337", "time": "2021-08-14T15:50:50.529+00", "message": "#t336519(University of Texas at Austin) fails again on #pnac21.thepaladin due to WA", "team_ids": ["336519"], "problem_ids": ["nac21.thepaladin"], "contest_time": "0:50:50.529"}}
210819 125853.876|INFO|RemoteEventFeedMonitorThread|run|Found event: commentary:{"type": "commentary", "id": "2df3a479-49c8-4c7a-906a-791ef08c7eb6", "op": "create", "data": {"id": "227337", "time": "2021-08-14T15:50:50.529+00", "message": "#t336519(University of Texas at Austin) fails again on #pnac21.thepaladin due to WA", "team_ids": ["336519"], "problem_ids": ["nac21.thepaladin"], "contest_time": "0:50:50.529"}}
210819 125853.876|INFO|RemoteEventFeedMonitorThread|run|Ignoring commentary event

Note that the log shows an exception was thrown immediately after the message "Fetching files from remote system". The exception message is "Exception parsing event...", but that is a bogus message (corrected in PR #284); the exception actually occurs while processing the "submission" event, not while parsing it.

Specifically, the exception appears to happen on the following line in the RemoteEventFeedMonitor:

 files = remoteContestAPIAdapter.getRemoteSubmissionFiles("" + overrideSubmissionID);

MOST of the time this works fine (i.e., returns the files from the remote system corresponding to the specified submissionID). However, ONCE IN A WHILE it throws "SocketTimeoutException". (Specifically, for the test I ran it lost 4 submissions out of 737; see the attached file: NAC21.Rerun.8.19.21.LostSubmissions.txt .)

The log NAC21.Rerun.8.19.21.FEEDER1@site1-0.log shows two interesting facts:

There is a fairly lengthy delay between the most recent log message which actually indicates a remote system access preceding the exception (delays of more than 5 seconds, in some case over 10 seconds);
every one of the SocketTimeout exception messages is preceded by a lengthy set of dumpPacket log messages.

These may indicate that the socket timeout just isn't set to be long enough, and the extensive logging which is done by dumppacket is causing the socket to time out.

To Reproduce: There doesn't seem to be any guaranteed way to reproduce this; it is timing-dependent. However, in three NAC runs, submissions were lost in all three runs, so it seems like it's likely to happen just by following the steps listed in PR #284.

Environment in which this was detected: A shadow running in Eclipse under Windows 10 and java 1.8_201, with a server running on the AWS pc2 server, an Admin running on the AWS "System2" machine, and AutoJudges running on the AWS Judge1-Judge5 machines.

Possible solutions:

Find where the event feed socket timeout is set, and increase it.
Consider removing all the dumpPacket calls which are generating large amounts of logging and at the same time stopping other code from executing (the RemoteEventFeedMonitor is probably blocked on accessing the remote event feed while it's waiting for logging to complete...)

clevengr commented 3 years ago

Initial investigation showed that the socket timeout on the event feed socket is set to 130 seconds (in method RemoteContestAPIAdapter.connect()). This should mean that the event feed socket won't time out for a little over 2 minutes (whereas the above logs indicate a timeout is happening after around 10 seconds).

On further review, it appears that the SocketTimeoutException may not actually be coming from the event feed socket at all. The line of code referenced above,

files = remoteContestAPIAdapter.getRemoteSubmissionFiles("" + overrideSubmissionID);

in the RemoteEventFeedMonitor class calls getRemoteSubmissionFiles(String submissionID) in the RemoteContestAPIAdapter class. This in turn calls getRemoteSubmissionFiles(URL), which calls createConnection(url) to create a NEW connection -- not to the event feed, but rather to the /submissions/<submissionID>/files endpoint of the remote CCS API.

createConnection() in turn calls pc2.util.HTTPSSecurity.createConnection() to create this new connection -- and the last statement therein is

            conn.setConnectTimeout(10000);

So the connection to the remote CCS endpoint DOES have a 10-second timeout.

After creating the connection, method getRemoteSubmissionFiles(url) call conn.getInputStream(), whose javadoc says:

Returns an input stream that reads from this open connection.A SocketTimeoutException can be thrown when reading from the returned input stream if the read timeout expires before data is available for read.

This seems like the likely source of the SocketTimeoutException which is causing submissions to be lost. Recommend increasing this timeout.

clevengr commented 3 years ago

I increased the timeout to 60 seconds on the connection which is being created by the call to getRemoteSubmissionFiles(URL). The logs still showed an occasional lost submission, and in every case it was due to a SocketTimeoutException, and the interval between the log messages indicating initiation of the connection and the log message showing the timeout exception was almost exactly 10 seconds (the same as what was documented in the logs previously attached to this Issue, above).

I added debug statements to method RemoteContestAPIAdapter.createConnection(URL), as follows:

            HttpURLConnection conn = HTTPSSecurity.createConnection(url2, login, password);

            //debug
            System.err.println ("RemoteContestAPIAdapter: created connection to " + url2 + "; default timeout = " + conn.getConnectTimeout());
            System.err.println ("RemoteContestAPIAdapter: setting connection timeout to 60 seconds");

            conn.setReadTimeout(60000);

            //debug
            System.err.println ("RemoteContestAPIAdapter: connection timeout read back after setting = " + conn.getConnectTimeout());

The resulting output was:

RemoteContestAPIAdapter: created connection to https://nac21.kattis.com/clics-api/contests/nac21; default timeout = 10000
RemoteContestAPIAdapter: setting connection timeout to 60 seconds
RemoteContestAPIAdapter: connection timeout read back after setting = 10000

Note that the socket timeout is still 10 seconds even after invoking setReadTimeout(60000)!

I then looked at the JavaDoc for method setConnectTimeout(int), in Java 1.8 class URLConnection , which is the (abstract) parent class of HttpURLConnection); that doc is at https://docs.oracle.com/javase/8/docs/api/java/net/URLConnection.html#setConnectTimeout-int-.

That JavaDoc contains the following statement:

Some non-standard implementation of this method may ignore the specified timeout.

This issue is apparently being caused by a Java bug! (Ok, a "documented java feature"... ) This "feature" is being exposed when the PC2 dumpPacket processing which occurs on unrelated threads takes more time than the (default, apparently unchangeable) 10-second timeout.

The solution is going to have to involve checking for the timeout exception and repeatedly retrying the connection, hoping to get through at some point without hitting massive dumpPacket logging. (Either that, or reduce the amount of dumpPacket logging that we're doing...)

clevengr commented 3 years ago

I've added a "retry up to 10 times" look when a SocketTimeoutException occurs. I reran shadowing on the NAC21 contest four separate times. Every time, between 5 and 10 submissions encountered "timeout" exceptions. In every case the timeout exception was preceded by a large number of dumpPacket logging operations. In every case the "retry" loop successfully acquired the desired files from the remote CCS on the very first retry.

I'll be pushing the code and submitting a PR for this fix.

pc2ccs / pc2v9

Shadow loses submissions due to SocketTimeoutException #286