Closed clevengr closed 3 years ago
Initial investigation showed that the socket timeout on the event feed socket is set to 130 seconds (in method RemoteContestAPIAdapter.connect()). This should mean that the event feed socket won't time out for a little over 2 minutes (whereas the above logs indicate a timeout is happening after around 10 seconds).
On further review, it appears that the SocketTimeoutException may not actually be coming from the event feed socket at all. The line of code referenced above,
files = remoteContestAPIAdapter.getRemoteSubmissionFiles("" + overrideSubmissionID);
in the RemoteEventFeedMonitor class calls getRemoteSubmissionFiles(String submissionID) in the RemoteContestAPIAdapter class. This in turn calls getRemoteSubmissionFiles(URL), which calls createConnection(url) to create a NEW connection -- not to the event feed, but rather to the /submissions/<submissionID>/files endpoint of the remote CCS API.
createConnection() in turn calls pc2.util.HTTPSSecurity.createConnection() to create this new connection -- and the last statement therein is
conn.setConnectTimeout(10000);
So the connection to the remote CCS is created with a 10-second timeout.
After creating the connection, method getRemoteSubmissionFiles(url) calls conn.getInputStream(), whose javadoc says:
Returns an input stream that reads from this open connection. A SocketTimeoutException can be thrown when reading from the returned input stream if the read timeout expires before data is available for read.
This seems like the likely source of the SocketTimeoutException which is causing submissions to be lost. Recommend increasing this timeout.
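The behavior described in that javadoc can be reproduced in isolation. The following is a minimal sketch, not PC2 code: the class name, the local "silent" server, and the 200 ms timeout are all assumptions made for illustration. It opens an HttpURLConnection to a local socket that accepts the TCP connection but never sends a response, so the read timeout (not the connect timeout) is what expires.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URL;

// Hypothetical demo class; not part of PC2.
class ReadTimeoutSketch {

    /** Returns true if reading from a silent local server trips the read timeout. */
    static boolean readTimesOut(int readTimeoutMillis) throws IOException {
        // The OS completes the TCP handshake for a listening socket even though
        // accept() is never called, so the connect succeeds but no data ever arrives.
        try (ServerSocket silentServer = new ServerSocket(0)) {
            URL url = new URL("http://127.0.0.1:" + silentServer.getLocalPort() + "/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(10000);          // limits connection establishment only
            conn.setReadTimeout(readTimeoutMillis); // limits waiting for response data
            try (InputStream in = conn.getInputStream()) {
                in.read();
                return false; // a response arrived (not expected with a silent server)
            } catch (SocketTimeoutException e) {
                return true;  // the read timeout expired before any data was available
            } finally {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

In this sketch the connect timeout never comes into play, because the connection itself is established immediately; only the wait for response data is bounded by setReadTimeout.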
I increased the timeout to 60 seconds on the connection which is being created by the call to getRemoteSubmissionFiles(URL). The logs still showed an occasional lost submission, and in every case it was due to a SocketTimeoutException, and the interval between the log messages indicating initiation of the connection and the log message showing the timeout exception was almost exactly 10 seconds (the same as what was documented in the logs previously attached to this Issue, above).
I added debug statements to method RemoteContestAPIAdapter.createConnection(URL), as follows:
HttpURLConnection conn = HTTPSSecurity.createConnection(url2, login, password);
//debug
System.err.println ("RemoteContestAPIAdapter: created connection to " + url2 + "; default timeout = " + conn.getConnectTimeout());
System.err.println ("RemoteContestAPIAdapter: setting connection timeout to 60 seconds");
conn.setReadTimeout(60000);
//debug
System.err.println ("RemoteContestAPIAdapter: connection timeout read back after setting = " + conn.getConnectTimeout());
The resulting output was:
RemoteContestAPIAdapter: created connection to https://nac21.kattis.com/clics-api/contests/nac21; default timeout = 10000
RemoteContestAPIAdapter: setting connection timeout to 60 seconds
RemoteContestAPIAdapter: connection timeout read back after setting = 10000
Note that the socket timeout is still 10 seconds even after invoking setReadTimeout(60000)!
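One thing worth ruling out before drawing conclusions from that output: the debug snippet above calls setReadTimeout(60000) but reads the value back with getConnectTimeout(). In the standard java.net.URLConnection these are two separate settings, so a read-back of 10000 would be expected here whether or not the new value took effect. The sketch below illustrates this with a plain URLConnection; the class name and the placeholder URL are assumptions, and no network traffic occurs because openConnection() only creates the connection object.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical demo class; not part of PC2.
class TimeoutSettingsSketch {

    /** Returns {connectTimeout, readTimeout} after setting only the read timeout. */
    static int[] timeoutsAfterSettingReadTimeout(int readTimeoutMillis) throws IOException {
        // Placeholder URL (an assumption); nothing is resolved or contacted here.
        URLConnection conn = new URL("http://example.invalid/").openConnection();
        conn.setConnectTimeout(10000);          // mirrors the 10-second value seen in the logs
        conn.setReadTimeout(readTimeoutMillis); // changes the read timeout only
        return new int[] { conn.getConnectTimeout(), conn.getReadTimeout() };
    }

    public static void main(String[] args) throws IOException {
        int[] t = timeoutsAfterSettingReadTimeout(60000);
        // prints "connect timeout = 10000, read timeout = 60000"
        System.out.println("connect timeout = " + t[0] + ", read timeout = " + t[1]);
    }
}
```

Logging conn.getReadTimeout() alongside conn.getConnectTimeout() in the debug statements would show which of the two values actually changed.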
I then looked at the JavaDoc for method setConnectTimeout(int) in Java 1.8 class URLConnection (which is the (abstract) parent class of HttpURLConnection); that doc is at https://docs.oracle.com/javase/8/docs/api/java/net/URLConnection.html#setConnectTimeout-int-.
That JavaDoc contains the following statement:
Some non-standard implementation of this method may ignore the specified timeout.
This issue is apparently being caused by a Java bug! (Ok, a "documented Java feature"...) This "feature" is being exposed when the PC2 dumpPacket processing which occurs on unrelated threads takes more time than the (default, apparently unchangeable) 10-second timeout.
The solution is going to have to involve checking for the timeout exception and repeatedly retrying the connection, hoping to get through at some point without hitting massive dumpPacket logging. (Either that, or reduce the amount of dumpPacket logging that we're doing...)
I've added a "retry up to 10 times" loop when a SocketTimeoutException occurs. I reran shadowing on the NAC21 contest four separate times. Every time, between 5 and 10 submissions encountered "timeout" exceptions. In every case the timeout exception was preceded by a large number of dumpPacket logging operations. In every case the "retry" loop successfully acquired the desired files from the remote CCS on the very first retry.
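For readers who want a picture of the approach, here is a minimal sketch of a retry-on-timeout loop. It is not the code from the PR: the class name, the retryOnTimeout helper, and the use of a Callable to stand in for the file fetch are all assumptions made for illustration.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper; the actual PC2 fix may be structured differently.
class RetryOnTimeoutSketch {

    /**
     * Runs fetch once and, if it throws SocketTimeoutException, retries it
     * up to maxRetries more times. Other exceptions propagate immediately.
     */
    static <T> T retryOnTimeout(Callable<T> fetch, int maxRetries) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        SocketTimeoutException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return fetch.call();
            } catch (SocketTimeoutException e) {
                last = e; // remember the failure and try again
                System.err.println("attempt " + (attempt + 1) + " timed out");
            }
        }
        throw last; // every attempt timed out
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated fetch: times out on the first call, succeeds on the second.
        String result = retryOnTimeout(() -> {
            if (calls[0]++ == 0) {
                throw new SocketTimeoutException("simulated timeout");
            }
            return "files";
        }, 10);
        System.out.println(result + " after " + calls[0] + " calls"); // prints "files after 2 calls"
    }
}
```

In the shadow, the Callable would wrap the call that fetches the submission files. Retrying only on SocketTimeoutException keeps other failures (for example, authentication errors) visible instead of masking them behind repeated attempts.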
I'll be pushing the code and submitting a PR for this fix.
Describe the issue:
The PC2 Shadow occasionally loses submissions which come from the remote CCS, which in turn sometimes causes failures in matching the scoreboards. (The reason it only sometimes causes scoreboard failures is that a lost submission for a problem might not be a submission that affects scoring.)
It appears that the loss of submissions may be related to the shadow throwing an exception when attempting to pull the FILES associated with the submission from the remote CCS. Specifically, I ran a shadow against a rerun of the NAC21 contest and, after detailed analysis, determined that the shadow received a notice from Kattis that there existed a submission with id 7519100, but no such submission was entered into PC2. The relevant log messages are given below; the complete log is attached below.
(Note: submission 7519100 is just one example; there were also other submissions lost in my test.)
Note that the log shows an exception was thrown immediately after the message "Fetching files from remote system". The exception message is "Exception parsing event...", but that is a bogus message (corrected in PR #284); the exception actually occurs while processing the "submission" event, not while parsing it.
Specifically, the exception appears to happen on the following line in the RemoteEventFeedMonitor:
files = remoteContestAPIAdapter.getRemoteSubmissionFiles("" + overrideSubmissionID);
MOST of the time this works fine (i.e., returns the files from the remote system corresponding to the specified submissionID). However, ONCE IN A WHILE it throws "SocketTimeoutException". (Specifically, for the test I ran it lost 4 submissions out of 737; see the attached file: NAC21.Rerun.8.19.21.LostSubmissions.txt .)
The log NAC21.Rerun.8.19.21.FEEDER1@site1-0.log shows two interesting facts:
dumpPacket log messages.
These may indicate that the socket timeout just isn't set to be long enough, and the extensive logging which is done by dumpPacket is causing the socket to time out.
To Reproduce: There doesn't seem to be any guaranteed way to reproduce this; it is timing-dependent. However, in three NAC runs, submissions were lost in all three runs, so it seems like it's likely to happen just by following the steps listed in PR #284.
Environment in which this was detected: A shadow running in Eclipse under Windows 10 and Java 1.8_201, with a server running on the AWS pc2 server, an Admin running on the AWS "System2" machine, and AutoJudges running on the AWS Judge1-Judge5 machines.
Possible solutions:
dumpPacket calls which are generating large amounts of logging and at the same time stopping other code from executing (the RemoteEventFeedMonitor is probably blocked on accessing the remote event feed while it's waiting for logging to complete...)