Closed gaow closed 6 months ago
These 32 jobs were stuck for 4.5 hours without going anywhere. I cancelled them manually and resubmitted. The resubmitted jobs worked instantly.
Taking a look at the job siq7gq2hcrrqc5evv8x1g you mentioned, the job log file does not match the job ID. The log is for job ID h4urgze0clu8087gnjuh5, but I will look at both.
The instance logs for siq7gq2hcrrqc5evv8x1g do exist; I had to go into the OpCenter to find them. I see this in fagent.log:
time="2024-02-24T00:52:32.104" level=error msg="Failed to get faults from opcenter" error="Access to the engineering only API is not allowed" local-addr="0.0.0.0:443" oc-addr="172.31.17.115:443"
time="2024-02-24T00:52:32.104" level=error msg="Failed to get fault from opcenter" error="Access to the engineering only API is not allowed" local-addr="0.0.0.0:443" oc-addr="172.31.17.115:443"
Afterwards, the agent registers with the OpCenter, and then we get nothing else from it until the instance is interrupted, which is long after the time you cancelled the job:
time="2024-02-24T00:57:15.573" level=info msg="Registration succeeded." opcenter="172.31.17.115:443"
time="2024-02-24T12:20:41.062" level=info msg="Going to notify OpCenter" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.069" level=error msg="Failed to send notification request." error="Invalid argument, no job found for i-03abd568645fbe8e5" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.069" level=info msg="Going to notify OpCenter" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.075" level=error msg="Failed to send notification request." error="Invalid argument, no job found for i-03abd568645fbe8e5" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.075" level=info msg="Going to notify OpCenter" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.082" level=error msg="Failed to send notification request." error="Invalid argument, no job found for i-03abd568645fbe8e5" opcenter="172.31.17.115:443" type=interrupted
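The entries above are key=value ("logfmt"-style) lines, so they are easy to pick apart when hunting for error-level events across many instance logs. A minimal sketch (assuming the fagent.log format shown above, with values either bare or double-quoted):

```python
import re
from datetime import datetime

# Matches key=value pairs where the value is either "quoted" or bare.
FIELD_RE = re.compile(r'(\w[\w-]*)=("([^"]*)"|\S+)')

def parse_line(line):
    """Parse one fagent.log line into a dict of its key=value fields."""
    fields = {}
    for key, raw, quoted in FIELD_RE.findall(line):
        # Quoted values keep their inner text; bare values are taken as-is.
        fields[key] = quoted if raw.startswith('"') else raw
    return fields

# One of the lines quoted above, verbatim:
line = ('time="2024-02-24T12:20:41.069" level=error '
        'msg="Failed to send notification request." '
        'error="Invalid argument, no job found for i-03abd568645fbe8e5" '
        'opcenter="172.31.17.115:443" type=interrupted')

rec = parse_line(line)
ts = datetime.fromisoformat(rec["time"])
print(rec["level"], "-", rec["error"])
print(ts.hour, ts.minute)
```

Filtering a whole log for `rec["level"] == "error"` and sorting by the parsed timestamp makes the gap between registration and the first failed notification stand out immediately.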
For h4urgze0clu8087gnjuh5, you saw it stuck at volume setup. Looking at fagent.log, the agent just shuts down:
time="2024-02-24T04:28:16.799" level=warning msg="Agent is going to shutdown"
time="2024-02-24T04:28:16.799" level=info msg="Ready to backup system logs"
time="2024-02-24T04:28:16.878" level=info msg="Backup system log successfully" out=
time="2024-02-24T04:28:17.028" level=info msg="Backup system log successfully" out=
time="2024-02-24T04:28:17.694" level=info msg="stopping the http(s) servers"
time="2024-02-24T04:28:17.695" level=info msg="http: Server closed" addr="0.0.0.0:443"
time="2024-02-24T04:28:17.695" level=info msg="Server stopped"
To summarize, there appears to be a communication issue between the instances and the OpCenter. Since nothing in the logs states the cause explicitly, I have let engineering know about this issue.
Did a check on your opcenter. It seems the last time it was restarted was around 2024-02-24T00:57...
[admin]$ float status
id: 7cfc0037-3ac4-41fe-999d-da8d278c3108
Server Status: normal
API Request Status: normal
License Status: valid
Init Time: "2024-01-16T19:41:42Z"
Up Time: "2024-02-24T00:57:10Z"
Current Time: "2024-02-27T00:47:14Z"
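Putting the float status timestamps next to the fagent.log timestamps from above makes the correlation concrete. A small sketch using the values quoted verbatim in this thread (fractional seconds dropped for brevity):

```python
from datetime import datetime, timezone

# "Up Time" from `float status`: the OpCenter last restarted here.
restart = datetime(2024, 2, 24, 0, 57, 10, tzinfo=timezone.utc)
# "Registration succeeded." in fagent.log for siq7gq2hcrrqc5evv8x1g.
registered = datetime(2024, 2, 24, 0, 57, 15, tzinfo=timezone.utc)
# "Going to notify OpCenter ... interrupted" in the same log.
interrupted = datetime(2024, 2, 24, 12, 20, 41, tzinfo=timezone.utc)

# The agent registered only seconds after the OpCenter came back up,
# i.e. the OpCenter restarted while this job was still initializing --
# which would explain the later "no job found" errors.
print((registered - restart).total_seconds())      # seconds between restart and registration
print((interrupted - registered))                  # the long silent window that followed
```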
This coincides with the window in which the jobs stopped receiving updated information. I believe we have found the source of the problem.
Update from Slack: the latest hotfix will address this.
We have several jobs that stay in Initializing for a couple of hours, like this:
What is curious to me is that the event log is really short and seems to be "stuck" at 3 minutes after the initial submission; see below:
So for a couple of hours there are no further updates to this event file. It's hard for me to tell from the OpCenter log what else is going on for this particular job. The job in question is siq7gq2hcrrqc5evv8x1g in our new opcenter.
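Since the only symptom visible from outside is an event file that stops updating, one way to catch this earlier is to flag any job whose most recent event is older than some threshold. A hypothetical sketch (the job IDs, timestamps, and 30-minute threshold are illustrative, not real OpCenter data or API):

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: an Initializing job quiet for this long is suspicious.
STUCK_AFTER = timedelta(minutes=30)

def stuck_jobs(last_event, now):
    """Return the job IDs whose latest event is older than STUCK_AFTER.

    `last_event` maps job id -> timestamp of its most recent event,
    however those events were collected (e.g. parsed from event files).
    """
    return [job for job, ts in last_event.items() if now - ts > STUCK_AFTER]

now = datetime(2024, 2, 24, 3, 0, tzinfo=timezone.utc)
last_event = {
    "siq7gq2hcrrqc5evv8x1g": datetime(2024, 2, 24, 0, 55, tzinfo=timezone.utc),
    "healthy-job-example": now - timedelta(minutes=5),
}
print(stuck_jobs(last_event, now))
```

Running something like this on a schedule would surface a stalled job after 30 minutes instead of hours.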