rfeng2023 / mmcloud

0 stars 9 forks source link

Stuck at "initialize"? #50

Closed gaow closed 6 months ago

gaow commented 6 months ago

We have several jobs that stays in Initializing for a couple of hours like this:

image

What is curious to me is that the event log is really short and seems to "stuck" at 3 min after the initial submission, see below:

2024-02-24T00:50:06.138: Using cached instance params: Zone:,InstType:r5.large,CPU:2,Memory:16,OnDemand:false
2024-02-24T00:50:06.877: Job submitted
2024-02-24T00:50:06.893: Handling job at attempt #0 now...
2024-02-24T00:50:08.157: Job status changed: [Submitted -> Initializing]
2024-02-24T00:50:08.157: Creating VM with multiple instance type at retry #0 now...
2024-02-24T00:50:08.157: Ready to create instance with instType: r5.large, cpu: 2, mem: 16, zone: , payType: Spot
2024-02-24T00:50:14.162: Created instance i-000022d5327e25dcb(r5.large-Spot) at us-east-1b, waiting for it to initialize
2024-02-24T00:52:25.27: Registered one new host
2024-02-24T00:52:29.183: i-000022d5327e25dcb initialized
2024-02-24T00:52:29.183: Mounted s3:::us-east-1:statfungen:/ftp_fgc_xqtl:/home/bst2126/data:readonly to i-000022d5327e25dcb
2024-02-24T00:52:29.183: Mounted s3:::us-east-1:statfungen:/ftp_fgc_xqtl/analysis_result/finemapping_twas:/home/bst2126/output:readwrite to i-000022d5327e25dcb
2024-02-24T00:52:43.42: Created instance: i-000022d5327e25dcb(Spot)
2024-02-24T00:52:57.52: Creating float data volume now
2024-02-24T00:52:57.52: Ready to create volume with size: 20, path: /mnt/float-data, skip disk types :[]
2024-02-24T00:53:04.248: Created volume vol-070897f1c0b4713f4(gp2), size: 20, throughput: 0, iops: 100

So for about a couple of hours there are no further updates to this event file. It's hard for me to tell from the OpCenter log what else is going on for this particular job. The job in question is siq7gq2hcrrqc5evv8x1g in our new opcenter.

gaow commented 6 months ago

These 32 jobs stuck for 4.5 hours without going anywhere. I cancelled them manually and resubmitted. The resubmitted work instantly.

Ashley-Tung commented 6 months ago

Taking a look the job siq7gq2hcrrqc5evv8x1g you mentioned, the job log file does not match the job id. The log is for job id h4urgze0clu8087gnjuh5, but I will look at both.

So the instance logs for siq7gq2hcrrqc5evv8x1g exist, I had to go into the opcenter to find them. I see this in the fagent.log

time="2024-02-24T00:52:32.104" level=error msg="Failed to get faults from opcenter" error="Access to the engineering only API is not allowed" local-addr="0.0.0.0:443" oc-addr="172.31.17.115:443"
time="2024-02-24T00:52:32.104" level=error msg="Failed to get fault from opcenter" error="Access to the engineering only API is not allowed" local-addr="0.0.0.0:443" oc-addr="172.31.17.115:443"

Afterwards, it registers with the opcenter, then it seems we get nothing else from it until the instance is interrupted, which is much longer after the time you canceled it

time="2024-02-24T00:57:15.573" level=info msg="Registration succeeded." opcenter="172.31.17.115:443"
time="2024-02-24T12:20:41.062" level=info msg="Going to notify OpCenter" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.069" level=error msg="Failed to send notification request." error="Invalid argument, no job found for i-03abd568645fbe8e5" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.069" level=info msg="Going to notify OpCenter" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.075" level=error msg="Failed to send notification request." error="Invalid argument, no job found for i-03abd568645fbe8e5" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.075" level=info msg="Going to notify OpCenter" opcenter="172.31.17.115:443" type=interrupted
time="2024-02-24T12:20:41.082" level=error msg="Failed to send notification request." error="Invalid argument, no job found for i-03abd568645fbe8e5" opcenter="172.31.17.115:443" type=interrupted

For h4urgze0clu8087gnjuh5, you saw it stuck at volume setup. Looking at fagent.log, the agent just shuts down

time="2024-02-24T04:28:16.799" level=warning msg="Agent is going to shutdown"
time="2024-02-24T04:28:16.799" level=info msg="Ready to backup system logs"
time="2024-02-24T04:28:16.878" level=info msg="Backup system log successfully" out=
time="2024-02-24T04:28:17.028" level=info msg="Backup system log successfully" out=
time="2024-02-24T04:28:17.694" level=info msg="stopping the http(s) servers"
time="2024-02-24T04:28:17.695" level=info msg="http: Server closed" addr="0.0.0.0:443"
time="2024-02-24T04:28:17.695" level=info msg="Server stopped"

To summarize, it seems there is a communication issue between the instances and the OpCenter. Since I cannot find anything explicitly stated in the logs, I have let engineering know of this issue.

Ashley-Tung commented 6 months ago

Did a check on your opcenter. It seems the last time it was restarted was around 2024-02-24T00:57...

[admin]$ float status
id: 7cfc0037-3ac4-41fe-999d-da8d278c3108
Server Status: normal
API Request Status: normal
License Status: valid
Init Time: "2024-01-16T19:41:42Z"
Up Time: "2024-02-24T00:57:10Z"
Current Time: "2024-02-27T00:47:14Z"

This actually coincides with the time around the jobs not getting updated information. I believe we found the source of the problem.

Ashley-Tung commented 6 months ago

Update from Slack, the latest hotfix will address this