rfeng2023 / mmcloud


Submission commands hang #57

Closed: gaow closed this issue 6 months ago

gaow commented 7 months ago

Moving this from GitHub:

Sometimes the submission goes quickly and smoothly:

status: Submitted submitTime: "2024-03-07T01:59:48Z" duration: 0s queueTime: 0s

but sometimes, for example at this point:

status: Submitted submitTime: "2024-03-07T02:15:59Z" duration: 1m0s queueTime: 40s

each job takes a full minute to submit and spends a long time in the queue.

I wonder if it is possible for the opcenter to simply accept all jobs and, if it cannot process them immediately, keep them in the Submitted status until it can? It does not seem like good behavior for float itself to "hang" on the terminal at the user's end.
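For reference, the behavior being asked for amounts to not letting each float call block the terminal. Below is a minimal sketch of that submit-and-forget pattern from the client side; the `jobs/` layout and the `-j` flag are assumptions for illustration, not the exact mmcloud wrapper interface.

```bash
#!/usr/bin/env bash
# Minimal sketch of a submit-and-forget pattern on the client side.
# Assumptions (not confirmed by this thread): each job is described by a
# script under jobs/, and `float submit -j <script>` is the submission call;
# treat the flag and paths as placeholders for the real wrapper's invocation.
set -uo pipefail

mkdir -p logs
for script in jobs/*.sh; do
  name=$(basename "$script" .sh)
  # Background each submission so a busy opcenter cannot block the
  # interactive terminal; keep the CLI output for later inspection.
  nohup float submit -j "$script" > "logs/${name}.submit.log" 2>&1 &
done

echo "All submissions dispatched in the background; check logs/ for results."
```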

gaow commented 7 months ago

From @Ashley-Tung

Had a talk with engineering; the likely cause of the issue is this:

- The queue time is the delta between job submission and job initialization. It can be large because a single thread picks up queued jobs and dispatches them to other threads for processing; if the opcenter is busy, the dispatching thread may wait about 10 seconds.
- The busyness could come from high CPU load on the opcenter or from EBS performance.
- With the release of 2.5.1, there will be opcenter metrics that can better visualize this.
- For now, there is no plan to change the way the opcenter queues jobs. We ran a test on our end with 500+ jobs on a smaller opcenter and saw little to no queue time, so this may be a one-time or inconsistent issue that depends on how busy the opcenter is.
- My original suggestion still stands: using a terminal multiplexer to let the wrapper script run until all commands are complete is the best we can do right now (see the sketch below).
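As a concrete version of that last point, here is a small sketch of running the submission wrapper under tmux; `submit_all_jobs.sh` is a stand-in name for whatever script issues the float commands.

```bash
# Start a detached tmux session running the submission wrapper; it keeps
# running even if the SSH connection or the local terminal is closed.
tmux new-session -d -s mmcloud-submit \
  'bash submit_all_jobs.sh 2>&1 | tee submit_all_jobs.log'

# Reattach later to check progress, then detach again with Ctrl-b d.
tmux attach -t mmcloud-submit
```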

gaow commented 7 months ago

If the only reason for the delay is the opcenter being busy, then there is definitely a bug, because yesterday when I submitted, the opcenter was doing almost nothing -- perhaps running only a couple of jobs. This sounds like unexpected behavior that should be investigated.

This is not a one-time thing; it happens often enough to become annoying. We can use something like tmux, but that would not help when I need to turn off my computer before I sign off every day.

gaow commented 7 months ago

From @Ashley-Tung

A ticket has been filed for Engineering! Thank you for your patience; they will be taking a look at this shortly.

gaow commented 7 months ago

Update: we will try implementing #67 as a potential solution and see whether it works.

Ashley-Tung commented 7 months ago

[UPDATE] It has been some time since the original issue, but it has appeared again. This time, with the 2.5.2 opcenter metrics available, the likely cause of the most recent occurrence is the lower IOPS of the gp2 data volumes. We will wait for the current batch of jobs on the east-coast opcenter to finish before upgrading the data volumes to gp3.
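For context, an EBS data volume can be switched from gp2 to gp3 in place; a minimal sketch with the standard AWS CLI is below (the volume ID is a placeholder, and IOPS/throughput tuning flags are omitted).

```bash
# Change the data volume type in place; gp3 also allows provisioning IOPS
# and throughput independently of volume size.
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3

# Check the progress of the modification.
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0
```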

Ashley-Tung commented 6 months ago

[UPDATE] All three opcenters now have gp3 data volumes. @gaow and @rfeng2023, once new jobs have been submitted and we confirm there is no login latency and no long queue time, I will close this ticket.

Ashley-Tung commented 6 months ago

Hi @gaow, I saw that the latest batch of jobs did not experience a long queue time. Would you like me to close this ticket, or should we confirm again with another batch of jobs later?