rfeng2023 / mmcloud


Jobs still "Initializing" even when the required VMs are available #35

Closed: gaow closed this issue 8 months ago

gaow commented 8 months ago

For some of the jobs in Initializing status, I checked job.events:

[screenshot: job.events output showing the available VM candidates]

You can see that there are many VMs available, yet the job ends up on an instance type that is not available:

2024-02-12T05:44:31.606: Found candidate instance type r7i.4xlarge, price 0.2660999894142151
2024-02-12T05:44:31.606: Found candidate instance type r7iz.2xlarge, price 0.34790000319480896
2024-02-12T05:44:31.606: Found candidate instance type r7iz.4xlarge, price 0.6819000244140625
2024-02-12T05:44:31.606: Found candidate instance type r7iz.xlarge, price 0.14480000734329224
2024-02-12T05:44:31.632: Determined instance params: Zone:,InstType:r6id.xlarge,CPU:4,Memory:32,OnDemand:false
2024-02-12T05:44:31.632: Ready to create instance with instType: r6id.xlarge, cpu: 4, mem: 32, zone: , payType: Spot
2024-02-12T05:45:10.655: Failed to created instance, error: No instance available (code: 8141)
2024-02-12T05:45:10.655: Creating VM with multiple instance type at retry #6 now...

I guess perhaps many of my other jobs are also retrying the same candidate instance, which creates a competing situation that makes it difficult for this job to get the correct instance. This seems like something the engineering side could improve: for jobs from the same opcenter in Initializing status (likely submitted by the same research team), would it be possible to collect the information across all of them and allocate resources to optimize all jobs jointly, rather than letting each job act without knowledge of the others and compete within one opcenter? (A rough sketch of the idea is below.)
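A minimal sketch of what such cross-job coordination could look like, assuming each pending job exposes its candidate (instance type, price) list and the opcenter has a rough per-type capacity estimate. The function name and data shapes here are hypothetical illustrations of the suggestion, not MMCloud's actual scheduler; the instance types and prices are taken from the job.events excerpt above.

# Hypothetical sketch: assign instance types across all pending jobs so that
# no single type is over-subscribed, instead of each job independently
# retrying its cheapest candidate.
from collections import defaultdict

def assign_instance_types(pending_jobs, capacity_estimate):
    """pending_jobs: {job_id: [(instance_type, price), ...]} candidates per job.
    capacity_estimate: {instance_type: estimated spot instances obtainable}."""
    assigned = {}
    load = defaultdict(int)  # jobs already pointed at each instance type
    for job_id, candidates in pending_jobs.items():
        # Prefer cheaper types, but skip any whose estimated capacity is used up.
        for inst_type, _price in sorted(candidates, key=lambda c: c[1]):
            if load[inst_type] < capacity_estimate.get(inst_type, 0):
                assigned[job_id] = inst_type
                load[inst_type] += 1
                break
        else:
            assigned[job_id] = None  # no capacity left; retry in a later pass
    return assigned

# Example with candidate types/prices from the job.events excerpt:
jobs = {
    "job-1": [("r7iz.xlarge", 0.1448), ("r7i.4xlarge", 0.2661)],
    "job-2": [("r7iz.xlarge", 0.1448), ("r7iz.2xlarge", 0.3479)],
}
print(assign_instance_types(jobs, {"r7iz.xlarge": 1, "r7i.4xlarge": 5, "r7iz.2xlarge": 5}))

The point is only that a shared view of per-type load lets jobs spread across candidates instead of all retrying the same cheapest instance type.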

gaow commented 8 months ago

As you can see, many jobs are still in this long Initializing queue, even though there are plenty of candidate instances still available for the specification I require:

[screenshot: job list showing many jobs still in Initializing status]

gaow commented 8 months ago

It's not obvious what the issue is. Eng suspects there are some quota constraints on our AWS account, but that does not seem likely. It is also unlikely that there is actually no such instance in the region; in fact, when testing from another person's account it works right away.

The error message from opcenter.log is

time="2024-02-18T02:18:52.144" level=info msg="Ready to create instance" AMI= instanceType=r6i.xlarge on-demand=false priceLimit=0 priceLimitPercent=0 publicIP=true zone=us-east-1b
time="2024-02-18T02:18:53.001" level=warning msg="Failed to create instance" error="MaxSpotInstanceCountExceeded: Max spot instance count exceeded\n\tstatus code: 400, request id: 047204fa-5b9a-4514-9707-9e1ad060413b"
time="2024-02-18T02:18:53.001" level=error msg="Failed to create cloud instance" error="No instance available (code: 8141)"

but at the time of this error we only had 121 spot VMs running.

On our end, we will try to submit from another opcenter in the same region, to see whether some per-opcenter limit is being imposed. MemVerge will keep looking into possible reasons behind it.
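One way to check whether the MaxSpotInstanceCountExceeded above is a regional quota issue is to compare the region's spot quota against current spot usage; note that the AWS spot limit is counted in vCPUs rather than instance count, so 121 running spot VMs can exceed it. Below is a minimal boto3 sketch, assuming configured AWS credentials and the standard "All Standard Spot Instance Requests" quota code L-34B43A08; adjust the region and quota code as needed.

# Sketch: compare the spot vCPU quota against vCPUs of running spot instances.
import boto3

REGION = "us-east-1"  # region named in the opcenter.log excerpt above

quotas = boto3.client("service-quotas", region_name=REGION)
ec2 = boto3.client("ec2", region_name=REGION)

# Quota code assumed to be the standard spot vCPU limit (L-34B43A08).
quota = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-34B43A08")
limit_vcpus = quota["Quota"]["Value"]

# Tally vCPUs of pending/running spot instances in this region.
used_vcpus = 0
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[
        {"Name": "instance-lifecycle", "Values": ["spot"]},
        {"Name": "instance-state-name", "Values": ["pending", "running"]},
    ]
):
    for reservation in page["Reservations"]:
        for inst in reservation["Instances"]:
            cpu = inst.get("CpuOptions", {})
            used_vcpus += cpu.get("CoreCount", 0) * cpu.get("ThreadsPerCore", 1)

print(f"{REGION}: spot vCPU quota = {limit_vcpus}, spot vCPUs in use = {used_vcpus}")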

Ashley-Tung commented 8 months ago

I believe we found out it was a quota constraint, due to the quota increase having been requested in the wrong region?
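If a quota increase was applied to the wrong region, listing the same quota per region makes the mismatch visible. A small sketch under the same assumptions as above (quota code L-34B43A08, example regions only):

# Sketch: print the spot vCPU quota in several candidate regions.
import boto3

for region in ["us-east-1", "us-east-2", "us-west-2"]:  # example regions
    sq = boto3.client("service-quotas", region_name=region)
    q = sq.get_service_quota(ServiceCode="ec2", QuotaCode="L-34B43A08")
    print(f"{region}: spot vCPU quota = {q['Quota']['Value']}")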

gaow commented 8 months ago

You are right @Ashley-Tung, it was a miscommunication on our end.