Closed gaow closed 8 months ago
As you can see, many jobs are still stuck in this long Initializing queue, even though plenty of candidate instances are still available for the specification I require.
It's not obvious what the issue is. Eng suspects that there are some quota constraints on our AWS account, but that does not seem likely. It is also unlikely that there is actually no such instance in the region. In fact, when tested from another person's account, it works right away.
The error message from opcenter.log is:
time="2024-02-18T02:18:52.144" level=info msg="Ready to create instance" AMI= instanceType=r6i.xlarge on-demand=false priceLimit=0 priceLimitPercent=0 publicIP=true zone=us-east-1b
time="2024-02-18T02:18:53.001" level=warning msg="Failed to create instance" error="MaxSpotInstanceCountExceeded: Max spot instance count exceeded\n\tstatus code: 400, request id: 047204fa-5b9a-4514-9707-9e1ad060413b"
time="2024-02-18T02:18:53.001" level=error msg="Failed to create cloud instance" error="No instance available (code: 8141)"
But at the time of this error, we had only 121 spot VMs running.
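To see how often each failure occurs across a longer opcenter.log, the key=value lines above can be tallied with a short script. This is only a sketch: it assumes every line follows the same `key=value` / `key="quoted value"` layout as the three lines shown, and the helper names `parse_line` and `error_codes` are made up for illustration.

```python
import re
from collections import Counter

# One key=value pair; the value is either a double-quoted string
# (backslash escapes allowed) or a run of non-space characters (possibly empty).
FIELD = re.compile(r'([\w-]+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_line(line):
    """Split one log line into a dict of its fields (hypothetical helper)."""
    fields = {}
    for key, value in FIELD.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        fields[key] = value
    return fields


def error_codes(lines):
    """Count failures by error code for warning/error lines (hypothetical helper).

    If the error text starts with a 'Code:' prefix, as in
    'MaxSpotInstanceCountExceeded: ...', that prefix is used as the key;
    otherwise the whole message is used.
    """
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        err = fields.get("error")
        if not err or fields.get("level") not in ("warning", "error"):
            continue
        prefix = re.match(r"(\w+):", err)
        counts[prefix.group(1) if prefix else err] += 1
    return counts
```

Running `error_codes(open("opcenter.log"))` on the real file would give a per-error count, which makes it easier to tell a quota problem (many MaxSpotInstanceCountExceeded entries) from a genuine capacity shortage.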
On our end, we will try submitting from another opcenter in the same region, to see whether some per-opcenter limit is being imposed. MemVerge will keep looking into possible reasons behind it.
I believe we found out it was quota constraints due to the quota increase being in the wrong region?
You are right @Ashley-Tung, it was a miscommunication on our end.
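For anyone hitting the same MaxSpotInstanceCountExceeded error, comparing the Spot quota across regions is a quick way to catch an increase that landed in the wrong one. The sketch below uses the AWS Service Quotas API through boto3. Two assumptions to verify before relying on it: the quota code `L-34B43A08` is assumed to be the one for standard-family Spot Instance requests (check it in the Service Quotas console), and the returned value is expected to be counted in vCPUs rather than in instances, so 121 running VMs of 4 vCPUs each would count as 484 against it. The function names are illustrative.

```python
# Assumed quota code for "All Standard (A, C, D, H, I, M, R, T, Z) Spot
# Instance Requests"; confirm it in the Service Quotas console.
SPOT_STANDARD_QUOTA_CODE = "L-34B43A08"


def default_client(region):
    """Create a Service Quotas client for one region."""
    import boto3  # imported lazily so the helper can be exercised without AWS access

    return boto3.client("service-quotas", region_name=region)


def spot_quota_by_region(regions, make_client=default_client):
    """Return {region: applied Spot quota value} for each region given."""
    quotas = {}
    for region in regions:
        client = make_client(region)
        resp = client.get_service_quota(
            ServiceCode="ec2", QuotaCode=SPOT_STANDARD_QUOTA_CODE
        )
        quotas[region] = resp["Quota"]["Value"]
    return quotas
```

Calling `spot_quota_by_region(["us-east-1", "us-east-2"])` with valid credentials would show at a glance whether the region the opcenter runs in actually received the increase.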
For some of the jobs under status Initializing, I checked the job.events: you can see that there are many VMs available, but the job ends up on an instance that is not available.
I guess many of my other jobs are also retrying the same candidate instance, which creates competition that makes it difficult for this job to get the correct instance? This seems like something the engineering side can improve. Basically, for jobs from the same opcenter under Initializing status (likely from the same research team), would it be possible to collect all of the information and allocate resources to optimize across all jobs, rather than having jobs compete within one opcenter without knowing about each other?
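To make the request concrete, here is a toy sketch of what coordinated allocation could look like. It is not how the opcenter scheduler works; the function `assign_jobs`, its inputs, and the capacity numbers are all hypothetical. The idea is simply to look at every Initializing job at once and hand out candidate instance types from a shared pool, most constrained jobs first, so that jobs do not all retry the same type.

```python
def assign_jobs(jobs, capacity):
    """Assign each job an instance type from a shared capacity pool.

    jobs:     {job_id: [acceptable instance types, in preference order]}
    capacity: {instance_type: number of instances believed to be available}
    Returns   {job_id: instance_type, or None if nothing is left for that job}
    """
    remaining = dict(capacity)  # copy so the caller's view is not modified
    assignment = {}
    # Jobs with the fewest acceptable types go first, so flexible jobs
    # do not take the only type a constrained job can use.
    for job_id in sorted(jobs, key=lambda j: len(jobs[j])):
        assignment[job_id] = None
        for itype in jobs[job_id]:
            if remaining.get(itype, 0) > 0:
                remaining[itype] -= 1
                assignment[job_id] = itype
                break
    return assignment
```

For example, with one r6i.xlarge and one r5.xlarge free, a job that only accepts r6i.xlarge gets it, and a job that accepts either type is steered to r5.xlarge instead of competing for the same instance.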