alahiff closed this issue 5 years ago
Tried a workflow involving a job with 40 instances, which all failed quickly (accidentally). All infrastructure was deployed then deleted successfully. Repeated with 60 quickly failing instances, and this was also all good.
Also ran another workflow with 60 longer running jobs (~1 hour), no problems.
Repeat tests once https://github.com/prominence-eosc/imc/issues/19 has been implemented
First tests with PostgreSQL backend, size of pool is 8:
No DB errors for 10 jobs, but got this for one job:
Exception deploying infrastructure: "string indices must be integers, not str"
No DB errors for 20 jobs, but one job seemed to be stuck in running (17705), which is unlikely to be related to IMC.
For 40 jobs, the first 30 deployed successfully (router has idle limit of 30). The condor_startds all appeared to join successfully, but jobs stayed in the ready state for a long time. Some startds then started dying on their own. Issue tracked here: https://github.com/prominence-eosc/prominence/issues/26. No DB issues.
Increased pool size to 16 (from 8):
For 160 jobs, got lots of these:
CRITICAL [imc] Deployment error, this is a bug: expected a string or other character buffer object
Also one of these:
CRITICAL [imc] Exception deploying infrastructure: "global name 'token' is not defined"
This error is probably from the "# Final check if we should delete the infrastructure" section, where utilities.create_im_auth is called with token as an argument but token is not defined in that scope.
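For anyone hitting the same thing, here is a minimal sketch of how this kind of error arises and one way to avoid it. It is not the imc code: create_im_auth below is a hypothetical stand-in (the real utilities.create_im_auth signature may differ), and the final_delete_check_* functions are made up for illustration. Note that Python 2 reports this as "global name 'token' is not defined", while Python 3 reports "name 'token' is not defined".

```python
def create_im_auth(cloud, token):
    """Hypothetical stand-in for utilities.create_im_auth (real signature may differ)."""
    return {"cloud": cloud, "token": token}


def final_delete_check_buggy(cloud):
    # Bug: 'token' is not a parameter, local or global here. Python only
    # notices when this line actually runs, so the error stays hidden
    # until the rarely used code path is exercised (e.g. under load).
    return create_im_auth(cloud, token)


def final_delete_check_fixed(cloud, token):
    # Fix: take the token as an explicit argument so the name is always bound.
    return create_im_auth(cloud, token)


def try_buggy(cloud):
    """Run the buggy variant and return the error message, or None if it succeeded."""
    try:
        final_delete_check_buggy(cloud)
    except NameError as err:
        return str(err)
    return None
```

Because the name is only looked up at call time, a code path like the final deletion check can carry this bug unnoticed until a large run reaches it, which would be consistent with it only showing up once in the 160 job test.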
Also encountered this: https://github.com/prominence-eosc/imc/issues/21
Also, there were some leftover infrastructures in IM. Note that I deleted around 50-70 in various states.
The token issue is fixed in https://github.com/prominence-eosc/imc/commit/3a1a8e80b3791ec79a81cc8d598ae51d81d43f52
Other improvements to deletion handling are also included in this commit.
Also trying:
The deployment exception seems to be fixed with https://github.com/prominence-eosc/imc/commit/a7ef686b913bc60e1f26f6f0f5f4c71ac9530504
320 jobs with a router max idle of 60 and a pool size of 24 had no problems.
Carry out some basic tests running many jobs across multiple clouds (this hasn't been done yet since the switch to REST API)