Closed tbugfinder closed 6 years ago
What's the best practice referenced? Is there any link?
Any feedback about this?
That's indeed the link AWS support is referring to.
I've just uploaded a snapshot implementing both submit requests throttler and fixing #709. This uses a brand new thread pool able to scale down when a TooManyRequestExecption
is raised and gradually scaling up the submission rate, when there are no errors.
On the first tests it looks working fine. Using a remote connection is was able to submit to 20 job / secs (starting from 200/sec).
I guess there's still margin to improve it fine tuning some thread pool parameters for real-world workloads. These are the available parameters with the current defaults:
executor {
submitter {
rateLimit = '100/s'
autoThrottle = true
poolSize = <num cpus * 5>
queueSize = 5000
maxPoolSize = <same as poolSize>,
maxRetries = <unlimited>
keepAlive = '1 min'
errorBurstDelay = '1 sec'
backOffMinRate = 0
backOffFactor = 2
rampUpInterval = 100,
rampUpFactor = 1.2
rampUpMaxRate <unlimited>
}
}
To use this snapshot use the following command:
CAPSULE_RESET=1 NXF_VER=0.31.0-SNAPSHOT nextflow info
it should print,
Version: 0.31.0-SNAPSHOT build 4897
Modified: 05-07-2018 16:57 UTC (18:57 CEST)
Then run:
NXF_VER=0.31.0-SNAPSHOT nextflow run .. etc
Should the setting poolSize = <num cpus * 5>
be based on the submission hosts specs?
Unfortunately it ended up in TooManyRequestsException (queueSize = 500000
- run as many jobs as possible).
Jul-05 18:56:30.878 [ThrottlerExecutor-39] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 11.61/sec
Jul-05 18:56:30.974 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'runROCS (21c2ccad-f6ba-4305-8437-0acd3cf3f211_Acylation_6.smi_aaadf_split_0000002.oeb.gz_8772
1732)'
Caused by:
Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 21210e13-8085-11e8-b137-132bbfe4e0ee)
com.amazonaws.services.batch.model.AWSBatchException: Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 21210e13-8085-11e8-b137-132bbfe4e0ee
)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1632)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.batch.AWSBatchClient.doInvoke(AWSBatchClient.java:1268)
at com.amazonaws.services.batch.AWSBatchClient.invoke(AWSBatchClient.java:1244)
at com.amazonaws.services.batch.AWSBatchClient.executeDescribeJobs(AWSBatchClient.java:871)
at com.amazonaws.services.batch.AWSBatchClient.describeJobs(AWSBatchClient.java:846)
at nextflow.cloud.aws.batch.AwsBatchTaskHandler.describeJob(AwsBatchTaskHandler.groovy:182)
at nextflow.cloud.aws.batch.AwsBatchTaskHandler.checkIfRunning(AwsBatchTaskHandler.groovy:210)
at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:585)
at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:517)
at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:398)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:98)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1225)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1034)
at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:947)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:930)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:92)
at nextflow.processor.TaskPollingMonitor$_start_closure4.doCall(TaskPollingMonitor.groovy:299)
at nextflow.processor.TaskPollingMonitor$_start_closure4.call(TaskPollingMonitor.groovy)
at groovy.lang.Closure.run(Closure.java:499)
at java.lang.Thread.run(Thread.java:748)
Jul-05 18:56:30.980 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 2121
0e13-8085-11e8-b137-132bbfe4e0ee)
Logfile Output:
$ grep submission .nextflow.log
Jul-05 18:54:47.193 [ThrottlerExecutor-3] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 120.00/sec
Jul-05 18:54:47.415 [ThrottlerExecutor-40] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 60.00/sec
Jul-05 18:54:49.544 [ThrottlerExecutor-5] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 30.00/sec
Jul-05 18:54:50.668 [ThrottlerExecutor-23] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 15.00/sec
Jul-05 18:54:51.907 [ThrottlerExecutor-33] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 18.00/sec
Jul-05 18:54:52.937 [ThrottlerExecutor-31] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 9.00/sec
Jul-05 18:54:55.471 [ThrottlerExecutor-25] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 4.50/sec
Jul-05 18:54:58.357 [ThrottlerExecutor-36] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 5.40/sec
Jul-05 18:55:17.021 [ThrottlerExecutor-21] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 6.48/sec
Jul-05 18:55:33.791 [ThrottlerExecutor-36] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 7.78/sec
Jul-05 18:55:47.691 [ThrottlerExecutor-15] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 9.33/sec
Jul-05 18:55:59.142 [ThrottlerExecutor-2] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 11.20/sec
Jul-05 18:56:08.816 [ThrottlerExecutor-15] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 13.44/sec
Jul-05 18:56:16.848 [ThrottlerExecutor-36] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 16.12/sec
Jul-05 18:56:23.556 [ThrottlerExecutor-23] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 19.35/sec
Jul-05 18:56:30.126 [ThrottlerExecutor-16] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 23.22/sec
Jul-05 18:56:30.878 [ThrottlerExecutor-39] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 11.61/sec
This is really appreciated!!!
Is throttling also applied to the self-cleanup after raising the error during submission?
Jul-05 19:06:48.483 [main] DEBUG n.c.aws.batch.AwsBatchTaskHandler - [AWS BATCH] killing job=4d096338-5603-4fa6-b2f5-84d77223c313
Jul-05 19:06:51.418 [main] DEBUG n.processor.TaskPollingMonitor - Failed to kill pending tasks: TaskHandler[id: 45; name: run (ab.gz_46126565); status: SUBMITTED; exit: -; error: -; workDir: s3://test/work/sum-b5ba01a6-ef51-4dc9-bf2b-37938ab10189/12/b0253934be4bf6f50bb9e98fb60597] -- cause: Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 92f20ec4-8086-11e8-b5a2-735a2f84c681)
Jul-05 19:06:51.418 [main] DEBUG n.c.aws.batch.AwsBatchTaskHandler - [AWS BATCH] killing job=204f9817-0040-4875-9f68-2e2cb66c9475
Ouch .. no. It's more complicated than expected. But at this point it become a personal challenge.
Do you also share the snapshot source code on github?
I've uploaded a new snapshot the provide an improved version of the throttling thread pool. This version handle all AWS batch api call. To test it use the the above commands, it should report the following timestamp and build number:
Version: 0.30.2 build 4881
Modified: 30-06-2018 14:35 UTC (16:35 CEST)
Note: setting queueSize = 500000
don't think it helps too much, it only consumes more memory. The critical setting is the size of the thread pool that controls the parallel request submission. However my gut feeling is that trying to use a too high submission rate only case AWS Batch to return a TooManyRequestException
.
Should the setting poolSize = <num cpus * 5> be based on the submission hosts specs?
It depends, by default the max number of thread is set to num avail cpus * 5. I think it should be a good compromise. You may want to tune using a different setting in the config file.
Do you also share the snapshot source code on github?
Only when I think it's stable enough to be published.
There's still a glitch. Don't test until further notice.
I've uploaded a new snapshot that should manage correctly throttling for all requests.
Version: 0.31.0-SNAPSHOT build 4926
Modified: 08-07-2018 22:29 UTC (09-07-2018 00:29 CEST)
As before update you snapshot with the command:
CAPSULE_RESET=1 NXF_VER=0.31.0-SNAPSHOT nextflow info
Then use:
NXF_VER=0.31.0-SNAPSHOT nextflow run .. etc
A brief explanation for the avail options can be found in the source code.
/cc @fstrozzi
First run was pretty good. I'll do some additional testings and also upload submission rate output. So far this is a great step forward. I'm impressed, although AWS API seems to be very strict in limiting API calls.
Yeah, I think they cap at 20/s
So multi-threaded sessions or array jobs could reduce the submission time once more. As you can see it's taken 5h now.
So, is it good?
So far so good, I tested this version and the difference in the submission speed is impressive . I'll test it more extensively and I'll report here any issue.
Still testing, also killing a running submission using CTRL+C.
....jobs....
Oops .. something went wrong
ERROR ~ Unexpected error [AbortedException]
-- Check script 'main.nf' at line: 54 or see '.nextflow.log' file for more details
[4c/15c5f4] Submitted process > ...
WARN: Killing pending tasks (266)
[a2/340c94] Submitted process > ......
One CTRL+C prints "Killing pending tasks" however job submission continues until a 2nd CTRL+C which cleans up submitted jobs using the new throttling method (very good).
Anyway, the submission speed is a major improvement - impressive.
Can you upload the .nextflow.log
file the execution reporting that error?
I don't know if this is raised by NF and I didn't pay attention to the compute environments, but there's an error displayed (flapping, though):
@fstrozzi have you any idea of what may cause this message?
@tbugfinder interesting, I've never seen a similar status on AWS Batch. Which kind of IAM setup do you use to interact with Batch ? Have you setup a IAM Profile or have you granted full Batch permissions to your IAM user ?
I've openend a call at AWS.
As per AWS support advise the throttling handling does not apply to AWS best practices.
Note from AWS Support:
First of all I would like to point that I have check the source code for the nextflow.io (https://github.com/nextflow-io/nextflow/blob/master/src/main/groovy/nextflow/executor/AwsBatchExecutor.groovy) and they do not apply the best practice and recommendation for Throttling handling on the AWS API's with the exponential back off[1], where when a error happen it should exponential wait and retry it again, queueing for some ms.