nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.74k stars 626 forks source link

awsbatch executor throttling handling #710

Closed tbugfinder closed 6 years ago

tbugfinder commented 6 years ago

As per AWS support advise the throttling handling does not apply to AWS best practices.

Note from AWS Support:

First of all I would like to point that I have check the source code for the nextflow.io (https://github.com/nextflow-io/nextflow/blob/master/src/main/groovy/nextflow/executor/AwsBatchExecutor.groovy) and they do not apply the best practice and recommendation for Throttling handling on the AWS API's with the exponential back off[1], where when a error happen it should exponential wait and retry it again, queueing for some ms.

pditommaso commented 6 years ago

What's the best practice referenced? Is there any link?

pditommaso commented 6 years ago

Any feedback about this?

pditommaso commented 6 years ago

Related https://docs.aws.amazon.com/general/latest/gr/api-retries.html

tbugfinder commented 6 years ago

That's indeed the link AWS support is referring to.

pditommaso commented 6 years ago

I've just uploaded a snapshot implementing both submit requests throttler and fixing #709. This uses a brand new thread pool able to scale down when a TooManyRequestExecption is raised and gradually scaling up the submission rate, when there are no errors.

On the first tests it looks working fine. Using a remote connection is was able to submit to 20 job / secs (starting from 200/sec).

I guess there's still margin to improve it fine tuning some thread pool parameters for real-world workloads. These are the available parameters with the current defaults:

executor {
  submitter {
             rateLimit = '100/s'
             autoThrottle = true
             poolSize = <num cpus * 5>
             queueSize = 5000
             maxPoolSize = <same as poolSize>,
             maxRetries = <unlimited>
             keepAlive = '1 min'
             errorBurstDelay = '1 sec'
             backOffMinRate = 0
             backOffFactor = 2
             rampUpInterval = 100,
             rampUpFactor = 1.2
             rampUpMaxRate <unlimited>

   } 
}  

To use this snapshot use the following command:

CAPSULE_RESET=1 NXF_VER=0.31.0-SNAPSHOT nextflow info

it should print,

  Version: 0.31.0-SNAPSHOT build 4897
  Modified: 05-07-2018 16:57 UTC (18:57 CEST)

Then run:

 NXF_VER=0.31.0-SNAPSHOT nextflow run .. etc
tbugfinder commented 6 years ago

Should the setting poolSize = <num cpus * 5> be based on the submission hosts specs?

Unfortunately it ended up in TooManyRequestsException (queueSize = 500000 - run as many jobs as possible).

Jul-05 18:56:30.878 [ThrottlerExecutor-39] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 11.61/sec
Jul-05 18:56:30.974 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'runROCS (21c2ccad-f6ba-4305-8437-0acd3cf3f211_Acylation_6.smi_aaadf_split_0000002.oeb.gz_8772
1732)'

Caused by:
  Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 21210e13-8085-11e8-b137-132bbfe4e0ee)

com.amazonaws.services.batch.model.AWSBatchException: Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 21210e13-8085-11e8-b137-132bbfe4e0ee
)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1632)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
        at com.amazonaws.services.batch.AWSBatchClient.doInvoke(AWSBatchClient.java:1268)
        at com.amazonaws.services.batch.AWSBatchClient.invoke(AWSBatchClient.java:1244)
        at com.amazonaws.services.batch.AWSBatchClient.executeDescribeJobs(AWSBatchClient.java:871)
        at com.amazonaws.services.batch.AWSBatchClient.describeJobs(AWSBatchClient.java:846)
        at nextflow.cloud.aws.batch.AwsBatchTaskHandler.describeJob(AwsBatchTaskHandler.groovy:182)
        at nextflow.cloud.aws.batch.AwsBatchTaskHandler.checkIfRunning(AwsBatchTaskHandler.groovy:210)
        at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:585)
        at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:517)
        at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:398)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:98)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1225)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1034)
        at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:947)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:930)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:92)
        at nextflow.processor.TaskPollingMonitor$_start_closure4.doCall(TaskPollingMonitor.groovy:299)
        at nextflow.processor.TaskPollingMonitor$_start_closure4.call(TaskPollingMonitor.groovy)
        at groovy.lang.Closure.run(Closure.java:499)
        at java.lang.Thread.run(Thread.java:748)
Jul-05 18:56:30.980 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 2121
0e13-8085-11e8-b137-132bbfe4e0ee)

Logfile Output:

$ grep submission .nextflow.log
Jul-05 18:54:47.193 [ThrottlerExecutor-3] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 120.00/sec
Jul-05 18:54:47.415 [ThrottlerExecutor-40] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 60.00/sec
Jul-05 18:54:49.544 [ThrottlerExecutor-5] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 30.00/sec
Jul-05 18:54:50.668 [ThrottlerExecutor-23] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 15.00/sec
Jul-05 18:54:51.907 [ThrottlerExecutor-33] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 18.00/sec
Jul-05 18:54:52.937 [ThrottlerExecutor-31] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 9.00/sec
Jul-05 18:54:55.471 [ThrottlerExecutor-25] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 4.50/sec
Jul-05 18:54:58.357 [ThrottlerExecutor-36] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 5.40/sec
Jul-05 18:55:17.021 [ThrottlerExecutor-21] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 6.48/sec
Jul-05 18:55:33.791 [ThrottlerExecutor-36] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 7.78/sec
Jul-05 18:55:47.691 [ThrottlerExecutor-15] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 9.33/sec
Jul-05 18:55:59.142 [ThrottlerExecutor-2] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 11.20/sec
Jul-05 18:56:08.816 [ThrottlerExecutor-15] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 13.44/sec
Jul-05 18:56:16.848 [ThrottlerExecutor-36] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 16.12/sec
Jul-05 18:56:23.556 [ThrottlerExecutor-23] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 19.35/sec
Jul-05 18:56:30.126 [ThrottlerExecutor-16] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 23.22/sec
Jul-05 18:56:30.878 [ThrottlerExecutor-39] DEBUG n.processor.ParallelPollingMonitor - New submission rate limit: 11.61/sec

This is really appreciated!!!

tbugfinder commented 6 years ago

Is throttling also applied to the self-cleanup after raising the error during submission?

Jul-05 19:06:48.483 [main] DEBUG n.c.aws.batch.AwsBatchTaskHandler - [AWS BATCH] killing job=4d096338-5603-4fa6-b2f5-84d77223c313
Jul-05 19:06:51.418 [main] DEBUG n.processor.TaskPollingMonitor - Failed to kill pending tasks: TaskHandler[id: 45; name: run (ab.gz_46126565); status: SUBMITTED; exit: -; error: -; workDir: s3://test/work/sum-b5ba01a6-ef51-4dc9-bf2b-37938ab10189/12/b0253934be4bf6f50bb9e98fb60597] -- cause: Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 92f20ec4-8086-11e8-b5a2-735a2f84c681)
Jul-05 19:06:51.418 [main] DEBUG n.c.aws.batch.AwsBatchTaskHandler - [AWS BATCH] killing job=204f9817-0040-4875-9f68-2e2cb66c9475
pditommaso commented 6 years ago

Ouch .. no. It's more complicated than expected. But at this point it become a personal challenge.

tbugfinder commented 6 years ago

Do you also share the snapshot source code on github?

pditommaso commented 6 years ago

I've uploaded a new snapshot the provide an improved version of the throttling thread pool. This version handle all AWS batch api call. To test it use the the above commands, it should report the following timestamp and build number:

  Version: 0.30.2 build 4881
  Modified: 30-06-2018 14:35 UTC (16:35 CEST)

Note: setting queueSize = 500000 don't think it helps too much, it only consumes more memory. The critical setting is the size of the thread pool that controls the parallel request submission. However my gut feeling is that trying to use a too high submission rate only case AWS Batch to return a TooManyRequestException.

Should the setting poolSize = <num cpus * 5> be based on the submission hosts specs?

It depends, by default the max number of thread is set to num avail cpus * 5. I think it should be a good compromise. You may want to tune using a different setting in the config file.

Do you also share the snapshot source code on github?

Only when I think it's stable enough to be published.

pditommaso commented 6 years ago

There's still a glitch. Don't test until further notice.

pditommaso commented 6 years ago

I've uploaded a new snapshot that should manage correctly throttling for all requests.

  Version: 0.31.0-SNAPSHOT build 4926
  Modified: 08-07-2018 22:29 UTC (09-07-2018 00:29 CEST)

As before update you snapshot with the command:

CAPSULE_RESET=1 NXF_VER=0.31.0-SNAPSHOT nextflow info

Then use:

NXF_VER=0.31.0-SNAPSHOT nextflow run .. etc

A brief explanation for the avail options can be found in the source code.

/cc @fstrozzi

tbugfinder commented 6 years ago

First run was pretty good. I'll do some additional testings and also upload submission rate output. So far this is a great step forward. I'm impressed, although AWS API seems to be very strict in limiting API calls.

pditommaso commented 6 years ago

Yeah, I think they cap at 20/s

tbugfinder commented 6 years ago

submission.rate.log.txt

tbugfinder commented 6 years ago

So multi-threaded sessions or array jobs could reduce the submission time once more. As you can see it's taken 5h now.

pditommaso commented 6 years ago

So, is it good?

fstrozzi commented 6 years ago

So far so good, I tested this version and the difference in the submission speed is impressive . I'll test it more extensively and I'll report here any issue.

tbugfinder commented 6 years ago

Still testing, also killing a running submission using CTRL+C.

....jobs....

Oops .. something went wrong
ERROR ~ Unexpected error [AbortedException]

 -- Check script 'main.nf' at line: 54 or see '.nextflow.log' file for more details
[4c/15c5f4] Submitted process > ...
WARN: Killing pending tasks (266)
[a2/340c94] Submitted process > ......

One CTRL+C prints "Killing pending tasks" however job submission continues until a 2nd CTRL+C which cleans up submitted jobs using the new throttling method (very good).

tbugfinder commented 6 years ago

Anyway, the submission speed is a major improvement - impressive.

pditommaso commented 6 years ago

Can you upload the .nextflow.log file the execution reporting that error?

tbugfinder commented 6 years ago

nextflow.log.KILLING.log

tbugfinder commented 6 years ago

I don't know if this is raised by NF and I didn't pay attention to the compute environments, but there's an error displayed (flapping, though):

image

pditommaso commented 6 years ago

@fstrozzi have you any idea of what may cause this message?

fstrozzi commented 6 years ago

@tbugfinder interesting, I've never seen a similar status on AWS Batch. Which kind of IAM setup do you use to interact with Batch ? Have you setup a IAM Profile or have you granted full Batch permissions to your IAM user ?

tbugfinder commented 6 years ago

I've openend a call at AWS.