nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

AWSBatch Error 500 #5516

Open JoshuaEPSamuel opened 1 week ago

JoshuaEPSamuel commented 1 week ago

I noticed there was a closed issue, #4709, which seemed relevant to my error, but I'm unsure whether closed issues get new alerts, so I created a new issue.

I was running one of my pipelines on AWS Batch, as usual, but when I checked the progress I noticed that the Nextflow script had crashed with the same Error 500 message as in that issue - yet the jobs were still running and Batch could still submit new jobs.

Here is the error message:

DEBUG nextflow.Session - Session aborted -- Cause: Internal server error (Service: AWSBatch; Status Code: 500; Error Code: AuthorizerConfigurationException; Request ID: [...]; Proxy: null)

Background:

The script had been crashed for around 4 hours before I had a look at the log. Whilst I was investigating, one of the processes received a hangup signal as I was reading the logs, which exited the Docker container the script was running in. AWS Batch then immediately submitted another copy of the job, even though Nextflow had exited hours earlier - at this point I manually terminated my AWS Batch jobs.

##########################

I've got a few questions to make sure there isn't an oversight on my side:

JoshuaEPSamuel commented 6 days ago

Any update on root cause of this error message?

pditommaso commented 6 days ago

The error 500 Internal server error suggests it's a failure on AWS Batch; therefore it should not be managed by Nextflow.

Note also this is not a support service, but a developer space to improve the framework.

JoshuaEPSamuel commented 6 days ago

The error 500 Internal server error suggests it's a failure on AWS Batch; therefore it should not be managed by Nextflow.

Note also this is not a support service, but a developer space to improve the framework.

Understood that you may not be able to explain the error code or offer solutions to prevent it.

However, I would still like to highlight the unusual behavior caused by this error code: it crashes the Nextflow pipeline, does not terminate the Batch jobs, and AWS Batch even continues to submit more jobs without Nextflow running.

Given that either cancelling a Nextflow run, or a bug in a process/script that exits Nextflow, terminates all associated Batch jobs, it is very odd that an AWS Batch error is not handled the same way - especially when this can happen across parallel Nextflow pipelines simultaneously, multiplying the cost of wasted cloud compute.

pditommaso commented 6 days ago

Fair enough, the error should be propagated.

pditommaso commented 6 days ago

@JoshuaEPSamuel are you able to include the .nextflow.log of the faulty execution?

JoshuaEPSamuel commented 6 days ago

nextflow_crash_1.log nextflow_crash_2.log

I've added the logs from both Nextflow pipelines that were running simultaneously, which crashed and continued spawning jobs.

jorgee commented 6 days ago

I have seen that DEFAULT_MAX_SPOT_ATTEMPTS in version 24.02.0-edge is 5, but in the latest version it is 0; it was disabled in August. Maybe set it to 0 with aws.batch.maxSpotAttempts.
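
For reference, a minimal sketch of how that option could be set in the pipeline's nextflow.config (placement is illustrative, not taken from the actual config used here):

aws {
    batch {
        maxSpotAttempts = 0   // disable AWS Batch-level spot retries, as suggested above
    }
}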

pditommaso commented 6 days ago

I'm still not able to replicate the problem.

Looking at the log, a task is reported as running at 07:57:42:

Nov-15 07:57:42.683 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor awsbatch > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 7; name: recursive_workflow:<proc2> (2); status: RUNNING; exit: -; error: -; workDir: s3://nf-temp-dirs/9f/f00ca677f776f5d48107e63ce9ddcf]

A few seconds later the 500 Internal server error is reported:

Nov-15 08:00:33.612 [AWSBatch-executor-5269] DEBUG nextflow.Session - Session aborted -- Cause: Internal server error (Service: AWSBatch; Status Code: 500; Error Code: AuthorizerConfigurationException; Request ID: 8945d2bc-d12f-489a-a62b-507f6cd54230; Proxy: null)

However, immediately afterwards it initiates the shutdown, and a few hundred milliseconds later:

Nov-15 08:00:33.915 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

Not sure where the hanging is happening.

JoshuaEPSamuel commented 5 days ago

Unfortunately I no longer have access to any logs from AWS Batch/EC2 as they have expired. But I do have the following additional information:

Screenshot of workdir S3 bucket

The log file timestamp here shows that the program kept running long after Nextflow aborted. The log file only contains information relevant to the process, with no errors related to AWS Batch or Nextflow, so it isn't included here. This screenshot is from 1 of the 2 pipelines; the other pipeline's workdir also has a later log timestamp. [screenshot attached]

Nextflow report

For both pipelines, I have Nextflow reports generated via -with-report. Not sure how much of this is useful for you, but it's interesting that the status column for the running process shows 'running' instead of 'aborted'. This report file was not updated after Nextflow crashed, unlike the .command.log in the workdir, which was constantly being updated by the running process. [screenshot attached]

Nextflow execution

For full reproducibility, these are the commands executed to run the Nextflow pipelines:

cd dir1
nextflow run nextflow-repo/nextflow_scripts/recursion_workflow_script.nf \
    -c nextflow-repo/nextflow_recursive.config \
    -bucket-dir s3://nf-temp-dirs/ \
    -profile awsbatch_spot \
    -with-report \
    -bg \
    --PARAM1=<val1> &> out.log
.....    
.....
cd dir2
nextflow run nextflow-repo/nextflow_scripts/recursion_workflow_script.nf \
    -c nextflow-repo/nextflow_recursive.config \
    -bucket-dir s3://nf-temp-dirs/ \
    -profile awsbatch_spot \
    -with-report \
    --PARAM1=<val2> &> out.log

jorgee commented 5 days ago

Is it possible to get the content of the .command.log? Is this file where you saw the hangup signal? I am wondering if the AWS Batch "autonomous" retries are due to the spot attempts configuration.

JoshuaEPSamuel commented 5 days ago

Is this file where you saw the hangup signal?

The hangup signal appeared in the node terminal, not inside the container - this is why the .command.log does not include any errors and its last lines show intermediate step calculations that were cut off (the log contains a lot of sensitive information which I don't think needs sharing at this time).

Here's a better description of the event:

1) Logged into the node using Session Manager after I noticed that the job was still running after Nextflow had exited.
2) In the node terminal, ran sudo docker ps followed by sudo docker exec -it <container_ID> bash to interactively inspect the container.
3) Inside the container, ran tail -f .command.log to see the live log, which surprisingly was still running fine 2 or more hours after Nextflow exited.
4) While watching the log, there was a hangup signal, which exited the container and brought me back to the node terminal. Only the container was killed; the node was still running.
5) Looking at the AWS Batch dashboard, I noticed that the same job was being reattempted (Status: Runnable -> Starting -> Running etc.), and I then saw it spawn another container on my instance, running the process from the start.
6) Seeing the container reattempt the task without Nextflow running (and therefore controlling it), I manually terminated the job, which then did not try to re-run.

JoshuaEPSamuel commented 5 days ago

I am wondering if the AWSBatch "autonomous" retries are due to the spot attempts configuration.

Potentially. I have only defined hardware/process requirements in my nextflow.config, which is why I mentioned the following at the start, in case of issues with default values:

My config does not specify any changes to aws.batch.maxSpotAttempts and aws.batch.maxTransferAttempts - do I need to?

The Role assigned to my (AWS) Compute Environment/Job Queue only includes policies for EC2/S3 etc - am I missing Batch policies here such as Describe/Submit job?

I couldn't tell whether these updates would A) prevent an AWS Batch Error 500, or B) prevent AWS Batch from reattempting tasks outside of Nextflow.
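
For reference, a sketch of how both options could be pinned explicitly inside the awsbatch_spot profile in nextflow.config; whether doing so addresses A) or B) is exactly the open question, so the values below are illustrative only:

profiles {
    awsbatch_spot {
        aws.batch.maxSpotAttempts     = 0   // disable AWS Batch-level spot retries
        aws.batch.maxTransferAttempts = 3   // illustrative retry count for S3 transfers
    }
}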

jorgee commented 3 days ago

I have been able to reproduce a similar error. The describeJob call to AWS produces an unexpected error during job monitoring, which causes a session abort, but the same unexpected error also makes the monitor consider the job as failed and consequently removes it from the running queue. That is why the job is not cancelled at the end of the execution. I'm working on a fix, but it is a bit tricky because the failing call is the very one used to check whether the job is running.
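
To illustrate the failure mode described above, here is a minimal, self-contained Groovy sketch (not the actual Nextflow AwsBatchTaskHandler code; class and method names are made up): the status poll itself throws, which both aborts the session and makes the poller report the job as no longer running, so the job is dropped from the tracking queue and never terminated on shutdown.

class MonitorSketch {
    List<String> running = ['job-1', 'job-2']
    boolean aborted = false

    // stand-in for the AWS Batch DescribeJobs call; 'job-2' simulates the 500 error
    String describeJob(String jobId) {
        if (jobId == 'job-2')
            throw new RuntimeException('Internal server error (Status Code: 500)')
        return 'RUNNING'
    }

    boolean checkIfRunning(String jobId) {
        try {
            return describeJob(jobId) == 'RUNNING'
        }
        catch (Exception e) {
            aborted = true      // the session abort is triggered here...
            return false        // ...and the same job is reported as not running
        }
    }

    void poll() {
        // a failed poll silently drops the job from the tracking queue
        running.removeAll { !checkIfRunning(it) }
    }

    void shutdown() {
        // only jobs still tracked here would get a terminate call on shutdown;
        // 'job-2' was dropped, so it keeps running (and retrying) on AWS Batch
        println "aborted=$aborted, jobs that would be terminated: $running"
    }
}

def m = new MonitorSketch()
m.poll()
m.shutdown()   // prints: aborted=true, jobs that would be terminated: [job-1]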

JoshuaEPSamuel commented 3 days ago

Interesting, I'm assuming your instance role does not include the AWS Batch policy describeJob?

jorgee commented 2 days ago

@JoshuaEPSamuel, version 24.10.2 has just been released. Could you try running this version? It has some changes that could fix the issue and that the version you use (24.02.0-edge) does not have. First, it has the patch for #4709, which retries when there is an AWS Batch 500 error and also fixes the NPE that aborts the execution. Moreover, the default number of spot attempts is 0.

pditommaso commented 16 hours ago

@jorgee looking more into this, I believe the cleanup is invoked when the abort is triggered by a batch executor error