JoshuaEPSamuel opened 1 week ago
Any update on root cause of this error message?
The error 500 Internal server error suggests that it's a failure on AWS Batch, therefore it should not be managed by Nextflow.
Note also that this is not a support service, but a space for developers to discuss and improve the framework.
Understood that you may not be able to explain the error code or offer solutions for preventing it.
However, I would still like to highlight the unusual behavior from this error code which crashes the Nextflow pipeline, does not terminate the Batch jobs, and even continues to submit more jobs without Nextflow running.
If either cancelling a Nextflow script or a bug in a process/script causes an error that exits Nextflow, all associated Batch jobs are terminated; it is very odd that an AWS Batch error is not handled the same way, especially when this can happen across parallel Nextflow pipelines simultaneously, multiplying the cost of wasted cloud compute.
Fair enough, the error should be propagated.
@JoshuaEPSamuel are you able to include the .nextflow.log of the faulty execution?
nextflow_crash_1.log nextflow_crash_2.log
I've added both logs from the 2 Nextflow pipelines running simultaneously which crashed and continued spawning jobs.
I have seen that DEFAULT_MAX_SPOT_ATTEMPTS in version 24.02.0-edge is 5, but in the latest version it is 0; it was disabled in August. Maybe set it to 0 with aws.batch.maxSpotAttempts.
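For what it's worth, setting it explicitly in the config should look roughly like this (an untested sketch; 0 simply disables the Batch-side spot retries so that only Nextflow decides on re-submission):

aws {
    batch {
        // 0 = do not let AWS Batch re-attempt spot-reclaimed jobs on its own
        maxSpotAttempts = 0
    }
}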
I'm still not able to replicate the problem.
Looking at the log, a task is reported as running at 07:57:42:
Nov-15 07:57:42.683 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor awsbatch > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 7; name: recursive_workflow:<proc2> (2); status: RUNNING; exit: -; error: -; workDir: s3://nf-temp-dirs/9f/f00ca677f776f5d48107e63ce9ddcf]
About three minutes later, the 500 Internal server error is reported:
Nov-15 08:00:33.612 [AWSBatch-executor-5269] DEBUG nextflow.Session - Session aborted -- Cause: Internal server error (Service: AWSBatch; Status Code: 500; Error Code: AuthorizerConfigurationException; Request ID: 8945d2bc-d12f-489a-a62b-507f6cd54230; Proxy: null)
However, it initiates the shutdown immediately after that, just a few hundred milliseconds later:
Nov-15 08:00:33.915 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye
Not sure where the hanging is happening.
Unfortunately I no longer have access to any logs from AWS Batch/EC2 as they have expired. But I do have the following additional information:
The log file timestamp here shows that the program kept running long after Nextflow aborted. The log file only contains information relevant to the process, with no errors relevant to AWS Batch or Nextflow, so there is no need to include it here. This screenshot is of 1 of the 2 pipelines; the other pipeline's workdir also has a log timestamp later than the abort.
For both pipelines, I have Nextflow reports generated via -with-report. Not sure how much of the information here is useful for you, but it's interesting that the status column for the running process shows 'running' instead of 'aborted'. The report file was not updated after Nextflow crashed, unlike the .command.log in the workdir, which was constantly being updated by the running process.
For full reproducibility, here are the commands I executed to run the Nextflow pipelines:
cd dir1
nextflow run nextflow-repo/nextflow_scripts/recursion_workflow_script.nf \
-c nextflow-repo/nextflow_recursive.config \
-bucket-dir s3://nf-temp-dirs/ \
-profile awsbatch_spot \
-with-report \
-bg \
--PARAM1=<val1> &> out.log
.....
.....
cd dir2
nextflow run nextflow-repo/nextflow_scripts/recursion_workflow_script.nf \
-c nextflow-repo/nextflow_recursive.config \
-bucket-dir s3://nf-temp-dirs/ \
-profile awsbatch_spot \
-with-report \
--PARAM1=<val2> &> out.log
Is it possible to get the content of the .command.log? Is this file where you saw the hangup signal? I am wondering if the AWSBatch "autonomous" retries are due to the spot attempts configuration.
Is this file where you saw the hangup signal?
The hangup signal was on the node terminal, not inside the container - this is why the .command.log does not include any errors and its last lines show intermediate step calculations that were cut short (the log contains a lot of sensitive information which I don't see a need to share at this time).
Here's a better description of the event:
1) Logged into the node using Session Manager after I noticed that the job was still running after Nextflow had exited.
2) In the node terminal, ran sudo docker ps followed by sudo docker exec -it <container_ID> bash to interactively inspect the container.
3) Inside the container, I ran tail -f .command.log to see the live log, which surprisingly was still running fine 2 or more hours after Nextflow exited.
4) While watching the log, there was a hangup signal, which exited the container and brought me back to the node terminal. Only the container was killed; the node was still running.
5) After looking at the AWS Batch dashboard, I noticed that the same job was being reattempted (Status: Runnable -> Starting -> Running etc.), which I then saw spawn another container on my instance running the process from the start.
6) Seeing the job reattempt the task without Nextflow running, and therefore without anything controlling it, I manually terminated the job, after which it did not try to re-run.
I am wondering if the AWSBatch "autonomous" retries are due to the spot attempts configuration.
Potentially this - I have only defined hardware/process requirements in my nextflow.config, which is why I mentioned the following at the start, in case of issues with default values:
My config does not specify any changes to aws.batch.maxSpotAttempts and aws.batch.maxTransferAttempts - do I need to?
The Role assigned to my (AWS) Compute Environment/Job Queue only includes policies for EC2/S3 etc - am I missing Batch policies here such as Describe/Submit job?
Couldn't tell if these updates would A) prevent an AWS Batch Error 500, or B) prevent AWS Batch from reattempting tasks outside of Nextflow.
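For context, this is roughly what I understand adding an explicit block to my profile would look like (only a sketch I have not applied yet; the profile name comes from my command above and the values are placeholders, not recommendations):

profiles {
    awsbatch_spot {
        process.executor = 'awsbatch'
        // ... existing queue / hardware settings ...
        aws {
            batch {
                maxSpotAttempts     = 0   // placeholder: leave retries to Nextflow only
                maxTransferAttempts = 3   // placeholder: retries for S3 up/downloads
            }
        }
    }
}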
I have been able to reproduce a similar error. The describeJob call to AWS produces an unexpected error during job monitoring, which aborts the session, but the monitor also considers the job as failed because of that same unexpected error and consequently removes it from the running queue. That is why the job is not cancelled at the end of the execution. Working on a fix, but it is a bit tricky because the failing call is the very one used to check whether the job is running.
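To illustrate the idea outside of Nextflow (only a sketch against the plain AWS SDK for Java v1, not the actual monitor code): the status poll should treat a transient 5xx from describeJobs as "unknown, try again later" rather than as the job having failed.

import com.amazonaws.services.batch.AWSBatch
import com.amazonaws.services.batch.AWSBatchClientBuilder
import com.amazonaws.services.batch.model.AWSBatchException
import com.amazonaws.services.batch.model.DescribeJobsRequest

// Returns the Batch job status (RUNNING, SUCCEEDED, FAILED, ...) or null meaning "unknown, poll again later"
String pollJobStatus(AWSBatch client, String jobId, int maxRetries = 5) {
    int attempt = 0
    while( true ) {
        try {
            def resp = client.describeJobs(new DescribeJobsRequest().withJobs(jobId))
            return resp.jobs ? resp.jobs[0].status : null
        }
        catch( AWSBatchException e ) {
            // a 5xx from the API says nothing about the job itself -- retry instead of flagging it FAILED
            if( e.statusCode >= 500 && attempt++ < maxRetries ) {
                sleep( 1000 * attempt )   // simple linear back-off between polls
                continue
            }
            throw e
        }
    }
}

def client = AWSBatchClientBuilder.defaultClient()
println pollJobStatus(client, 'your-job-id')   // hypothetical job id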
Interesting, I'm assuming your instance role does not include the AWS Batch policy describeJob?
@JoshuaEPSamuel, version 24.10.2 has just been released. Could you try running this version? It has some changes that could fix the issue which the version you are using (24.02.0-edge) does not have. First, it has the patch for #4709 that retries when there is an AWS Batch 500 error and also fixes the NPE that aborts the execution. Moreover, the default number of spot attempts is 0.
Noticed there was a closed issue #4709 which seemed relevant to my error, but I was unsure if closed issues get new alerts, so I created another issue.
I was running one of my pipelines on AWS Batch, as usual, but when I checked the progress I noticed that the Nextflow script had crashed with the same Error 500 message here - but the jobs were still running and Batch could submit new jobs.
Here is the error message:
Background: N E X T F L O W ~ version 24.02.0-edge, using the recursion feature.
The script had crashed around 4 hours before I had a look at the log, and whilst I was investigating this issue, I noticed that one of the scripts received a hangup signal whilst I was reading the logs, which exited the docker container the script was running in. AWS Batch then immediately submitted another copy of the job, even though Nextflow exited hours ago - at this point I manually terminated my AWS Batch jobs.
I've got a few questions to make sure there isn't an oversight on my side:
My config does not specify any changes to aws.batch.maxSpotAttempts and aws.batch.maxTransferAttempts - do I need to?
The Role assigned to my (AWS) Compute Environment/Job Queue only includes policies for EC2/S3 etc - am I missing Batch policies here such as Describe/Submit job?