Closed abhi18av closed 1 month ago
I'm not sure this implementation avoids the connection error when the server is restarted. Did you test it?
As the Nomad API uses OkHttpApi, maybe we can evaluate how to add an interceptor and implement retry logic.
Agreed there are multiple options in the HTTP client which we can possibly expose.
Regarding testing the PR, I will check on a local cluster when I'm back at my desk, but I'm starting to think that we should not use that Process.. Exception.., I think it intrinsically terminates the execution 🤔
So I ran the experiment with the branch and this is what I saw: a different failure compared to #71.
Experimental setup:
./start-nomad.sh in validation
./run-all.sh --build
server-exception branch
executor > nomad (4)
[c2/463003] sayHello (4) [100%] 4 of 4, failed: 4 ✘
WARN: [NOMAD] Cannot read exit status for task: `sayHello (2)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/c1/923113855fdf5527bd2b46e11a584c/.exitcode
WARN: [NOMAD] Cannot read exit status for task: `sayHello (3)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/18/e8010df35254f73ce6767a7857d917/.exitcode
WARN: [NOMAD] Cannot read exit status for task: `sayHello (4)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/c2/463003f44e018b65d32d94a5aabf8f/.exitcode
WARN: [NOMAD] Cannot read exit status for task: `sayHello (1)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/56/04ff36f7635d14c721c7019df8d2c1/.exitcode
ERROR ~ Error executing process > 'sayHello (2)'
Caused by:
Process `sayHello (2)` terminated for an unknown reason -- Likely it has been terminated by the external system
Command executed:
echo 'Ciao world!'
Command exit status:
-
Command output:
(empty)
Work dir:
/Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/c1/923113855fdf5527bd2b46e11a584c
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
-- Check '.nextflow.log' file for details
@abhi18av
I've refactored our work and we are now using the Failsafe approach.
I still need to test a little more (stopping the cluster and so on), but it looks good.
I want to implement a more robust test, but manually stopping and restarting the Nomad process during a bactopia pipeline (which takes longer to complete than a simple hello) seems to work:
cc @matthdsm
sept-21 12:27:06.156 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfRunning jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=running
sept-21 12:27:06.161 [TaskFinalizer-2] DEBUG nextflow.processor.TaskProcessor - Process BACTOPIA:GATHER:CSVTK_CONCAT > Skipping output binding because one or more optional files are missing: fileoutparam<1>
sept-21 12:27:06.161 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfDead jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=running
sept-21 12:27:11.129 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] determineClientNode: jobName:nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; clientName:slimbook
sept-21 12:27:11.344 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 1; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:11.958 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 2; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:12.797 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 3; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:15.098 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 4; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:18.115 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 5; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:26.336 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 6; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:44.490 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 7; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:44.493 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfRunning jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=dead
sept-21 12:27:44.495 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfDead jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=dead
sept-21 12:27:44.497 [Task monitor] DEBUG n.nomad.executor.NomadService - Task nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR , state=dead
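For readers following the thread, the behaviour visible in the log above (repeated attempts on `java.net.ConnectException` with growing pauses, then a normal status check once the server answers again) can be sketched roughly as follows. This is a plain-Java stand-in using only the standard library, not the Failsafe API and not the actual `FailsafeExecutor` code from this PR. The class name `RetryWithBackoff`, the methods `call` and `delayFor`, and the delay values are all hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.Callable;

/** Hypothetical helper: retry with capped exponential backoff (not nf-nomad code). */
final class RetryWithBackoff {

    /** Pause before the next try: base * 2^(attempt-1), capped at maxDelayMs. */
    static long delayFor(int attempt, long baseDelayMs, long maxDelayMs) {
        long delay = Math.min(baseDelayMs, maxDelayMs);
        for (int i = 1; i < attempt; i++) {
            if (delay > maxDelayMs / 2) {
                return maxDelayMs; // doubling again would exceed the cap
            }
            delay *= 2;
        }
        return delay;
    }

    /** Runs the request, retrying only on I/O failures such as ConnectException. */
    static <T> T call(Callable<T> request, int maxAttempts,
                      long baseDelayMs, long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return request.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                System.err.println("attempt: " + attempt + "; reason: " + e);
                Thread.sleep(delayFor(attempt, baseDelayMs, maxDelayMs));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated status check: the "server" is unreachable for the first two calls.
        String status = call(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("Failed to connect");
            }
            return "running";
        }, 5, 10, 100);
        System.out.println(status + " after " + calls[0] + " calls");
    }
}
```

Only `IOException` is retried here, so an unrelated error still fails fast. A real setup would probably also add jitter and distinguish HTTP status codes, which this sketch leaves out.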
This PR