nextflow-io / nf-nomad

Hashicorp Nomad executor plugin for Nextflow
https://nextflow-io.github.io/nf-nomad/
Apache License 2.0

Implement an exception handler for api client #72

Closed abhi18av closed 1 month ago

abhi18av commented 4 months ago

This PR

  1. Adds a baseline exception handler for the Nomad server connection
  2. Exposes some of the configuration values used to create the API client (for the server)

jagedn commented 4 months ago

I'm not sure this implementation avoids the connection error when the server is restarted. Did you test it?

jagedn commented 4 months ago

As the Nomad API uses OkHttpApi, maybe we can evaluate how to add an interceptor and implement retry logic:

https://square.github.io/okhttp/features/interceptors/
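
A minimal sketch of the kind of retry logic such an interceptor could apply, written in plain Java without any OkHttp types so that it stays self-contained. The `Attempt` interface, the `withRetry` helper, and the attempt count and delay values are all hypothetical names and numbers chosen for illustration; in a real OkHttp interceptor the attempt would roughly correspond to calling `chain.proceed(request)`. This is not the implementation used by the plugin.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.atomic.AtomicInteger;

final class RetrySketch {

    /** Hypothetical stand-in for the call that an interceptor would forward. */
    @FunctionalInterface
    interface Attempt<T> {
        T run() throws IOException;
    }

    /**
     * Runs the attempt up to maxAttempts times. After each IOException it
     * sleeps, then doubles the delay. If every attempt fails, the last
     * IOException is rethrown to the caller.
     */
    static <T> T withRetry(Attempt<T> attempt, int maxAttempts, long initialDelayMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        IOException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.run();
            } catch (IOException e) {
                last = e;
                if (i < maxAttempts) {
                    Thread.sleep(delayMs);
                    delayMs *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated server that refuses the first two connections.
        AtomicInteger calls = new AtomicInteger();
        String result = withRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new ConnectException("Failed to connect");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls.get() + " calls");
        // prints "ok after 3 calls"
    }
}
```

Only `IOException` is retried here, on the assumption that connection failures surface as that type (as `java.net.ConnectException` does); any other exception propagates immediately.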

abhi18av commented 4 months ago

> as the Nomad API uses OkHttpApi maybe we can evaluate how to add an interceptor and implement a retry logic
>
> https://square.github.io/okhttp/features/interceptors/

Agreed, there are multiple options in the HTTP client which we could expose.

Regarding testing the PR: I will check on a local cluster when I'm back at my desk. However, I'm starting to think that we should not use that Process.. Exception.., since I think it intrinsically terminates the execution 🤔

abhi18av commented 4 months ago

So I ran the experiment with this branch, and this is what I observed: a different failure compared to #71.

Experimental setup:

  1. Run ./start-nomad.sh in the validation directory.
  2. Trigger ./run-all.sh --build.
  3. While the execution is underway, kill the Nomad process.
  4. Restart the Nomad process, without clearing any cache.

With the current server-exception branch:

executor >  nomad (4)
[c2/463003] sayHello (4) [100%] 4 of 4, failed: 4 ✘
WARN: [NOMAD] Cannot read exit status for task: `sayHello (2)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/c1/923113855fdf5527bd2b46e11a584c/.exitcode
WARN: [NOMAD] Cannot read exit status for task: `sayHello (3)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/18/e8010df35254f73ce6767a7857d917/.exitcode
WARN: [NOMAD] Cannot read exit status for task: `sayHello (4)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/c2/463003f44e018b65d32d94a5aabf8f/.exitcode
WARN: [NOMAD] Cannot read exit status for task: `sayHello (1)` | /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/56/04ff36f7635d14c721c7019df8d2c1/.exitcode
ERROR ~ Error executing process > 'sayHello (2)'

Caused by:
  Process `sayHello (2)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  echo 'Ciao world!'

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /Users/abhi/projects/nf-nomad/validation/nomad_temp/scratchdir/c1/923113855fdf5527bd2b46e11a584c

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

jagedn commented 2 months ago

@abhi18av

I've refactored our work, and now we are using a Failsafe approach.

I still need to test a little more (trying to stop the cluster and so on), but it looks nice.

jagedn commented 2 months ago

I want to implement a more robust test, but manually stopping and restarting the Nomad process during a bactopia pipeline run (which takes longer to complete than a simple hello) seems to work:

cc @matthdsm

sept-21 12:27:06.156 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfRunning jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=running
sept-21 12:27:06.161 [TaskFinalizer-2] DEBUG nextflow.processor.TaskProcessor - Process BACTOPIA:GATHER:CSVTK_CONCAT > Skipping output binding because one or more optional files are missing: fileoutparam<1>
sept-21 12:27:06.161 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfDead jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=running
sept-21 12:27:11.129 [Task monitor] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] determineClientNode: jobName:nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; clientName:slimbook

sept-21 12:27:11.344 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 1; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:11.958 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 2; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:12.797 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 3; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:15.098 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 4; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:18.115 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 5; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:26.336 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 6; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646
sept-21 12:27:44.490 [Task monitor] DEBUG n.nomad.executor.FailsafeExecutor - Nomad TooManyRequests response error - attempt: 7; reason: java.net.ConnectException: Failed to connect to localhost/127.0.0.1:4646

sept-21 12:27:44.493 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfRunning jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=dead
sept-21 12:27:44.495 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] checkIfDead jobID=nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR; status=dead
sept-21 12:27:44.497 [Task monitor] DEBUG n.nomad.executor.NomadService - Task nf-ce8da502fb20d2b26785dd70f6c85afc-BACTOPIA_QC_QC_MODULE_SR , state=dead
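
To make the retry timing in the log above easier to read, the following small sketch computes the gap between consecutive attempts from the seven `FailsafeExecutor` timestamps. The class and method names (`RetryGaps`, `gapsMillis`) are hypothetical; only the timestamps come from the log. The growing gaps are consistent with some form of backoff between attempts, although the actual retry policy settings are not shown in this thread. Note also that the log message is labelled `TooManyRequests` while the reported reason is a `java.net.ConnectException`.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

final class RetryGaps {

    /** Returns the delay in milliseconds between each pair of consecutive timestamps. */
    static List<Long> gapsMillis(List<String> timestamps) {
        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < timestamps.size(); i++) {
            LocalTime prev = LocalTime.parse(timestamps.get(i - 1));
            LocalTime next = LocalTime.parse(timestamps.get(i));
            gaps.add(Duration.between(prev, next).toMillis());
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Timestamps of attempts 1 to 7, copied from the log.
        List<String> attempts = List.of(
                "12:27:11.344", "12:27:11.958", "12:27:12.797", "12:27:15.098",
                "12:27:18.115", "12:27:26.336", "12:27:44.490");
        System.out.println(gapsMillis(attempts));
        // prints [614, 839, 2301, 3017, 8221, 18154]
    }
}
```

In total, about 33 seconds pass between the first failed attempt and the next successful status check, which is the window during which the Nomad process was down in this manual test.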