nextflow-io / nf-nomad

Hashicorp Nomad executor plugin for Nextflow
https://nextflow-io.github.io/nf-nomad/
Apache License 2.0

Nextflow crashes when querying jobstate (from a dead server) #71

Closed (matthdsm closed 1 month ago)

matthdsm commented 4 months ago

Hi,

We noticed that the Nextflow process crashes when the plugin is temporarily unable to query the job state. Perhaps it would be good to add a timeout and some retries here?

Cheers M
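
For illustration, the retry idea suggested above could look roughly like the sketch below. It is not the plugin's code: `RetrySketch`, `StateQuery` and `withRetries` are hypothetical names, and the sketch assumes the failure surfaces as an `IOException` (in the real trace the `ConnectException` arrives wrapped in an `ApiException`). A per-request timeout would be set on the HTTP client and is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;

public class RetrySketch {
    /** Hypothetical stand-in for a call that asks the Nomad server for a job state. */
    @FunctionalInterface
    interface StateQuery<T> {
        T fetch() throws IOException;
    }

    /** Runs the query up to maxAttempts times, doubling the pause after each failure. */
    static <T> T withRetries(StateQuery<T> query, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.fetch();
            } catch (IOException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, report the last failure below
                }
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
                delayMs *= 2; // exponential backoff
            }
        }
        throw new UncheckedIOException("giving up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated server: refuses the first two connections, then answers.
        String state = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("Connection refused");
            }
            return "running";
        }, 5, 10);
        System.out.println(state + " after " + calls[0] + " attempts");
    }
}
```

With a bounded number of attempts, a short outage like the reboot described below would be absorbed, while a server that stays down still produces an error instead of polling forever.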

abhi18av commented 4 months ago

Hi @matthdsm ,

Interesting. The closest experience on my side has been a WARNing that the job hasn't been allocated to a node yet, which we addressed in a recent commit.

Could you please share a minimal reproducible use-case and the version of the plugin used?

Ideally:

  1. Nextflow log
  2. Any specific command/config/pipeline not in the main log

matthdsm commented 4 months ago

We were rebooting some of our services and the public address of our Nomad server was offline for a short while. I got the following in the logs:

Jul-17 07:07:01.940 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] Failed to get jobState nf-70f3e6e3033816668cd1ffc4e6217165-NFCMGG_PREPROCESSING_PRE -- Cause: java.net.ConnectException: Failed to connect to nomad.ops.cmgg.be/172.20.1.206:80
io.nomadproject.client.ApiException: java.net.ConnectException: Failed to connect to nomad.ops.cmgg.be/172.20.1.206:80
        at io.nomadproject.client.ApiClient.execute(ApiClient.java:928)
        at io.nomadproject.client.api.JobsApi.getJobAllocationsWithHttpInfo(JobsApi.java:629)
        at io.nomadproject.client.api.JobsApi.getJobAllocations(JobsApi.java:596)
        at nextflow.nomad.executor.NomadService.getJobState(NomadService.groovy:274)
        at nextflow.nomad.executor.NomadTaskHandler.taskState0(NomadTaskHandler.groovy:187)
        at nextflow.nomad.executor.NomadTaskHandler.checkIfCompleted(NomadTaskHandler.groovy:87)
        at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:649)
        at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:571)
        at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:441)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
        at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
        at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:316)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
        at groovy.lang.Closure.run(Closure.java:505)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.ConnectException: Failed to connect to nomad.ops.cmgg.be/172.20.1.206:80
        at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:297)
        at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207)
        at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226)
        at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106)
        at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74)
        at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
        at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
        at okhttp3.internal.connection.RealCall.execute(RealCall.kt:154)
        at io.nomadproject.client.ApiClient.execute(ApiClient.java:924)
        ... 24 common frames omitted
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
        at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
        at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
        at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.base/java.net.Socket.connect(Socket.java:609)
        at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:120)
        at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
        ... 40 common frames omitted

abhi18av commented 4 months ago

Ah, it's related to the main HTTP connection itself. Sure, this will be addressed soon 👍
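
One possible way to keep a connection failure from crashing the polling loop is to treat it as transient for a limited number of consecutive polls and report the last known state in the meantime. The sketch below only illustrates that idea and is not how the plugin handles it: `TolerantPoller`, `StateQuery` and the `"unknown"` placeholder state are hypothetical, and the failure is assumed to surface as an `IOException`.

```java
import java.io.IOException;

public class TolerantPoller {
    /** Hypothetical stand-in for the call that fetches a job state over HTTP. */
    @FunctionalInterface
    interface StateQuery {
        String fetch() throws IOException;
    }

    private final StateQuery query;
    private final int maxConsecutiveFailures;
    private int consecutiveFailures = 0;
    private String lastKnownState = "unknown";

    TolerantPoller(StateQuery query, int maxConsecutiveFailures) {
        this.query = query;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /**
     * Returns the fresh state when the server answers. While the server is
     * unreachable, returns the last known state until the failure budget is
     * exceeded, then throws so a permanent outage is still reported.
     */
    String poll() {
        try {
            lastKnownState = query.fetch();
            consecutiveFailures = 0; // a successful poll resets the budget
        } catch (IOException e) {
            consecutiveFailures++;
            if (consecutiveFailures > maxConsecutiveFailures) {
                throw new IllegalStateException(
                        "state query failed " + consecutiveFailures + " times in a row", e);
            }
        }
        return lastKnownState;
    }
}
```

Because each call to `poll()` makes a single request, the monitor's own polling interval acts as the delay between attempts, so no extra sleeping is needed inside the handler.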

jagedn commented 4 months ago

interesting edge case to be addressed

abhi18av commented 4 months ago

@matthdsm @jhaezebr a couple of questions for you both

  1. How many Nomad servers are in the cluster?

  2. nomad.ops.cmgg.be/172.20.1.206:80 is the address of the leader, right?