Closed matthdsm closed 1 month ago
Hi @matthdsm,
Interesting, the closest experience on my side has been a WARNING that the job hasn't been allocated to a node yet, which we addressed in a recent commit.
Could you please share a minimal reproducible use-case and the version of the plugin used?
Ideally
We were rebooting some of our services and the public address of our Nomad server was offline for a short while. I got the following in the logs:
Jul-17 07:07:01.940 [Task monitor] DEBUG n.nomad.executor.NomadService - [NOMAD] Failed to get jobState nf-70f3e6e3033816668cd1ffc4e6217165-NFCMGG_PREPROCESSING_PRE -- Cause: java.net.ConnectException: Failed to connect to nomad.ops.cmgg.be/172.20.1.206:80
io.nomadproject.client.ApiException: java.net.ConnectException: Failed to connect to nomad.ops.cmgg.be/172.20.1.206:80
at io.nomadproject.client.ApiClient.execute(ApiClient.java:928)
at io.nomadproject.client.api.JobsApi.getJobAllocationsWithHttpInfo(JobsApi.java:629)
at io.nomadproject.client.api.JobsApi.getJobAllocations(JobsApi.java:596)
at nextflow.nomad.executor.NomadService.getJobState(NomadService.groovy:274)
at nextflow.nomad.executor.NomadTaskHandler.taskState0(NomadTaskHandler.groovy:187)
at nextflow.nomad.executor.NomadTaskHandler.checkIfCompleted(NomadTaskHandler.groovy:87)
at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:649)
at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:571)
at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:441)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:316)
at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
at groovy.lang.Closure.run(Closure.java:505)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.ConnectException: Failed to connect to nomad.ops.cmgg.be/172.20.1.206:80
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:297)
at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207)
at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226)
at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106)
at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74)
at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
at okhttp3.internal.connection.RealCall.execute(RealCall.kt:154)
at io.nomadproject.client.ApiClient.execute(ApiClient.java:924)
... 24 common frames omitted
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.base/java.net.Socket.connect(Socket.java:609)
at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:120)
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295)
... 40 common frames omitted
Ah, it's related to the main HTTP connection itself. Sure, this will be addressed soon 👍
Interesting edge case to be addressed.
@matthdsm @jhaezebr, a couple of questions for you both:
How many Nomad servers are in the cluster?
nomad.ops.cmgg.be/172.20.1.206:80 is the address of the leader, right?
Hi,
We noticed the Nextflow process crashes when the plugin (temporarily) can't query the job state. Perhaps it would be good to add a timeout and some retries here?
Cheers M
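For illustration, here is a rough sketch of what the retry part of that suggestion could look like. It is not the plugin's code: `RetrySketch`, `withRetries` and `isTransient` are hypothetical names, and the attempt count and delays are arbitrary. It assumes the failure surfaces as in the trace above, with a `java.net.ConnectException` somewhere in the cause chain of the thrown exception, and retries only in that case with a doubling delay.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of nf-nomad: retries a call on network-level failures.
final class RetrySketch {

    // Treat a failure as transient if any exception in its cause chain is an IOException
    // (ConnectException is a subclass), mirroring the wrapped exception in the log above.
    static boolean isTransient(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof IOException) {
                return true;
            }
        }
        return false;
    }

    // Calls `action` up to `maxAttempts` times, doubling the delay after each transient failure.
    // Non-transient failures and the final failed attempt are rethrown unchanged.
    static <T> T withRetries(Callable<T> action, int maxAttempts, Duration initialDelay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Duration delay = initialDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (!isTransient(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delay.toMillis());
                delay = delay.multipliedBy(2);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated job-state query: refuses the connection twice, then succeeds.
        String state = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("Connection refused");
            }
            return "running";
        }, 5, Duration.ofMillis(10));
        System.out.println(state + " after " + calls[0] + " attempts"); // prints "running after 3 attempts"
    }
}
```

In the plugin, the wrapped call would presumably be the job-state query from the trace (`NomadService.getJobState`), so a short outage of the Nomad address would delay the poll rather than fail the run. The timeout half of the suggestion would be set on the HTTP client itself and is not shown here.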