trudesea closed this issue 1 year ago
@kuisathaverat could someone from Elastic look at this?
I guess the enhancement would go here: https://github.com/jenkinsci/opentelemetry-plugin/blob/opentelemetry-2.13.0/src/main/java/io/jenkins/plugins/opentelemetry/backend/elastic/ElasticsearchLogStorageRetriever.java#L110
```java
this.restClient = RestClient.builder(HttpHost.create(elasticsearchUrl))
    .setHttpClientConfigCallback(httpClientBuilder -> {
        httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
        if (disableSslVerifications) {
            try {
                SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustAllStrategy()).build();
                httpClientBuilder.setSSLContext(sslContext);
            } catch (GeneralSecurityException e) {
                logger.log(Level.WARNING, "IllegalStateException: failure to disable SSL certs verification");
            }
            httpClientBuilder.setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE);
        }
        return httpClientBuilder;
    })
    .build();
this.elasticsearchTransport = new RestClientTransport(restClient, new JacksonJsonpMapper());
this.esClient = new ElasticsearchClient(elasticsearchTransport);
```
It relates to these Apache HTTP client APIs (see the sketch below):

- `org.apache.http.impl.nio.client.HttpAsyncClientBuilder#setKeepAliveStrategy`
- `org.apache.http.impl.client.HttpClientBuilder#setDefaultSocketConfig`
- `org.apache.http.config.SocketConfig.Builder#setSoKeepAlive`
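A minimal sketch of how the callback above could be extended, assuming the Apache HttpAsyncClient 4.x APIs that the low-level `RestClient` is built on: `IOReactorConfig.Builder#setSoKeepAlive` enables the TCP socket option, while `setKeepAliveStrategy` caps HTTP-level connection reuse. This is not the plugin's current code, just an illustration (the helper name and the 30 s value are made up):

```java
import org.apache.http.HttpHost;
import org.apache.http.impl.nio.reactor.IOReactorConfig;
import org.elasticsearch.client.RestClient;

// Sketch only, not the plugin's current code: build the low-level REST client
// with SO_KEEPALIVE enabled on the sockets of the underlying async HTTP client.
static RestClient buildKeepAliveRestClient(String elasticsearchUrl) {
    return RestClient.builder(HttpHost.create(elasticsearchUrl))
        .setHttpClientConfigCallback(httpClientBuilder -> httpClientBuilder
            // TCP keepalive: turn the socket option on; how soon/often probes
            // are sent still comes from the OS (net.ipv4.tcp_keepalive_*).
            .setDefaultIOReactorConfig(IOReactorConfig.custom()
                .setSoKeepAlive(true)
                .build())
            // HTTP keep-alive: reuse idle connections for at most 30 s,
            // distinct from the TCP socket option above.
            .setKeepAliveStrategy((response, context) -> 30_000L))
        .build();
}
```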
It can also be tuned at the OS level; it is a matter of changing the TCP stack settings so keepalive packets are sent sooner. The default values are configured for a desktop system, IIRC 7200 seconds before the first keepalive packet is sent:
```shell
sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30
```
@kuisathaverat
Thank you for the response. That change was my fallback plan, as Jenkins is running in Docker and modifying sysctl settings isn't exactly smiled upon when using host networking. In our case the Docker containers are running in Kubernetes, where this is considered "unsafe".
This is why the Google engineer asked whether it was possible to change the keep-alive on the OT client side itself, and why I posted the question.
I shelled in and looked at the sysctl settings on the Jenkins container; it is indeed 7200. I'm not sure whether the Jenkins agent containers will also need to be changed... I assume the controller is what sends the data to the APM server.
I will have to test this thoroughly in our dev environment.
I quickly looked at the Elasticsearch HTTP client APIs (i.e. the Apache HTTP Client APIs), in particular (a usage sketch follows the list):

- `org.apache.http.impl.client.HttpClientBuilder#setDefaultSocketConfig`
- `org.apache.http.config.SocketConfig.Builder#setSoKeepAlive`
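For completeness, those two calls belong to the classic (blocking) Apache client; a hypothetical usage sketch is below (the method name is made up). Note that the Elasticsearch `RestClient` sits on the async client, where `IOReactorConfig#setSoKeepAlive` (shown earlier) is the equivalent knob.

```java
import org.apache.http.config.SocketConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;

// Hypothetical sketch for the classic (blocking) Apache client, not plugin code.
static CloseableHttpClient buildKeepAliveHttpClient() {
    SocketConfig socketConfig = SocketConfig.custom()
        .setSoKeepAlive(true) // ask the OS to send TCP keepalive probes on idle sockets
        .build();
    return HttpClientBuilder.create()
        .setDefaultSocketConfig(socketConfig)
        .build();
}
```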
@cyrille-leclerc
So I'm a systems engineer, not a developer, but I assume this is something that I cannot change?
Also, does the provided trace definitely point to an issue with the OT plugin and a connection drop, or could it be something else failing in the Jenkins job, with the OT plugin just being a victim? The reason I ask is that I see no evidence of a drop in the APM server LB or backend logs. I have the LB backend timeout set to 620 seconds and 640 seconds on the APM server.
Thanks again for your help.
I think I found how to do it https://github.com/elastic/elasticsearch/issues/65213
Nope, it enabled the keepalive but relies on the TCP stack to set the times.
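(As far as I can tell, that matches what the Apache 4.x configs expose: `IOReactorConfig`/`SocketConfig` only offer the `SO_KEEPALIVE` boolean, while the idle time, probe interval and probe count are only reachable through JDK 11+ extended socket options that the builders don't surface, hence the reliance on the OS values. A hypothetical JDK-level illustration, mirroring the sysctl values above; the method name is made up:)

```java
import java.io.IOException;
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

// Hypothetical illustration only: JDK 11+ on Linux can set keepalive timings
// per socket, but the Apache 4.x builders do not expose these knobs, so the
// OS-level sysctl values remain the effective configuration.
static void tuneKeepAlive(Socket socket) throws IOException {
    socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);   // what setSoKeepAlive(true) maps to
    socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 120);    // seconds of idle before the first probe
    socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 30); // seconds between probes
    socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 8);     // failed probes before the peer is declared dead
}
```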
Thanks @kuisathaverat for your perseverance on this!
This is as much as we can do: https://github.com/jenkinsci/opentelemetry-plugin/blob/master/docs/setup-and-configuration.md#tcp-keepalive
Jenkins and plugins versions report
Environment
```text
Jenkins: 2.387.1
OS: Linux - 5.10.162+
Java: 11.0.18 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
ace-editor:1.1 allure-jenkins-plugin:2.30.3 ansicolor:1.0.2 antisamy-markup-formatter:159.v25b_c67cd35fb_ apache-httpcomponents-client-4-api:4.5.14-150.v7a_b_9d17134a_5 atlassian-bitbucket-server-integration:3.3.2 authentication-tokens:1.53.v1c90fd9191a_b_ authorize-project:1.5.1 bitbucket-kubernetes-credentials:152.v16906f88086d bootstrap4-api:4.6.0-5 bootstrap5-api:5.2.2-2 bouncycastle-api:2.27 branch-api:2.1071.v1a_188a_562481 build-name-setter:2.2.0 caffeine-api:2.9.3-65.v6a_47d0f4d1fe checks-api:2.0.0 cloudbees-folder:6.815.v0dd5a_cb_40e0e command-launcher:90.v669d7ccb_7c31 commons-lang3-api:3.12.0-36.vd97de6465d5b_ commons-text-api:1.10.0-36.vc008c8fcda_7b_ configuration-as-code:1625.v27444588cc3d credentials:1224.vc23ca_a_9a_2cb_0 credentials-binding:604.vb_64480b_c56ca_ custom-checkbox-parameter:1.4 display-url-api:2.3.7 durable-task:504.vb10d1ae5ba2f echarts-api:5.4.0-3 email-ext:2.96 extended-choice-parameter:359.v35dcfdd0c20d font-awesome-api:6.3.0-2 generic-webhook-trigger:1.86.3 git:5.0.0 git-client:4.2.0 git-server:1.11 github:1.37.0 github-api:1.303-417.ve35d9dd78549 github-branch-source:1703.vd5a_2b_29c6cdc google-kubernetes-engine:0.8.8 google-metadata-plugin:0.4 google-oauth-plugin:1.0.8 google-storage-plugin:1.5.8 groovy:453.vcdb_a_c5c99890 handlebars:3.0.8 hashicorp-vault-pipeline:1.4 hashicorp-vault-plugin:360.v0a_1c04cf807d instance-identity:142.v04572ca_5b_265 ionicons-api:45.vf54fca_5d2154 jackson2-api:2.14.2-319.v37853346a_229 jakarta-activation-api:2.0.1-3 jakarta-mail-api:2.0.1-3 javadoc:233.vdc1a_ec702cff javax-activation-api:1.2.0-6 javax-mail-api:1.6.2-8 jaxb:2.3.8-1 jdk-tool:63.v62d2fd4b_4793 jjwt-api:0.11.5-77.v646c772fddb_0 job-dsl:1.83 jquery:1.12.4-1 jquery3-api:3.6.4-1 jsch:0.1.55.61.va_e9ee26616e7 junit:1198.ve38db_d1b_c975 kubernetes:3923.v294a_d4250b_91 kubernetes-client-api:6.4.1-215.v2ed17097a_8e9 kubernetes-credentials:0.10.0 kubernetes-credentials-provider:1.211.vc236a_f5a_2f3c logstash:2.5.0205.vd05825ed46bd mailer:448.v5b_97805e3767 matrix-auth:3.1.7 matrix-project:789.v57a_725b_63c79 maven-plugin:3.21 metrics:4.2.13-420.vea_2f17932dd6 mina-sshd-api-common:2.9.2-62.v199162f0a_2f8 mina-sshd-api-core:2.9.2-62.v199162f0a_2f8 momentjs:1.1.1 oauth-credentials:0.645.ve666a_c332668 okhttp-api:4.10.0-132.v7a_7b_91cef39c opentelemetry:2.13.0 pam-auth:1.10 phabricator-k8s:1.0.0 phabricator-plugin:2.1.5 pipeline-build-step:488.v8993df156e8d pipeline-github-lib:42.v0739460cda_c4 pipeline-graph-analysis:202.va_d268e64deb_3 pipeline-groovy-lib:656.va_a_ceeb_6ffb_f7 pipeline-input-step:466.v6d0a_5df34f81 pipeline-milestone-step:111.v449306f708b_7 pipeline-model-api:2.2125.vddb_a_44a_d605e pipeline-model-definition:2.2125.vddb_a_44a_d605e pipeline-model-extensions:2.2125.vddb_a_44a_d605e pipeline-rest-api:2.32 pipeline-stage-step:305.ve96d0205c1c6 pipeline-stage-tags-metadata:2.2125.vddb_a_44a_d605e pipeline-stage-view:2.32 pipeline-utility-steps:2.15.1 plain-credentials:143.v1b_df8b_d3b_e48 plugin-util-api:3.2.0 popper-api:1.16.1-3 popper2-api:2.11.6-2 saltstack:3.2.2 scm-api:631.v9143df5b_e4a_a script-security:1244.ve463715a_f89c snakeyaml-api:1.33-95.va_b_a_e3e47b_fa_4 ssh-credentials:305.v8f4381501156 ssh-slaves:2.877.v365f5eb_a_b_eec sshd:3.249.v2dc2ea_416e33 structs:324.va_f5d6774f3a_d terraform:1.0.10 theme-manager:1.6 throttle-concurrents:2.10 token-macro:359.vb_cde11682e0c trilead-api:2.84.v72119de229b_7 uno-choice:2.6.5
variant:59.vf075fe829ccb view-job-filters:364.v48a_33389553d workflow-aggregator:596.v8c21c963d92d workflow-api:1208.v0cc7c6e0da_9e workflow-basic-steps:1010.vf7a_b_98e847c1 workflow-cps:3659.v582dc37621d8 workflow-cps-global-lib:588.v576c103a_ff86 workflow-durable-task-step:1244.vee71f675dee6 workflow-job:1289.vd1c337fd5354 workflow-multibranch:733.v109046189126 workflow-scm-step:408.v7d5b_135a_b_d49 workflow-step-api:639.v6eca_cd8c04a_a_ workflow-support:839.v35e2736cfd5c
```
What Operating System are you using (both controller, and any agents involved in the problem)?
CentOS (from Docker Hub), inbound agent 4-10.3 from Docker Hub
Reproduction steps
Normal operation of plugin
Expected Results
No errors in pipeline execution
Actual Results
Stack Trace:
Anything else?
Although this is not a bug, I don't know where else to ask this question. We are trying to determine why we are getting `java.io.IOException: Connection reset by peer` errors in our pipelines. These seem to have increased with the addition of the OT plugin. We are sending traces/logs to Elastic APM. The idle timeout setting on the APM server is 45s. We have 3 APM servers running in GKE behind a GCP LB.
I'm wondering if it is possible to change the TCP keepalive for the OT client. The Google engineer stated that this would be one option to try.
Thanks for any insight.