eskuai closed this issue 4 years ago
Hello, @eskuai. It looks like you had an intermittent connectivity issue between SCDF and Skipper, and it also appears that operation resumed afterward.
You may want to review Skipper's deployment/pod logs. Specifically, when you run kubectl describe ..
for these resources, you may find hints as to why the Skipper deployment was choking sporadically.
Hello @sabbyanandan,
I am trying to gather some info to show you. Today, I hit the same problem 3 times ...
On K8s, no pods restarted, there is no audit info about the problem, and no connection warnings ... I can't understand why ...
I had to restart SCDF and Skipper to increase the memory values ... but I got another connection timeout ...
scdf2
[root@k8s-master ~]# kubectl describe pod scdf2-data-flow-server-fcdbc78d5-xv6nl
Name: scdf2-data-flow-server-fcdbc78d5-xv6nl
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: node1-scdf2/10.0.1.121
Start Time: Tue, 15 Oct 2019 17:15:14 +0200
Labels: app=spring-cloud-data-flow
component=server
pod-template-hash=fcdbc78d5
release=scdf2
Annotations: <none>
Status: Running
IP: 10.44.0.4
Controlled By: ReplicaSet/scdf2-data-flow-server-fcdbc78d5
Containers:
scdf2-data-flow-server:
Container ID: docker://b60be2951c49b56428a082589488514acac2c2fc0359a047c2a4db3b8287a668
Image: springcloud/spring-cloud-dataflow-server:2.2.1.RELEASE
Image ID: docker-pullable://docker.io/springcloud/spring-cloud-dataflow-server@sha256:dd8af6eac46118326172907c08ebd24c8da0f861eb67d333e88001fffb175d62
Port: 8080/TCP
Host Port: 0/TCP
State: Running
Started: Tue, 15 Oct 2019 17:15:14 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 3Gi
Requests:
cpu: 600m
memory: 768Mi
Liveness: http-get http://:http/management/health delay=150s timeout=50s period=60s #success=1 #failure=50
Readiness: http-get http://:http/management/health delay=60s timeout=50s period=15s #success=1 #failure=50
Environment:
LOGGING_LEVEL_ROOT: INFO
KUBERNETES_NAMESPACE: default (v1:metadata.namespace)
JAVA_TOOL_OPTIONS: -Duser.timezone=Europe/Madrid -Djavax.net.ssl.trustStorePassword=cc -Djavax.net.ssl.trustStore=/tmp/scdf2cacerts/cacerts -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:-TieredCompilation -XX:TieredStopAtLevel=1 -XX:+UseCompressedOops -XX:+UseCompressedClassPointers -Xverify:none -XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringDeduplication -Xmx2g
SERVER_PORT: 8080
SPRING_CLOUD_CONFIG_ENABLED: false
SPRING_CLOUD_DATAFLOW_FEATURES_ANALYTICS_ENABLED: false
SPRING_JPA_OPEN_IN_VIEW: false
SPRING_CLOUD_KUBERNETES_SECRETS_ENABLE_API: true
SPRING_CLOUD_DATAFLOW_FEATURES_SCHEDULES_ENABLED: true
SPRING_CLOUD_KUBERNETES_SECRETS_PATHS: /etc/secrets
SPRING_CLOUD_KUBERNETES_CONFIG_NAME: scdf2-data-flow-server
SPRING_CLOUD_SKIPPER_CLIENT_SERVER_URI: http://${SCDF2_DATA_FLOW_SKIPPER_SERVICE_HOST}/api
SPRING_CLOUD_DATAFLOW_SERVER_URI: http://${SCDF2_DATA_FLOW_SERVER_SERVICE_HOST}:${SCDF2_DATA_FLOW_SERVER_SERVICE_PORT}
SPRING_CLOUD_DATAFLOW_SECURITY_CF_USE_UAA: true
SECURITY_OAUTH2_CLIENT_CLIENT_ID: dataflow
SECURITY_OAUTH2_CLIENT_CLIENT_SECRET: xxxxx
SECURITY_OAUTH2_CLIENT_ACCESS_TOKEN_URI: https://uaa-svc:8443/oauth/token
SECURITY_OAUTH2_CLIENT_USER_AUTHORIZATION_URI: https://uaa-svc:8443/oauth/authorize
SECURITY_OAUTH2_RESOURCE_USER_INFO_URI: https://uaa-svc:8443/userinfo
SECURITY_OAUTH2_RESOURCE_TOKEN_INFO_URI: https://uaa-svc:8443/check_token
SPRING_APPLICATION_JSON: { "javax.net.ssl.trustStore": "/tmp/scdf2cacerts/cacerts","javax.net.ssl.trustStorePassword": "cc" , "com.sun.net.ssl.checkRevocation": "false", "maven": { "local-repository": "myLocalrepoMK", "remote-repositories": { "mk-repository": {"url": "http://${NEXUS_SERVICE_HOST}:${NEXUS_SERVICE_PORT}/repository/maven-releases/","auth": {"username": "admin","password": "aa"}},"spring-repo": {"url": "https://repo.spring.io/libs-release","auth": {"username": "","password": ""}},"spring-repo-snapshot": {"url": "https://repo.spring.io/libs-snapshot/","auth": {"username": "","password": ""}}}} }
Mounts:
/etc/localtime from tz-config (rw)
/etc/secrets/database from database (ro)
/tmp/scdf2cacerts from tmpcacerts (ro)
/var/run/secrets/kubernetes.io/serviceaccount from scdf2-data-flow-token-q8zgl (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
tmpcacerts:
Type: Secret (a volume populated by a Secret)
SecretName: scdf2cacerts
Optional: false
tz-config:
Type: HostPath (bare host directory volume)
Path: /usr/share/zoneinfo/Europe/Madrid
HostPathType:
database:
Type: Secret (a volume populated by a Secret)
SecretName: scdf2-database
Optional: false
scdf2-data-flow-token-q8zgl:
Type: Secret (a volume populated by a Secret)
SecretName: scdf2-data-flow-token-q8zgl
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 50m default-scheduler Successfully assigned default/scdf2-data-flow-server-fcdbc78d5-xv6nl to node1-scdf2
Normal Pulled 50m kubelet, node1-scdf2 Container image "springcloud/spring-cloud-dataflow-server:2.2.1.RELEASE" already present on machine
Normal Created 50m kubelet, node1-scdf2 Created container scdf2-data-flow-server
Normal Started 50m kubelet, node1-scdf2 Started container scdf2-data-flow-server
[root@k8s-master ~]#
and Skipper:
[root@k8s-master ~]# kubectl describe pod scdf2-data-flow-skipper-74677588f6-8qvf9
Name: scdf2-data-flow-skipper-74677588f6-8qvf9
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: node6-scdf2-glusterfs/10.0.1.94
Start Time: Tue, 15 Oct 2019 17:15:14 +0200
Labels: app=spring-cloud-data-flow
component=skipper
pod-template-hash=74677588f6
release=scdf2
Annotations: <none>
Status: Running
IP: 10.40.0.3
Controlled By: ReplicaSet/scdf2-data-flow-skipper-74677588f6
Containers:
scdf2-data-flow-skipper:
Container ID: docker://79bd5f96d4cb4d2f3329df664ea2427f35c3d67715bce12d4c2f5714b944de0b
Image: springcloud/spring-cloud-skipper-server:2.1.2.RELEASE
Image ID: docker-pullable://docker.io/springcloud/spring-cloud-skipper-server@sha256:b6ea6f8f38ec0afa03c12313303380aec9ec9a0011e92b162faa7c0a854fcc58
Port: 7577/TCP
Host Port: 0/TCP
State: Running
Started: Tue, 15 Oct 2019 17:15:19 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 4Gi
Requests:
cpu: 600m
memory: 768Mi
Liveness: http-get http://:http/actuator/health delay=120s timeout=60s period=60s #success=1 #failure=3
Readiness: http-get http://:http/actuator/health delay=120s timeout=60s period=60s #success=1 #failure=3
Environment:
LOGGING_LEVEL_ROOT: INFO
JAVA_TOOL_OPTIONS: -Duser.timezone=Europe/Madrid -Djavax.net.ssl.trustStorePassword=cc -Djavax.net.ssl.trustStore=/tmp/scdf2cacerts/cacerts -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:-TieredCompilation -XX:TieredStopAtLevel=1 -XX:+UseCompressedOops -XX:+UseCompressedClassPointers -Xverify:none -XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringDeduplication -Xmx2g
KUBERNETES_NAMESPACE: default (v1:metadata.namespace)
SERVER_PORT: 7577
SPRING_JPA_OPEN_IN_VIEW: false
SPRING_CLOUD_CONFIG_ENABLED: false
SPRING_CLOUD_KUBERNETES_SECRETS_ENABLE_API: true
SPRING_CLOUD_KUBERNETES_SECRETS_PATHS: /etc/secrets
SPRING_CLOUD_KUBERNETES_CONFIG_NAME: scdf2-data-flow-skipper
SPRING_CLOUD_DATAFLOW_SECURITY_CF_USE_UAA: true
SECURITY_OAUTH2_CLIENT_CLIENT_ID: skipper
SECURITY_OAUTH2_CLIENT_CLIENT_SECRET: xxxx
SECURITY_OAUTH2_CLIENT_ACCESS_TOKEN_URI: https://uaa-svc:8443/oauth/token
SECURITY_OAUTH2_CLIENT_USER_AUTHORIZATION_URI: https://uaa-svc:8443/oauth/authorize
SECURITY_OAUTH2_RESOURCE_USER_INFO_URI: https://uaa-svc:8443/userinfo
SECURITY_OAUTH2_RESOURCE_TOKEN_INFO_URI: https://uaa-svc:8443/check_token
Mounts:
/etc/localtime from tz-config (rw)
/etc/secrets/database from database (ro)
/tmp/scdf2cacerts from tmpcacerts (ro)
/var/run/secrets/kubernetes.io/serviceaccount from scdf2-data-flow-token-q8zgl (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
tmpcacerts:
Type: Secret (a volume populated by a Secret)
SecretName: scdf2cacerts
Optional: false
tz-config:
Type: HostPath (bare host directory volume)
Path: /usr/share/zoneinfo/Europe/Madrid
HostPathType:
database:
Type: Secret (a volume populated by a Secret)
SecretName: scdf2-database
Optional: false
scdf2-data-flow-token-q8zgl:
Type: Secret (a volume populated by a Secret)
SecretName: scdf2-data-flow-token-q8zgl
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 42m default-scheduler Successfully assigned default/scdf2-data-flow-skipper-74677588f6-8qvf9 to node6-scdf2-glusterfs
Normal Pulling 42m kubelet, node6-scdf2-glusterfs Pulling image "springcloud/spring-cloud-skipper-server:2.1.2.RELEASE"
Normal Pulled 42m kubelet, node6-scdf2-glusterfs Successfully pulled image "springcloud/spring-cloud-skipper-server:2.1.2.RELEASE"
Normal Created 42m kubelet, node6-scdf2-glusterfs Created container scdf2-data-flow-skipper
Normal Started 42m kubelet, node6-scdf2-glusterfs Started container scdf2-data-flow-skipper
The Skipper pod is working OK and there is nothing in its log (debug level), yet SCDF shows a connection timeout error ... There is only a single scheduled task running once per minute ... nothing else running ...
I'll keep watching ...
Thank you for the details, @eskuai. (Aside: please make sure to review and remove any sensitive credentials from the previous comment)
Just curious: how is the K8s cluster's health, and its overall resource capacity? Any issues/errors with the nodes on CPU/memory/network? You may want to review your network configuration within the cluster as well. Generally, though, it is hard to reason through what might contribute to the connection timeout, since it is specific to your cluster and its network configuration.
To troubleshoot it from a different angle, is this happening on a specific operation in SCDF? If yes, what is it?
Hi @sabbyanandan,
K8s cluster health is OK; there was no problem at any level: network, memory, disk ...
The cluster has a master and 6 nodes [AWS EC2 m4.2xlarge], with enough RAM and CPU (32 GB, 8 vCPUs per instance).
We test other projects on it, but the primary use is testing the SCDF 2 platform ...
SCDF and Skipper are running on node1 and node2:
[root@k8s-master templates]# kubectl describe node node1-scdf2 | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 31 May 2019 07:33:16 +0200 Fri, 31 May 2019 07:33:16 +0200 WeaveIsUp Weave pod has set this
MemoryPressure False Tue, 15 Oct 2019 18:49:13 +0200 Wed, 29 May 2019 19:38:31 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 15 Oct 2019 18:49:13 +0200 Wed, 29 May 2019 19:38:31 +0200 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 15 Oct 2019 18:49:13 +0200 Wed, 29 May 2019 19:38:31 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 15 Oct 2019 18:49:13 +0200 Wed, 29 May 2019 19:38:51 +0200 KubeletReady kubelet is posting ready status
[root@k8s-master templates]# kubectl describe node node2-scdf2 | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 31 May 2019 07:33:13 +0200 Fri, 31 May 2019 07:33:13 +0200 WeaveIsUp Weave pod has set this
MemoryPressure False Tue, 15 Oct 2019 18:49:09 +0200 Wed, 29 May 2019 19:45:27 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 15 Oct 2019 18:49:09 +0200 Wed, 29 May 2019 19:45:27 +0200 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 15 Oct 2019 18:49:09 +0200 Wed, 29 May 2019 19:45:27 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 15 Oct 2019 18:49:09 +0200 Wed, 29 May 2019 19:45:47 +0200 KubeletReady kubelet is posting ready status
I think I should enable node-problem-detector for a few hours, but I don't expect it to show me anything; I'll think about it ...
There are a lot of AWS alerts that warn at 80% of capacity, and all of them are OK ... nothing about CPU, memory, or network ... this is an environment with very, very low use ...
Today, the environment is fully dedicated to testing SCDF tasks: start, stop, destroy, restart, logging, configuration, scheduling, checking log and pod disk occupation, failed-task management, configuring TZ, connections, database pool size, max task execution config, etc. ... that is what we are trying ...
There is only a single task running on a per-minute schedule; it reads K8s secrets and config and prints them ... nothing else ... and the logs show the datetime, etc.
Thanks for the thorough walkthrough. If the primary use of SCDF is just for Tasks, you don't need Skipper in the deployment at all. You can even disable the streaming features cleanly by setting SPRING_CLOUD_DATAFLOW_FEATURES_STREAMS_ENABLED to false. Something to think about, though.
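For a task-only deployment, that flag could be set as an environment variable on the Data Flow server container — a minimal sketch, assuming the container spec shown in the `kubectl describe` output above:

```yaml
# Illustrative fragment of the Data Flow server Deployment spec:
# disables the streaming features so Skipper is no longer required.
env:
  - name: SPRING_CLOUD_DATAFLOW_FEATURES_STREAMS_ENABLED
    value: "false"
```

The same value could equally come from the Helm chart's values or a ConfigMap, depending on how the server was installed.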
I know that doesn't answer or address the connectivity issue that you see in your setup, but I do not have any other ideas. Let's see if @chrisjs has any thoughts to share.
Intermittent connectivity issues between pods that appear to be scheduled on two different nodes, with no pod restarts due to resource issues, etc., would lead me to believe there is some sort of connectivity issue in your setup. In one of the logs I also saw something with Weave; I'm not sure if this is being used as a networking component or not.
If this is intermittent but happens often enough that it is likely to be reproduced easily, you could rule out connectivity issues by ensuring both pods are scheduled on the same node via a deployer property, i.e.: https://docs.spring.io/spring-cloud-dataflow/docs/current/reference/htmlsingle/#configuration-kubernetes-deployer
For example, depending on your needs/requirements, some ideas would be nodeSelector, tolerations, etc. Try pegging to the same node, then try pegging to other nodes; that might help narrow down the problem. Also remove anything not needed, such as Weave if it is part of your networking stack.
Those would likely be the simplest testing approaches to start with.
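As a sketch of the deployer-property approach above, pinning the launched pods to a single node could look like the following; the property key and the node-label value are illustrative, so check the exact key against the deployer reference linked above:

```properties
# Hypothetical per-deployment property: schedule every app onto the node
# labelled kubernetes.io/hostname=node1-scdf2 (value format is key:value).
deployer.*.kubernetes.node-selector=kubernetes.io/hostname:node1-scdf2
```

Running one round with both servers pegged to the same node and one round pegged to different nodes would show quickly whether the timeouts are tied to inter-node networking.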
Hi @chrisjs, thank you for the info and the plan ...
https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/
But why did this never happen with SCDF 1.7.3? We are deploying the same apps ... aarrggg. Let's go testing.
About K8s use:
1) There is no problem running SCDF 2 and Skipper on the same node.
2) Avoid using firewalld on your worker nodes; disable it if you can.
3) If you can't disable it, watch out for SNAT done by netfilter.
4) With the firewall disabled, watch out for the kernel TCP config:
4.1) Check with conntrack and verify the time-outs.
4.2) Update the sysctl.conf config:
sysctl -w net.ipv4.tcp_fin_timeout=20
sysctl -w net.ipv4.tcp_tw_reuse=1
5) Run stable/node-problem-detector for a while, collecting logs.
6) If using the Weave CNI, check for an oversized MTU.
7) If using EC2, disable source/dest checking on the network interfaces.
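To make the sysctl values from step 4.2 survive a reboot, they could be dropped into a sysctl config file on each worker node — a sketch, assuming a systemd-style /etc/sysctl.d layout (file name is illustrative):

```conf
# /etc/sysctl.d/99-scdf-tcp.conf — reload afterwards with `sysctl --system`
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_tw_reuse = 1
```

Note these are node-wide TCP tunables, so it is worth verifying them against your kernel's networking documentation before rolling them out everywhere.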
3 days, no failures.
Thanks
@eskuai: Thank you for the update. This would help others if they run into similar issues.
Description: As a user, I can see that SCDF 2 shows a WARN, including a big stack trace, about a connection timeout from SCDF 2 to Skipper.
Release versions: SCDF 2.2.1, Skipper 2.1.2
Custom apps: No custom stream or task apps involved.
Steps to reproduce: I just watch the logs, and SCDF shows a WARN including a stack trace about a connection problem to Skipper.
Neither SCDF 2 nor Skipper restarts; on the K8s dashboard it freezes for a few seconds, and then it successfully starts working again ...