Closed WMP closed 3 years ago
This is probably hitting a timeout on the daemon; I have to check where that is set. In newer versions (0.3.x and up) we added Docker retries when we hit errors, which might help in this situation.
So if I understand correctly, you want to check where this timeout is set, and then I can increase it in the source, yes? I cannot use rke 0.3 because I don't want to upgrade this k8s cluster, since calico is currently 0/1 ready.
Isn't it this line: https://github.com/rancher/rke/blob/v0.2.11/hosts/dialer.go#L16 ?
We changed this timeout to 600 and rke up executed successfully. Is it possible to make this parameter configurable?
DEBU[0057] [certificates] Successfully started Certificate deployer container: 5cf12890e6a19e1354bb80adabfaa076a6566dbe84359efe0370a9065ca82335
DEBU[0057] Checking if container [cert-deployer] is running on host [XXX.XXX.0.25]
DEBU[0058] [certificates] Successfully started Certificate deployer container: 3a286802e66e3f69b4f1b0b35057a6027d31c63a34800191465610672569a26f
DEBU[0058] Checking if container [cert-deployer] is running on host [XXX.XXX.0.7]
Can you confirm that the timeout is hit when you query the Docker daemon while rke up is running? 50 seconds is already quite a lot, but I want to confirm that it is not enough; making it configurable is another option to consider. It might help to raise the default timeout, but I need some more info for that. I can also try to reproduce it myself, but that will take a bit more time.
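To make the "configurable timeout" idea concrete, here is a minimal sketch in Go. It is not RKE code: the environment variable name RKE_DOCKER_TIMEOUT_SECONDS and the function dockerTimeout are hypothetical, and the 50-second default is simply the value mentioned in this thread.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultDockerTimeout mirrors the 50 seconds mentioned in this thread.
const defaultDockerTimeout = 50 * time.Second

// timeoutEnvVar is a hypothetical name, not an existing RKE option.
const timeoutEnvVar = "RKE_DOCKER_TIMEOUT_SECONDS"

// dockerTimeout returns the timeout to use for Docker daemon calls.
// lookup has the same signature as os.LookupEnv so it can be faked in tests.
// Missing, non-numeric, or non-positive values fall back to the default.
func dockerTimeout(lookup func(string) (string, bool)) time.Duration {
	raw, ok := lookup(timeoutEnvVar)
	if !ok {
		return defaultDockerTimeout
	}
	secs, err := strconv.Atoi(raw)
	if err != nil || secs <= 0 {
		return defaultDockerTimeout
	}
	return time.Duration(secs) * time.Second
}

func main() {
	// Run with e.g. RKE_DOCKER_TIMEOUT_SECONDS=600 to override the default.
	fmt.Println("docker timeout:", dockerTimeout(os.LookupEnv))
}
```

Falling back to the default on bad input keeps a typo in the override from turning into a zero-second timeout.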
How can I reproduce, using docker -H, what rke does in the step "Successfully started Certificate deployer container"?
If the number in DEBU[0057] is seconds, this means that rke got the result from Docker after more than 50 s. In the log with the timeout, the previous entry for this host is from 0003 s:
DEBU[0003] [certificates] No pull necessary, image [rancher/rke-tools:v0.1.50] exists on host [XXX.XXX.0.25]
and the timeout occurs at:
FATA[0053] [Failed to start Certificates deployer container on host [XXX.XXX.0.25]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?]
I don't have the whole log from the successful run, but I suspect that the entry "No pull necessary, image [rancher/rke-tools:v0.1.50] exists on host [XXX.XXX.0.25]" has the same timestamp, 0003, so in the successful deployment the step "DEBU[0057] Checking if container [cert-deployer] is running on host [XXX.XXX.0.25]" took 57 - 3 = 54 seconds.
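The same arithmetic can be done on the log lines directly. The sketch below is only an illustration: it assumes the bracketed number is seconds elapsed since the process started, and the helper name elapsedSeconds is made up for this example. The first line is the entry from the failed run, which is only suspected to carry the same timestamp in the successful run.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// elapsedRe matches the bracketed counter in prefixes such as "DEBU[0057]".
var elapsedRe = regexp.MustCompile(`^[A-Z]{4}\[(\d+)\]`)

// elapsedSeconds extracts the counter from a single log line.
func elapsedSeconds(line string) (int, error) {
	m := elapsedRe.FindStringSubmatch(line)
	if m == nil {
		return 0, fmt.Errorf("no elapsed-time prefix in %q", line)
	}
	return strconv.Atoi(m[1])
}

func main() {
	// Entry seen in the failed run; assumed to have the same timestamp on success.
	before := "DEBU[0003] [certificates] No pull necessary, image [rancher/rke-tools:v0.1.50] exists on host [XXX.XXX.0.25]"
	// Entry from the successful run with the timeout raised.
	after := "DEBU[0057] Checking if container [cert-deployer] is running on host [XXX.XXX.0.25]"

	start, err := elapsedSeconds(before)
	if err != nil {
		panic(err)
	}
	end, err := elapsedSeconds(after)
	if err != nil {
		panic(err)
	}
	fmt.Printf("step took about %d seconds\n", end-start)
}
```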
Note that yesterday I had a huge load on this node:
Right, so I'm wondering whether it's worth making it configurable, since with the current version it will be retried, and once the node has recovered it would still work. And 50 seconds is quite a long timeout already.
I think it is worth making it possible to set this timeout from the CLI. Imagine that I must add a new node because my old nodes are under very heavy load, and I cannot do that because the static timeout is too short. When rke retries, it closes the current connection to the Docker daemon and tries again with the same timeout. If you really want to rely on retries, I think the timeout should be increased on every retry.
I think if we make this configurable, it will not solve everything as it will hang or break on another component that has a fixed timeout or retry. So if we are going to fix it, we need to test on a host under huge load and test if it can survive on all steps of the process.
RKE version: 0.2.11
Docker version: (docker version, docker info preferred) Docker version 18.09.9, build 039a7df9ba
Operating system and kernel: (cat /etc/os-release, uname -r preferred) Ubuntu 18.04, 4.15.0-45-generic
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Bare-Metal
cluster.yml file:
Steps to Reproduce:
I always get this error on XXX.XXX.0.25 or on XXX.XXX.0.7
My daemon.json:
I can run the following without any problems:
docker -H ssh://XXX.XXX.0.25 ps
and it takes 2 s. I cannot see any interesting errors in the Docker log or in dmesg.