kreuzwerker / terraform-provider-docker

Terraform Docker provider
Mozilla Public License 2.0

Intermittent errors with using SSH #262

Open AndreiPashkin opened 3 years ago

AndreiPashkin commented 3 years ago

Community Note

Terraform (and docker Provider) Version

Terraform v1.0.3
on linux_amd64
+ provider registry.terraform.io/hashicorp/local v2.1.0
+ provider registry.terraform.io/hashicorp/random v3.1.0
+ provider registry.terraform.io/kreuzwerker/docker v2.14.0
+ provider registry.terraform.io/scottwinkler/shell v1.7.7

Affected Resource(s)

Debug Output

β•·
β”‚ Error: Unable to read Docker image into resource: unable to pull image yandex/clickhouse-server:20.3.9.70: command [ssh -l username -- 123.123.123.123 docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=
β”‚ 
β”‚   with module.clickhouse.docker_image.clickhouse,
β”‚   on modules/clickhouse/main.tf line 23, in resource "docker_image" "clickhouse":
β”‚   23: resource "docker_image" "clickhouse" {
β”‚ 
β•΅

Expected Behaviour

No error.

Actual Behaviour

Intermittent errors with error messages as above.

Steps to Reproduce

Just try to launch 10-15 different containers with different images (kafka, postgres, etc.) over SSH. SSH is the most important part.

suzuki-shunsuke commented 3 years ago

Maybe this is related. https://github.com/kreuzwerker/terraform-provider-docker/issues/80

AndreiPashkin commented 3 years ago

@suzuki-shunsuke, maybe it's worth making some sort of debug build that would print more details in the debug output about why the process is being killed? I could use it to obtain more info.

adam-lebon commented 3 years ago

Maybe this is related. #80

I don't think this is related to #80. I have been using this docker provider for a year, and I started seeing this error a few weeks ago when I try to deploy more than ~10 containers. It appears randomly when too many ssh connections are established. Even with -parallelism=2, the error still occurs. The terraform debug output doesn't provide any useful information, and stderr is empty, as mentioned by @AndreiPashkin.

I will try to investigate this over the next few weeks.

AndreiPashkin commented 3 years ago

@adam-lebon,

Maybe it has something to do with SSH keep-alive settings, or with settings that limit the number of simultaneous connections?

dubo-dubon-duponey commented 3 years ago

Can confirm the problem here (in my case it happens on almost every run, with a large deployment).

I believe it is a regression from previous versions, as I recently upgraded to 2.15 and never had this issue before.

dubo-dubon-duponey commented 3 years ago

Reusing sockets seems to help with the issue in my use case (https://docs.rackspace.com/blog/speeding-up-ssh-session-creation/#:~:text=The%20ControlMaster%20option%20is%20one,over%20the%20same%20underlying%20connection.)

Pretty much:

Host *
    ControlMaster     auto
    ControlPath       ~/.ssh/control-%C
    ControlPersist    yes

Not clear to me what could have changed (I assume in the vendored docker/cli) - but they do also mention it in their docs: https://docs.docker.com/engine/security/protect-access/#ssh-tips

This is not a fix though, and likely to cause issues.
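One gotcha with the ControlMaster workaround: stale control sockets can linger after a crash and cause confusing connection failures. A minimal sketch of managing them, assuming the ControlPath shown above (`user@remote-host` is a placeholder):

```shell
# Check whether a shared master connection is alive for a host
ssh -O check user@remote-host

# Tear down the shared connection when you are done
ssh -O exit user@remote-host

# After a crash, remove stale control sockets so new masters can start
rm -f ~/.ssh/control-*
```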

cameronelliott commented 3 years ago

Wow, I was having issues with this provider on DigitalOcean, and it turns out I was hitting iptables rate-limiting thresholds for ssh. I didn't know about the ControlPersist ssh setting, but I believe that would have fixed the issue for me.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days. If you don't want this issue to be closed, please set the label pinned.

dubo-dubon-duponey commented 2 years ago

Closing bug reports for lack of activity on the project might not be the right way to deal with issues...

compojoom commented 2 years ago

I was also trying to deploy 10 docker containers on a digital ocean droplet and was at first getting a slightly different error: Connection refused

After adding the following lines

Host *
    ControlMaster     auto
    ControlPath       ~/.ssh/control-%C
    ControlPersist    yes

to my ssh config, as suggested above, the issue was resolved for me.

Just a note. Once I added the above lines and tried to run terraform apply I got the same error as in the original issue:

Unable to read Docker image into resource: unable to pull image xxx: error pulling image xxx: error during connect: Post "http://docker.example.com/v1.41/images/create?fromImage=xxx": command [ssh -l root -- xxx.xxx.xxx.xxx docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

Rerunning the command once again fixed this.

dubo-dubon-duponey commented 2 years ago

@compojoom indeed I do not think the ssh suggestion from above fixes the issue (it merely hides it / alleviates part of it).

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days. If you don't want this issue to be closed, please set the label pinned.

dubo-dubon-duponey commented 2 years ago

Commenting to keep this open.

maximegirardet commented 2 years ago

Same with 2.16: I get "unable to pull" errors when there are more than 2 containers to deploy.

I reverted to 2.11 and the problem is not present.

jesse-savary commented 2 years ago

Also running into this issue -- things work (mostly) fine with one container, but if I try to deploy my whole stack, the pipe will break after a couple hundred lines to STDOUT.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days. If you don't want this issue to be closed, please set the label pinned.

adam-lebon commented 2 years ago

It seems to have been introduced in version 2.12.0. I am investigating to find the origin of this problem and will post updates as soon as I find a fix.

adam-lebon commented 2 years ago

After testing multiple commits manually, I found the bug was introduced when migrating to terraform-sdk v2 (MR Link), which is a huge MR 😭

binlab commented 2 years ago

The same error on large deployments. It came with version 2.12, where everything works more slowly and fails with a connection refused error; on 2.11 there is no error and everything works fast.

prologic commented 2 years ago

Have run into this as well just today. Has anyone spent any time digging into the root cause yet?

AmazingPangWei commented 2 years ago

Hi, guys. You can reproduce this bug as shown below:

terraform {
  required_providers {
    docker = {
      source  = "kreuzwerker/docker"
      version = "2.19.0"
    }
  }
}

provider "docker" {
  host  = "ssh://root@127.0.0.1:22"
  ssh_opts = ["-o", "StrictHostKeyChecking=no", "-o", "UserKnownHostsFile=/dev/null"]
}

resource "docker_container" "nginx" {
  count = 15
  name  = "nginx_${count.index}"
  image = "nginx:latest"
}

If you execute terraform apply, you may get error detail like this:

Unable to create container: error during connect: Post "http://docker.example.com/v1.40/containers/create?name=nginx_8": command [ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -l root -p 22 -- 127.0.0.1 docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts

But if you execute terraform apply -parallelism=1, it works.

So, in my opinion, this bug may be caused by concurrency.

prologic commented 2 years ago

@AmazingPangWei That may be one possible problem for sure.

Q: Did you run this with the latest provider here that was released a little while ago? There was a patch that added some additional error handling to try to identify errors like this:

docker system dial-stdio] has exited with signal: killed

This suggests to me that the docker system dial-stdio process was killed abruptly by the kernel. The only reason that happens (that I know of) is the OOM killer. That might make sense if spawning 15 docker client processes consumes all of the memory on the machine.

Can you provide evidence of that perhaps?
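One way to gather that evidence (a sketch; `user@remote-host` is a placeholder) is to search the remote host's kernel log for OOM-killer activity around the failure time:

```shell
# Look for OOM-killer activity on the remote host; OOM kills are logged
# as "Out of memory: Killed process ..." in the kernel ring buffer.
ssh user@remote-host 'dmesg -T | grep -iE "out of memory|oom-kill" | tail -n 20'

# On systemd hosts, the kernel journal is also queryable with a time window:
ssh user@remote-host 'journalctl -k --since "1 hour ago" | grep -i "killed process"'
```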

k2m30 commented 2 years ago

Hey, just ran into the same issue working with docker terraform provider.

Error: Error waiting for container removal 'd6a56fd5d1703c249f4036778c622ba57141878fba81f47c1ec0c9623e2684af': error during connect: 

Post "http://docker.example.com/v1.39/containers/d6a56fd5d1703c249f4036778c622ba57141878fba81f47c1ec0c9623e2684af/wait?condition=not-running": 

command [ssh -i ./ssh_keys/kvinta.kz -l devops -- 37.151.107.164 docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

I checked the Docker daemon log and found the following:

msg="Container d6a56fd5d1703c249f4036778c622ba57141878fba81f47c1ec0c9623e2684af failed to exit within 0 seconds of signal 15 - using the force"

msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

msg="shim reaped" id=d6a56fd5d1703c249f4036778c622ba57141878fba81f47c1ec0c9623e2684af

It seems that failed to exit within 0 seconds of signal 15 - using the force means no timeout is passed for SIGTERM.

So I wonder if there is a way to pass such a parameter, either through the terraform provider config, or ~/.docker/config.json, or whatever?

k2m30 commented 2 years ago

@prologic it seems that passing "shutdown-timeout": 30 through config_file_content does not change the timeout either.

k2m30 commented 2 years ago

This is what I have with log-level=debug

msg="Calling GET /v1.39/containers/fdff2dee80033c2c032c124cdea24e88a8412a43d83ff8a28f988bc822cbb7ac/json"

msg="Calling POST /v1.39/containers/bafafd4eedc8b4fd5c0aca53abe9698830abdd6366701637223e4dd5e81cc36b/stop?t=0"

time="2022-08-04T09:28:17.944841125" msg="Sending kill signal 15 to container bafafd4eedc8b4fd5c0aca53abe9698830abdd6366701637223e4dd5e81cc36b"

msg="Container bafafd4eedc8b4fd5c0aca53abe9698830abdd6366701637223e4dd5e81cc36b failed to exit within 0 seconds of signal 15 - using the force"

time="2022-08-04T09:28:17.968797818" msg="Sending kill signal 9 to container bafafd4eedc8b4fd5c0aca53abe9698830abdd6366701637223e4dd5e81cc36b"

msg="shim reaped" id=bafafd4eedc8b4fd5c0aca53abe9698830abdd6366701637223e4dd5e81cc36b

adam-lebon commented 2 years ago

@k2m30 I don't think this issue is related to the server in any way, since downgrading the docker provider version solves the issue.

But while digging into the issue, I found out that the docker-compose community has been facing the exact same issue since May 2021 (which matches the very first message of this issue). So all this time I blamed the "terraform-sdk v2" PR, while the bug seems to be more related to the docker client.

The problematic version of the provider bumped the docker client from v20.0.0 to v20.10.5; that may be something to investigate.

adam-lebon commented 2 years ago

Someone posted a solution on the docker-compose issue that not only works, but also speeds up the deployment, since docker API traffic goes through a single socket instead of opening multiple ssh connections.

https://github.com/docker/compose/issues/8544#issuecomment-1060664712

Another option if you want to connect over SSH but not deal with all of docker's ssh flakiness is to set up unix socket forwarding over the SSH connection. I ended up writing a script:

# HOST and PORT are assumed to be set by the caller
dockerssh() {
  rm -f /tmp/docker.sock
  cleanup() {
    # close the ssh master connection and remove the forwarded socket
    ssh -q -S docker-ctrl-socket -p "${PORT}" -O exit "${HOST}"
    rm -f /tmp/docker.sock
  }
  trap "cleanup" EXIT
  # open a master connection and forward the remote docker socket locally
  ssh -M -S docker-ctrl-socket -p "${PORT}" -fnNT -L /tmp/docker.sock:/var/run/docker.sock "${HOST}"
  DOCKER_HOST=unix:///tmp/docker.sock eval "$*"
}

Then dockerssh docker compose ... to run a docker command pointing at that host or dockerssh bash to start a new shell pointing at that remote host.

I use it this way in my CICD scripts:

ssh -M -S ssh-control-socket -fnNT -L /tmp/docker.sock:/var/run/docker.sock "${HOST}"
DOCKER_HOST=unix:///tmp/docker.sock terraform apply -auto-approve
ssh -O exit -S ssh-control-socket "${HOST}"

Don't forget to remove the host = "ssh://..." setting in your terraform configuration.

Junkern commented 2 years ago

Maintainer here: I am monitoring this issue closely and reading every comment. Under the hood we simply use the docker client, so any issues in the docker client also appear in this provider. Anyone hitting this issue, please try the "docker ssh" workaround from the comment above.

I still have not managed to build a reproducible case myself; that's the first thing on my list. I won't have time in the next 2-3 weeks, but after that I will hopefully tackle this. Even with a reproducible case, I am not sure whether there will be a single/simple solution for this issue. Let's see...

adam-lebon commented 2 years ago

Hello @Junkern ,

Thank you so much for investigating this problem. Here are some clues I found while digging into this problem:

How to reproduce

It is very easy to reproduce this bug. Deploy a simple Ubuntu/Debian VM with all default settings.

Then use this terraform config:

terraform {
  required_providers {
    docker = {
      source = "kreuzwerker/docker"
      version = "2.20.2"
    }
  }
}

provider "docker" {
  host = "ssh://ubuntu@192.168.100.71"
}

resource "docker_image" "nginx" {
  name = "nginx:latest"
}

resource "docker_container" "nginx" {
  count = 50

  name         = "nginx-${count.index}"
  image        = docker_image.nginx.latest
  network_mode = "none"
}

Some clues about what is happening

Too many ssh sessions are opened

Looking at the ssh logs, we can see that many ssh sessions are opened, maybe twice as many as usual. You can compare by switching between versions 2.12.0 and 2.11.0.

The responsible MR

The responsible merge request is #102. I checked out this PR and reviewed every modification that was made. The issue appears to have been introduced by the new terraform context-aware functions.

Once I found out it was related to the usage of context-aware functions, I created a new branch based on this PR where you can see two new commits:

Both of these commits work well (please keep in mind I only edited Create and Read; the Destroy function is unchanged).

So I think I've found the origin of the issue, but since I'm not experienced with Go, I'm unable to understand it fully. I hope you will be more successful.

AndreiPashkin commented 1 year ago

@Junkern, what do you think of replacing the builtin Docker ssh capabilities with a custom client that would allow passing connection options (like keep-alive, socket reuse, etc.)?

prologic commented 1 year ago

@Junkern, what do you think of replacing the builtin Docker ssh capabilities with a custom client that would allow passing connection options (like keep-alive, socket reuse, etc.)?

Using an ssh client written in Go? πŸ€”

AndreiPashkin commented 1 year ago

@Junkern, what do you think of replacing the builtin Docker ssh capabilities with a custom client that would allow passing connection options (like keep-alive, socket reuse, etc.)?

Using an ssh client written in Go? πŸ€”

Actually, it seems I made an uninformed comment.

It seems that Docker does allow passing SSH options: https://github.com/kreuzwerker/terraform-provider-docker/blob/cf3f8d64574622872864591fae36d4d9c15d704b/internal/provider/config.go#L122

AndreiPashkin commented 1 year ago

Hi, guys. You can reproduce this bug as shown below:

terraform {
  required_providers {
    docker = {
      source  = "kreuzwerker/docker"
      version = "2.19.0"
    }
  }
}

provider "docker" {
  host  = "ssh://root@127.0.0.1:22"
  ssh_opts = ["-o", "StrictHostKeyChecking=no", "-o", "UserKnownHostsFile=/dev/null"]
}

resource "docker_container" "nginx" {
  count = 15
  name  = "nginx_${count.index}"
  image = "nginx:latest"
}

If you execute terraform apply, you may get error detail like this:

Unable to create container: error during connect: Post "http://docker.example.com/v1.40/containers/create?name=nginx_8": command [ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -l root -p 22 -- 127.0.0.1 docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts

But if you execute terraform apply -parallelism=1, it works.

So, in my opinion, this bug may be caused by concurrency.

This reply tells me that the problem comes from opening too many SSH connections.

Am I right, @Junkern, @prologic?

UPD: Maybe the solution would simply be to introduce an SSH connection pooling proxy?

anubhavcodes commented 1 year ago

I can confirm @AndreiPashkin's hypothesis; I was having the same issue and this solution worked perfectly.

But if you execute terraform apply -parallelism=1, that's worked.

compojoom commented 1 year ago

Can someone explain how to use the proposed workaround from @adam-lebon https://github.com/kreuzwerker/terraform-provider-docker/issues/262#issuecomment-1205447939 with terraform? @Junkern ?

AndreiPashkin commented 1 year ago

Can someone explain how to use the proposed workaround from @adam-lebon #262 (comment) with terraform? @Junkern ?

Basically, you establish an SSH tunnel to the target host manually and configure the provider as if you were working with local Docker.

I think this would be the equivalent of using the -parallelism=1 option.
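Concretely, the manual-tunnel workaround can be sketched like this (paths, user, and host are placeholders; the provider then needs host = "unix:///tmp/docker.sock", or no host at all so it falls back to the DOCKER_HOST environment variable):

```shell
# Forward the remote docker socket to a local unix socket over one
# ssh master connection, run terraform against it, then tear it down.
ssh -M -S /tmp/docker-ctrl -fnNT \
    -L /tmp/docker.sock:/var/run/docker.sock user@remote-host

DOCKER_HOST=unix:///tmp/docker.sock terraform apply

ssh -S /tmp/docker-ctrl -O exit user@remote-host
rm -f /tmp/docker.sock
```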

timvy commented 1 year ago

I worked around this issue by setting MaxSessions to 500 in the target ssh server's config.
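For reference, two sshd limits can bite here: MaxSessions (multiplexed sessions per connection, default 10) and MaxStartups (concurrent unauthenticated connections, default 10:30:100, which randomly drops new connections beyond the first 10). A hedged sketch of raising both on the target host (values are examples, not recommendations):

```shell
# Run on the target host; appends to sshd_config and reloads the daemon.
echo 'MaxSessions 500' | sudo tee -a /etc/ssh/sshd_config
echo 'MaxStartups 100:30:200' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl reload sshd    # or: sudo service ssh reload
```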

tiaden commented 1 year ago

I'm also commenting for the sake of keeping this issue open. This is a serious issue.

Magister-Machinis commented 1 year ago

A small follow-up to the workaround listed above: it doesn't seem to work in WSL or on Windows, but I got it working in an Ubuntu VM after a day or two of banging my head against the prior two environments.

cryptiklemur commented 9 months ago

Could you implement retry logic to avoid this? Also, could interactions with the docker registry be throttled to prevent rate limits?

andrewvmail commented 6 months ago

In my case I can see from tail -f /var/log/syslog that [UFW LIMIT BLOCK] is being triggered and blocking the connection.

I fixed it by running this on the target:

sudo ufw insert 1 allow from [terraform caller ip] to any port 22

timvy commented 5 months ago

Reusing sockets seems to help with the issue in my use case (https://docs.rackspace.com/blog/speeding-up-ssh-session-creation/#:~:text=The%20ControlMaster%20option%20is%20one,over%20the%20same%20underlying%20connection.)

Pretty much:

Host *
    ControlMaster     auto
    ControlPath       ~/.ssh/control-%C
    ControlPersist    yes

Not clear to me what could have changed (I assume in the vendored docker/cli) - but they do also mention it in their docs: https://docs.docker.com/engine/security/protect-access/#ssh-tips

This is not a fix though, and likely to cause issues.

Adding these options to my provider settings worked for me. I noticed a separate ssh connection was opened for every resource.

ssh_opts = ["-o", "ControlMaster=auto", "-o", "ControlPath=~/.ssh/control-%C", "-o", "ControlPersist=yes"]

NelsonFrancisco commented 2 months ago

ssh_opts = ["-o", "ControlMaster=auto", "-o", "ControlPath=~/.ssh/control-%C", "-o", "ControlPersist=yes"] does not work for me.

The only way it works without any random errors is with terraform apply -parallelism=1, which is obviously not great.

Using terraform v1.8.3 and kreuzwerker/docker 3.0.2, deploying a container to a Google Cloud VM.
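If -parallelism=1 is the only reliable workaround, Terraform's TF_CLI_ARGS_<command> environment variables can at least make it the default so it isn't forgotten (a sketch):

```shell
# Applies -parallelism=1 to every `terraform apply`/`destroy` in this
# shell, without editing the scripts that invoke terraform directly.
export TF_CLI_ARGS_apply="-parallelism=1"
export TF_CLI_ARGS_destroy="-parallelism=1"
terraform apply
```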