Cluster Management fails when run from OSX

displague commented 1 year ago

User Story

As a [developer/user/operator] I would like to [high level description] for [reasons]

Detailed Description

[A clear and concise description of what you want to happen.]

Anything else you would like to add:

[Miscellaneous information that will assist in solving the issue.]

/kind feature

cprivitere commented 1 year ago

More specifically, we believe the issue is with docker desktop on MacOS.

ctreatma commented 1 year ago

On macOS, using Docker Desktop, the machine controller fails to bring up devices in Equinix Metal. The logs are filled with messages like this:

E0309 17:22:22.216338       1 controller.go:317] controller/packetmachine "msg"="Reconciler error" "error"="failed to create scope: failed to get workload cluster client: failed to create client for Cluster default/my-cluster: Get \"https://139.178.81.91:6443/api?timeout=10s\": dial tcp 139.178.81.91:6443: connect: connection refused" "name"="my-cluster-control-plane-m29n2" "namespace"="default" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="PacketMachine"

The error message above comes from here: https://github.com/kubernetes-sigs/cluster-api-provider-packet/blob/main/controllers/packetmachine_controller.go#L130

Further investigation indicates that, for some reason, connections to any closed port from a container running in Docker Desktop on macOS fail with Connection refused instead of a connection timeout. Docker Desktop on macOS uses a VM to host the containers, and something about the configuration of that VM appears to cause containers to see closed ports on external services as open but non-responsive:

# From a terminal on the macOS host
$ curl google.com:43421
curl: (28) Failed to connect to google.com port 43421 after 75006 ms: Couldn't connect to server
# From an instance of mikefarah/yq:4.31.1 running in docker on the same macOS host
$ wget google.com:43421
Connecting to google.com:43421 (172.217.1.110:43421)
wget: can't connect to remote host (172.217.1.110): Connection refused

This behavior does not happen with Colima on macOS (which also runs containers in a host VM), so this issue appears to be specific to Docker Desktop on macOS.

We could potentially fix this by treating a Connection refused error the same way we treat a timeout in this code: https://github.com/kubernetes-sigs/cluster-api-provider-packet/blob/365fddba549cb5fa5b32ec469e5cfbb4d3481114/pkg/cloud/packet/scope/machine.go#L342-L355

However, that assumes that Connection refused is always a startup problem, and maybe we can't rely on that to be the case?

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 10 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 9 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-packet/issues/496#issuecomment-1951457762): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

cprivitere commented 8 months ago

/reopen

cprivitere commented 8 months ago

/remove-lifecycle rotten

k8s-ci-robot commented 8 months ago

@cprivitere: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-packet/issues/496#issuecomment-2007134742): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-packet/issues/496#issuecomment-2293575387): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

kubernetes-sigs / cluster-api-provider-packet

Cluster Management fails when run from OSX #496