kube-hunter CI job is flaky

kinvolk-archives / lokomotive-kubernetes

Lokomotive is a 100% open-source Kubernetes distribution from the folks at Kinvolk

https://kinvolk.io

MIT License


kube-hunter CI job is flaky #145

Closed by invidian 4 years ago

invidian commented 4 years ago

It sometimes doesn't finish within 7 minutes for some reason, which causes the CI job to fail. We should investigate that.

invidian commented 4 years ago

The job is killed with the following logs:

3 tasks left
3 tasks left
3 tasks left
3 tasks left
3 tasks left
('Connection aborted.', OSError(0, 'Error')) on 147.75.84.47:443
2 tasks left
2 tasks left
2 tasks left
2 tasks left
2 tasks left
2 tasks left
2 tasks left
2 tasks left
2 tasks left
2 tasks left
('Connection aborted.', OSError(0, 'Error')) on 147.75.84.193:8080
final hook is hanging
1 tasks left
final hook is hanging
1 tasks left
final hook is hanging
1 tasks left
final hook is hanging
1 tasks left
final hook is hanging
1 tasks left

I wonder if a timeout is missing for this last task... We just need to figure out a way to reproduce it, probably by patching kube-hunter to find out which task it is, and then look into that.

CC @surajssd
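One way to find out which task hangs is to run the tasks under an overall deadline and report the ones that are still pending. The sketch below is not kube-hunter's code: `run_with_deadline` and the task names are hypothetical, and it only illustrates the idea with Python's standard `concurrent.futures` module.

```python
import concurrent.futures
import threading


def run_with_deadline(tasks, deadline_seconds):
    """Run named callables in threads and wait at most deadline_seconds.

    Returns (finished, pending) as sorted lists of task names.
    Pending threads are NOT killed; they are only reported.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(tasks)))
    futures = {pool.submit(fn): name for name, fn in tasks.items()}
    done, not_done = concurrent.futures.wait(futures, timeout=deadline_seconds)
    # Do not block here waiting for stragglers.
    pool.shutdown(wait=False)
    finished = sorted(futures[f] for f in done)
    pending = sorted(futures[f] for f in not_done)
    return finished, pending


# Hypothetical tasks: one returns immediately, one blocks until released.
release = threading.Event()
tasks = {"fast_probe": lambda: "ok", "slow_probe": release.wait}

finished, pending = run_with_deadline(tasks, deadline_seconds=0.2)
release.set()  # let the blocked task end so the interpreter can exit

print("finished:", finished)
print("still pending at deadline:", pending)
```

Logging the pending names at the deadline would point directly at the task that needs its own timeout.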

invidian commented 4 years ago

When I tried to reproduce it, the last task eventually finished with the following result:

1 tasks left
final hook is hanging
1 tasks left
('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) on 147.75.32.35:6443
Starting new HTTPS connection (1): 147.75.32.35:6443
HTTPSConnectionPool(host='147.75.32.35', port=6443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f44a43355d0>: Failed to establish a new connection: [Errno 111] Connection refused')) on 147.75.32.35:6443
Event <class 'src.core.events.types.common.HuntFinished'> got published with <src.core.events.types.common.HuntFinished object at 0x7f44a4ae5110>

invidian commented 4 years ago

It seems that kube-hunter scans the /24 of the pod's public IP (the one used for outgoing traffic), looking for API servers. Some IaaS providers (e.g. Hetzner) might detect that as abuse. It also seems to find some false positives (perhaps other clusters?). In combination with the --active flag, it may then attack other clusters...
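To get a feel for how many third-party addresses a /24 scan touches, here is a minimal sketch using Python's standard `ipaddress` module. It does not reproduce kube-hunter's discovery logic; `candidate_hosts`, `foreign_hosts`, and the list of own nodes are hypothetical, and the IP is the one from the logs above.

```python
import ipaddress


def candidate_hosts(public_ip, prefix_len=24):
    """List every usable host address in the network containing public_ip."""
    network = ipaddress.ip_network(f"{public_ip}/{prefix_len}", strict=False)
    return [str(host) for host in network.hosts()]


def foreign_hosts(public_ip, own_nodes, prefix_len=24):
    """Addresses in the scanned range that do not belong to our own cluster."""
    own = set(own_nodes)
    return [h for h in candidate_hosts(public_ip, prefix_len) if h not in own]


hosts = candidate_hosts("147.75.32.35")
print(len(hosts), hosts[0], hosts[-1])  # 254 147.75.32.1 147.75.32.254

# Assumption for illustration: a two-node cluster inside that range.
others = foreign_hosts("147.75.32.35", ["147.75.32.35", "147.75.32.36"])
print(len(others))  # 252
```

With only a handful of nodes in the cluster, nearly every address in the range belongs to someone else.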

Also the kube-hunter runtime doesn't seem to be deterministic:

Cluster nodes: 2, Runtime: 4m54s
Cluster nodes: 3, Runtime: 5m59s
Cluster nodes: 3, Runtime: 89s
Cluster nodes: 3, Runtime: 2m6s
Cluster nodes: 2, Runtime: 7m5s
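For comparison with the 7-minute CI limit, the runtimes above can be converted to seconds. This is only a small sketch over the five runs listed; `parse_runtime` is a hypothetical helper.

```python
import re

RUNS = """\
Cluster nodes: 2, Runtime: 4m54s
Cluster nodes: 3, Runtime: 5m59s
Cluster nodes: 3, Runtime: 89s
Cluster nodes: 3, Runtime: 2m6s
Cluster nodes: 2, Runtime: 7m5s
"""

LIMIT_SECONDS = 7 * 60  # the CI job limit mentioned at the top of the issue


def parse_runtime(text):
    """Convert a duration such as '4m54s' or '89s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))


seconds = [
    parse_runtime(line.split("Runtime:")[1])
    for line in RUNS.splitlines()
    if line
]
over_limit = [s for s in seconds if s > LIMIT_SECONDS]

print(min(seconds), max(seconds))  # 89 425
print(over_limit)                  # [425]
```

The spread is 89 to 425 seconds, and one of the five runs is already over the limit.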

invidian commented 4 years ago

It seems that some servers which kube-hunter tries to probe take a really long time to respond:

130 ✗ (1.270s) 11:29:15 invidian@dellxps15mateusz ~/repos/kinvolk/kube-hunter (master)$ curl -v -s -k https://147.75.32.35:6443
*   Trying 147.75.32.35:6443...
* TCP_NODELAY set
* Connected to 147.75.32.35 (147.75.32.35) port 6443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* Operation timed out after 300495 milliseconds with 0 out of 0 bytes received
* Closing connection 0
28 ✗ (5m0s) 11:34:17 invidian@dellxps15mateusz ~/repos/kinvolk/kube-hunter (master)*$

I think the HTTP probe should time out earlier than 5 minutes...
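The curl output above shows a TCP connection that succeeds and a TLS handshake that never completes. A minimal sketch of a probe that gives up early in that situation, using only Python's standard `socket` and `ssl` modules, could look like this. It is not kube-hunter's implementation; `probe_tls` and the 0.5 second value are assumptions for illustration. The demo talks to a local listener that accepts connections and stays silent, so it needs no network access.

```python
import socket
import ssl
import time


def probe_tls(host, port, timeout=5.0):
    """Try a TCP connect plus TLS handshake; return 'tls-ok', 'timeout' or 'error'.

    The timeout applies to each blocking socket operation (connect, then the
    handshake reads and writes), so it is not a strict overall deadline.
    """
    context = ssl.create_default_context()
    # Like `curl -k`: skip certificate verification for a reachability probe.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return "tls-ok"
    except socket.timeout:
        return "timeout"
    except OSError:  # refused, reset, TLS errors, ...
        return "error"


# Local stand-in for an unresponsive server: listens but never answers.
silent_server = socket.socket()
silent_server.bind(("127.0.0.1", 0))
silent_server.listen(1)
silent_port = silent_server.getsockname()[1]

start = time.monotonic()
result = probe_tls("127.0.0.1", silent_port, timeout=0.5)
elapsed = time.monotonic() - start
silent_server.close()

print(result, f"{elapsed:.1f}s")
```

Against the silent listener the probe reports a timeout after roughly the configured value instead of waiting for minutes.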

surajssd commented 4 years ago

Do you think we are missing any prerequisite checks that we should be doing before installing?

invidian commented 4 years ago

Do you think we are missing any prerequisite checks that we should be doing before installing?

Can you elaborate? What checks do you have in mind for example? I'm not sure if I understand.

invidian commented 4 years ago

Created the following issues upstream:

And one PR:

https://github.com/aquasecurity/kube-hunter/pull/294

I also tested that, when a timeout is added to the discovery, kube-hunter runs are much faster. I'll do one more round of testing, and my suggestion would be to use a patched version of the kube-hunter image until the issue is solved upstream.

surajssd commented 4 years ago

Can you elaborate? What checks do you have in mind for example? I'm not sure if I understand.

Before we deploy kube-hunter, we do the following checks (not extensive, but enough to make sure the cluster is responsive):

https://github.com/kinvolk/lokomotive-kubernetes/blob/1d4faacd1fa5f78aeb8444c6370ad16d88c46f46/scripts/kube-hunter.sh#L25-L32

I meant do we need to add anything more here?

invidian commented 4 years ago

I meant do we need to add anything more here?

No, I think those checks are fine. I believe the issue is in kube-hunter itself, as described above.