teleport: Original Error: *trace.ConnectionProblemError audit writer is closed).

programmerq commented 2 years ago

Description

What happened:

Teleport logs a ConnectionProblemError stacktrace in multiple nodes in the cluster. The affected nodes are all running at least centos 7.8.

teleport: 2021-11-17T04:17:03-08:00 INFO [SESSION:N] Closing session 367bc28b-fc8c-4ae1-ac2c-61c0cbe4756e. srv/sess.go:635
teleport: 2021-11-17T04:17:03-08:00 ERRO [SESSION:N] Failed to emit session print event. error:[
teleport: ERROR REPORT:
teleport: Original Error: *trace.ConnectionProblemError audit writer is closed
teleport: Stack Trace:
teleport: /go/src/github.com/gravitational/teleport/lib/events/auditwriter.go:294 github.com/gravitational/teleport/lib/events.(*AuditWriter).EmitAuditEvent
teleport: /go/src/github.com/gravitational/teleport/lib/events/auditwriter.go:220 github.com/gravitational/teleport/lib/events.(*AuditWriter).Write
teleport: /go/src/github.com/gravitational/teleport/lib/srv/sess.go:1321 github.com/gravitational/teleport/lib/srv.(*multiWriter).Write
teleport: /opt/go/src/io/io.go:411 io.copyBuffer
teleport: /opt/go/src/io/io.go:368 io.Copy
teleport: /go/src/github.com/gravitational/teleport/lib/srv/sess.go:794 github.com/gravitational/teleport/lib/srv.(*session).startInteractive.func1
teleport: /opt/go/src/runtime/asm_amd64.s:1374 runtime.goexit
teleport: User Message: audit writer is closed] events/auditwriter.go:221

What you expected to happen:

Reproduction Steps

As minimally and precisely as possible, describe step-by-step how to reproduce the problem. 1. 2. 3.

Server Details

Teleport version (run teleport version): Teleport 7.3.7 enterprise
Server OS (e.g. from /etc/os-release): centos 7.8
Where are you running Teleport? (e.g. AWS, GCP, Dedicated Hardware): on-prem
Additional details:

Client Details

n/a

Debug Logs

gz#3553

zmb3 commented 2 years ago

Looks like this is happening when they close the session?

Including the session recording config would be helpful as well: tctl get session_recording_config.

programmerq commented 2 years ago

@zmb3 The session_recording_config:

$ sudo /usr/local/bin/tctl get session_recording_config
kind: session_recording_config
metadata:
  id: 270211497
  labels:
    teleport.dev/origin: config-file 
  name: session-recording-config 
spec: 
  mode: node
  proxy_checks_host_keys: true
version: v2

and yes, this does appear to happen when a session ends. The stack trace appears but the recording is still accessible and plays back successfully.

programmerq commented 2 years ago

After talking with our devs, they shared some ideas for repro attempts:

This is a print event, so try doing a lot of output (ls -lahtrR / or maybe some really long unbuffered output to stdout?).
Try closing the terminal and/or web ui rather than exiting the session by typing exit.
Try it with bpf and without bpf session recording.

I'll try those in my lab and see if that makes the issue happen.

zmb3 commented 2 years ago

It looks like this is mostly harmless, though we are dropping a small amount of data. What's happening is that AuditWriter is asynchronous, so:

Write() and EmitAuditEvent() queue up events to be processed by a background routine
The background routine reads queued up events and writes them to the audit stream
Both Write()/EmitAuditEvent() (the things queueing up events) and the background routine (the thing reading the events) check for when the context is closed, so there's a little bit of a race to see who detects the shutdown first.

As things are shutting down, depending on timing, it's possible that one of the final writes sees that the writer has closed, and the event is dropped and this message is printed, while the background routine also sees that the context is closed and stops reading events.

We could clean this up by closing the channel that acts as an events queue when the audit writer is closed. This informs the background routine that no more data is to be expected, and it can read through the end of the queue before completing the stream. Care would have to be taken to ensure no writes occur to the closed channel, but this approach would both remove this pesky error and ensure that the last write or two isn't dropped as the session is ending.

I will take a look at this as part of the desktop access session recording work, because the same issue can happen there.

gregoryb commented 2 years ago

Trying to understand a bug i have with Ansible which failed, i'm running Teleport in debug mode to understand and i have seen a similar message

2022-01-13T21:09:59Z DEBU [SESSION:N] Session has encountered 1 slow writes out of 2. Check disk and network on this server. events/auditwriter.go:384
2022-01-13T21:09:59Z DEBU [UPLOAD]    Scanned 6 uploads, started 6 in /var/lib/teleport/log/upload/streaming/default. filesessions/fileasync.go:289
2022-01-13T21:09:59Z DEBU [UPLOAD]    Session upload completed. duration:52.393822ms session-id:9a73bb1a-1ff2-42a1-bb40-5fdc8b5fc8e6 filesessions/fileasync.go:443
2022-01-13T21:09:59Z DEBU [UPLOAD]    Session upload completed. duration:52.681424ms session-id:1937bcb8-41db-4cbd-b66e-b1f6d77e267a filesessions/fileasync.go:443
2022-01-13T21:09:59Z DEBU [UPLOAD]    Session upload completed. duration:52.571223ms session-id:68ab0a33-3eef-439a-881b-8115f1a6dd8d filesessions/fileasync.go:443
2022-01-13T21:09:59Z DEBU [UPLOAD]    Session upload completed. duration:59.61868ms session-id:c993c930-ebef-4858-b3d2-13749ff2d267 filesessions/fileasync.go:443
2022-01-13T21:09:59Z DEBU [UPLOAD]    Session upload completed. duration:62.029899ms session-id:87aaa86c-1e95-406c-98be-b5dcecdb38b4 filesessions/fileasync.go:443
2022-01-13T21:09:59Z DEBU [UPLOAD]    Session upload completed. duration:66.342035ms session-id:e3e1092f-cc10-4de5-ba08-e1837f68f31a filesessions/fileasync.go:443
2022-01-13T21:10:00Z WARN [NODE]      Failed to emit session end event. error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError audit writer is closed
Stack Trace:
    /go/src/github.com/gravitational/teleport/lib/events/auditwriter.go:296 github.com/gravitational/teleport/lib/events.(*AuditWriter).EmitAuditEvent
    /go/src/github.com/gravitational/teleport/lib/srv/sess.go:1027 github.com/gravitational/teleport/lib/srv.(*session).startExec.func1
    /opt/go/src/runtime/asm_amd64.s:1371 runtime.goexit

zen commented 2 years ago

teleport: 9.0.1

I'm having the same issue with massive amount of servers. Trying to use teleport transport with Ansible na getting this error. This happens at the very beginning when ansible is copying files to the server:

May 18 19:18:28 example-host teleport[18913]: 2022-05-18T19:18:28Z INFO [NODE]      Creating (exec) session 86cd5d2d-cd1b-4808-9349-49adab29d822. id:15 local:172.x.x.x:51194 login:ec2-user remote:10.1.x.x:40946 teleportUser:zen srv/sess.go:246
May 18 19:18:28 example-host teleport[18913]: 2022-05-18T19:18:28Z INFO [AUDIT]     session.start addr.remote:10.1.x.x:40946 cluster_name:tele.example.com code:T2000I ei:0 event:session.start login:ec2-user namespace:default server_addr:172.x.x.x:51194 server_hostname:example-host server_id:346086605621-i-0689c70ee5a0bce15 server_labels:map[aws_hostname_int:x.compute.internal aws_hostname_pub:x.compute.amazonaws.com aws_instance_id:i-xxx aws_instance_type:m5.2xlarge aws_region:x env:production] session_recording:node sid:86cd5d2d-cd1b-4808-9349-49adab29d822 time:2022-05-18T19:18:28.656Z uid:69714471-3c39-4c13-ba76-88247f9ce504 user:zen events/emitter.go:325
May 18 19:18:28 example-host teleport[18913]: 2022-05-18T19:18:28Z INFO [NODE]      Started local command execution: "/bin/sh -c '( umask 77 && mkdir -p \"` echo /tmp `\"&& mkdir \"` echo /tmp/ansible-tmp-1652901501.0927172-86570-230694619054933 `\" && echo ansible-tmp-1652901501.0927172-86570-230694619054933=\"` echo /tmp/ansible-tmp-1652901501.0927172-86570-230694619054933 `\" ) && sleep 0'" id:15 local:172.x.x.x:51194 login:ec2-user remote:10.1.x.x:40946 teleportUser:zen srv/exec.go:182
May 18 19:18:28 example-host teleport[18913]: 2022-05-18T19:18:28Z INFO [SESSION:N] Closing session 86cd5d2d-cd1b-4808-9349-49adab29d822. srv/sess.go:747
May 18 19:18:28 example-host teleport[18913]: 2022-05-18T19:18:28Z INFO [AUDIT]     exec addr.local:172.x.x.x:51194 addr.remote:10.1.x.x:40946 code:T3002I command:/bin/sh -c '( umask 77 && mkdir -p "` echo /tmp `"&& mkdir "` echo /tmp/ansible-tmp-1652901501.0927172-86570-230694619054933 `" && echo ansible-tmp-1652901501.0927172-86570-230694619054933="` echo /tmp/ansible-tmp-1652901501.0927172-86570-230694619054933 `" ) && sleep 0' ei:0 event:exec exitCode:0 login:ec2-user namespace:default server_id:346086605621-i-0689c70ee5a0bce15 sid:86cd5d2d-cd1b-4808-9349-49adab29d822 time:2022-05-18T19:18:28.847Z uid:34769497-8586-47fc-af4f-eabec195e0c1 user:zen events/emitter.go:325
May 18 19:18:29 example-host teleport[18913]: 2022-05-18T19:18:29Z INFO [AUDIT]     session.data addr.remote:10.1.x.x:40946 code:T2006I ei:2.147483646e+09 event:session.data login:ec2-user namespace:default rx:3990 server_id:346086605621-i-0689c70ee5a0bce15 sid:86cd5d2d-cd1b-4808-9349-49adab29d822 time:2022-05-18T19:18:29.035Z tx:5437 uid:d2b12ea8-8e72-41f2-96aa-55497a557f0e user:zen events/emitter.go:325
May 18 19:18:30 example-host teleport[18913]: 2022-05-18T19:18:30Z WARN [NODE]      Failed to emit session end event. error:[
May 18 19:18:30 example-host teleport[18913]: ERROR REPORT:
May 18 19:18:30 example-host teleport[18913]: Original Error: *trace.ConnectionProblemError audit writer is closed
May 18 19:18:30 example-host teleport[18913]: Stack Trace:
May 18 19:18:30 example-host teleport[18913]: /go/src/github.com/gravitational/teleport/lib/events/auditwriter.go:316 github.com/gravitational/teleport/lib/events.(*AuditWriter).EmitAuditEvent
May 18 19:18:30 example-host teleport[18913]: /go/src/github.com/gravitational/teleport/lib/srv/sess.go:1216 github.com/gravitational/teleport/lib/srv.(*session).startExec.func1
May 18 19:18:30 example-host teleport[18913]: /opt/go/src/runtime/asm_amd64.s:1581 runtime.goexit
May 18 19:18:30 example-host teleport[18913]: User Message: audit writer is closed] id:15 local:172.x.x.x:51194 login:ec2-user remote:10.1.x.x:40946 teleportUser:zen srv/sess.go:1217

On ansible end I'm getting:

fatal: [example-host]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Connection to example-host closed.", "unreachable": true}

I'm getting this on about 40 out of 96 servers where I run this playbook

gravitational / teleport