mritd opened this issue 6 years ago
I think I've run into this as well. I tried a few times to execute something like `systemctl stop teleport && another thing && another thing && systemctl start teleport`, and it always just disconnected and never came back up for me.
Due to this problem, on an unfortunate morning, I killed the production Redis server...
We experienced the same issue as well. Setting `KillMode` seems to be the solution. We would still like to know whether the attached PR is going to be merged, and whether setting the `KillMode` causes any problems for Teleport.
If somebody could also explain why some processes get grouped under the teleport process (probably something to do with control groups and systemd), that would be useful :)
Does this happen with OpenSSH too? Starting a process in your terminal and then logging out should cause that process to get a SIGHUP, so the bug here seems to be that Teleport didn't kill the process on logout, but rather when Teleport was restarted.
If you want to start a process in your terminal and have it continue to run, you'd use `nohup` to start it, or do an `exec $SHELL` just before, to orphan your process group. Otherwise you should expect the process to be killed on logout.
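A minimal sketch of the two approaches described above; the `redis-server` command and paths are illustrative placeholders, not from the original report:

```bash
# Option 1: nohup makes the command ignore the terminal's SIGHUP and
# redirects its output, so it survives logout:
nohup redis-server /etc/redis/redis_example.conf > /tmp/redis.log 2>&1 &

# Option 2: start the job, then replace the shell with exec; the new shell
# has no record of the old job, so it won't SIGHUP it on logout:
redis-server /etc/redis/redis_example.conf &
exec $SHELL
```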
Most systemd files for OpenSSH have `KillMode=process`, hence the question of whether Teleport should have it as well, since it acts the same. I understand that you need to start your processes so they get daemonized somehow. But in our case we use a deploy tool (Capistrano) which boots up the processes for our Rails app. All fine and dandy until we upgrade Teleport and do `service teleport restart`: all of a sudden, all the processes that were started using the deploy tool stop working. After we set `KillMode=process`, that no longer happens. We have seen this happen mostly with processes that use a socket (god, puma, ...).
What we do not understand (and that is probably not Teleport's fault; we are lacking Unix knowledge) is why the processes that we started using the deploy tool end up under the teleport process and fail when we restart it. If we do a `service teleport status -l`, we see all our processes hanging under the main teleport process. If you or somebody else can enlighten us, that would be great.
But for now we set `KillMode=process` to avoid the issues mentioned above, and maybe the PR is really valid and should be merged.
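To see the grouping described above for yourself, two standard checks (the `puma` process name is just an example from this comment):

```bash
# Show the process tree inside Teleport's cgroup; processes started via
# Teleport sessions appear under teleport.service:
systemd-cgls /system.slice/teleport.service

# Or check which cgroup a specific process belongs to:
cat /proc/$(pgrep -o puma)/cgroup
```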
If you have to do it with OpenSSH too, and folks are modifying those configs as well, then that suggests this is the traditional behavior across operating systems: when you log out, the OS needs to clean up your session. Though it sounds like Teleport might not be sending a SIGHUP, or possibly that's the responsibility of the shell in some cases.
If it doesn't do that, then I suspect it's leaking ttys, and possibly there are zombies or other processes left running that could eventually exhaust some resource. So OSs need to clean up sessions by default to avoid those issues, not just Unix.
I realize that doesn't address your issue here, but it sounds like it's a bug in the Capistrano script. All you would need to do is place a `nohup` before the command it's running, and that should solve it in theory. Have you tried that?
Thanks for the clarification. I will try the recommendation of putting `nohup` in front of it. Like I said, I am not an expert on the matter; I am just trying to understand what is going on. If popular tools like Redis get killed after opening up a Teleport/SSH session, yet keep working as long as you do not restart Teleport/SSH, then that is at least something people should understand. Maybe systemd is more the cause of the problem in our situation.
OK, after reading some more I see why you suggest using `nohup`. But I am still not understanding something. Is my assumption correct that if a Teleport session ends (by exit or disconnect), it will clean up all running processes? Then it must be related to how systemd decides which processes get killed when a session ends. I am curious to see what would happen if I did NOT run the teleport binary through systemd, opened up a session, started some processes, and then exited. What would be the expected behaviour of Teleport?
Is there a solution to fix this problem?
I don't think we can simply compare Teleport with OpenSSH.
To be clear: OpenSSH is a system service that we rarely touch during normal operation, whereas Teleport is third-party software that we upgrade and maintain regularly to keep it working properly and to follow the official release cycle.
From a security perspective, we use Teleport to audit users' remote login activity. In practice, this problem meant that maintaining an audit tool caused the abnormal termination of users' programs.
As a workaround for now, you can set `KillMode=process` in your systemd unit file for Teleport. This should enable `sshd`-like behaviour, where processes started via Teleport sessions will stay running after Teleport exits.
I will look into ways we can fix this more permanently.
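A minimal sketch of that workaround as a systemd drop-in, so it survives package upgrades (assuming the unit is named `teleport.service`):

```ini
# /etc/systemd/system/teleport.service.d/override.conf
# (systemctl edit teleport creates this file for you)
[Service]
KillMode=process
```

If you create the file manually rather than via `systemctl edit`, run `sudo systemctl daemon-reload` afterwards. With `KillMode=process`, systemd only signals the main teleport process on stop/restart and leaves the other processes in the cgroup alone.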
Hey man, it's been over a year... If setting `KillMode=process` is the correct solution, should it be added to the documentation to avoid more people experiencing the same problem?
If it is the correct solution, then yes. That’s what we need to determine.
Does anyone have a working reproduction for this issue on a modern (~4.3) version of Teleport? I've been trying to reproduce this problem, but can't make it happen.
@webvictim Our node running Teleport 6.2.7 with `PIDFile=/run/teleport.pid` is still having this problem. `KillMode=process` fixed it for now.
Had a similar issue in a production environment with v9 and v10; the environment was Ubuntu 18.04 with our own systemd file. We will research `KillMode=process` to be thorough, but we also had success resolving it by enabling the PAM integration configuration on the edge nodes:

```yaml
ssh_service:
  pam:
    enabled: true
    service_name: "sshd"
```

Our clue to this idea was finally realizing that some of our shell environment scripts/vars that get called on SSH login were not being populated on Teleport login either.
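One way to sanity-check this, assuming the node's `sshd` PAM stack includes `pam_systemd` so that logins get registered with logind:

```bash
# With PAM integration enabled, a Teleport login should land in a logind
# session scope under user.slice rather than in teleport.service:
tsh ssh user@node 'cat /proc/self/cgroup'
# Expect something like .../user.slice/user-1000.slice/session-42.scope
# instead of .../system.slice/teleport.service
```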
> Does anyone have a working reproduction for this issue on a modern (~4.3) version of Teleport? I've been trying to reproduce this problem, but can't make it happen.
I can confirm. I had this issue on production servers just yesterday, using Teleport v13.2.3. Due to some maintenance we'd done a few months ago, we had some production-critical filesystems mounted via Teleport sessions. A restart of the teleport process (I was reinstalling it on a few servers via Ansible) caused the mounts to be unexpectedly unmounted and caused an outage.
If you want a way to reproduce this: open a shell via Teleport and use `mount` to mount a filesystem (in this case it was `ceph-fuse`, so a user process, but one that is entirely separate from its parent, not a child, and one that remains if the shell is closed). You'll see that the process is actually killed if the teleport process is killed or the service is stopped.
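A sketch of that reproduction; the mount point is a placeholder, and while `ceph-fuse` matches the scenario above, any long-lived user-space mount helper should behave the same:

```bash
# Inside a Teleport session on the node:
sudo ceph-fuse /mnt/cephfs        # starts a long-lived ceph-fuse process
exit

# From the node's console (or another access path), restart Teleport:
sudo systemctl restart teleport

# Without KillMode=process, the ceph-fuse process is killed and the
# filesystem is gone:
mount | grep cephfs               # prints nothing after the restart
```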
Adding `KillMode=process` to `teleport.service` did resolve this, but finding this issue open since 2018, I'm surprised that the `KillMode` hasn't been added to the service file in all this time, especially since many people have had production outages due to essential services getting killed in this way.
This issue was raised a long time ago, but the fix has never been merged... I don't know why the maintainers haven't dealt with it yet; this seems like a time bomb to most users...
So, `KillMode=process` is a workaround, but not the best solution. (Note: `KillMode=control-group` is the default setting.)
To recreate the issue, see the example sketched below.
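A minimal sketch of a reproduction, assuming the stock `teleport.service` with the default `KillMode` (the `sleep` stands in for any long-running process started in a session):

```bash
# Open a Teleport session on the node and start a long-lived process:
tsh ssh user@node
nohup sleep 3600 &
exit

# On the node: the process lives in Teleport's cgroup, not user.slice.
systemd-cgls /system.slice/teleport.service

# Restarting Teleport makes systemd signal everything in that cgroup;
# nohup doesn't help, because systemd sends SIGTERM, not SIGHUP:
sudo systemctl restart teleport
pgrep -f 'sleep 3600'             # no output: the process was killed
```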
What is happening: normally, when a user logs in via SSH and starts a process, it is placed in a `user.slice` cgroup, but when they log in with Teleport, it is placed in the `teleport.service` cgroup. Therefore, when Teleport is restarted, systemd sends SIGTERM to all processes in the `teleport.service` cgroup because of the default setting mentioned above.
The better solution would be to start the impacted process in a different cgroup than `teleport.service`. I have a query in with Teleport support for recommendations, or documentation, on how best to do this.
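One way to do that by hand (my own workaround, not official Teleport guidance): launch the workload as a transient systemd unit, which puts it in its own cgroup outside `teleport.service`. Unit name and paths below are illustrative:

```bash
# Run the process as its own transient unit:
sudo systemd-run --unit=redis-example redis-server /etc/redis/redis_example.conf

# The process now lives under redis-example.service, so restarting
# Teleport no longer signals it:
systemctl status redis-example.service
```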
Teleport support asked us to enable PAM integration and that fixed the issue for us.
https://goteleport.com/docs/enroll-resources/server-access/guides/ssh-pam/
What happened:
All child processes are killed when Teleport is restarted.
What you expected to happen:
Child processes should not be killed after Teleport restarts.
How to reproduce it (as minimally and precisely as possible):
Start Teleport with systemd, run the `redis-server redis_example.conf` command to start Redis, then restart Teleport; the redis-server process is killed.

Environment:
- Teleport version (`teleport version`): Teleport v3.0.1 git:v3.0.1-0-g4ff9a7b0
- Tsh version (`tsh version`): Teleport v3.0.1 git:v3.0.1-0-g4ff9a7b0
- Browser environment:
Relevant Debug Logs If Applicable:
I have found the cause of the problem: the example systemd service file is not configured with the `KillMode` parameter. Is this a bug? I will submit a PR to fix it.
About systemd.kill: https://www.freedesktop.org/software/systemd/man/systemd.kill.html