gravitational / teleport

All child processes are killed when Teleport is restarted! #2355

Open mritd opened 6 years ago

mritd commented 6 years ago

What happened:

All child processes are killed when Teleport is restarted.

What you expected to happen:

Child processes should not be killed after restarting Teleport.

How to reproduce it (as minimally and precisely as possible):

Start Teleport with systemd, run the redis-server redis_example.conf command to start Redis, then restart Teleport; the redis-server process is killed.
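
A sketch of that reproduction as shell commands (the user and node names are placeholders, and this assumes Teleport runs as a systemd unit named teleport.service):

  # on the node
  sudo systemctl start teleport
  # from a client, open a session through Teleport
  tsh ssh user@node
  # inside the session, start Redis in the background
  redis-server redis_example.conf &
  # back on the node, restart Teleport: the redis-server process is killed
  sudo systemctl restart teleport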

I have found the cause of the problem. The example systemd service file does not set the KillMode parameter; is this a bug? I will submit a PR to fix it.

About systemd.kill: https://www.freedesktop.org/software/systemd/man/systemd.kill.html
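
For context, a minimal sketch of what such a fix looks like in the unit file's [Service] section (the ExecStart line here is illustrative, not the actual example file):

  [Service]
  ExecStart=/usr/local/bin/teleport start --config=/etc/teleport.yaml
  # only signal the main teleport process on stop/restart, instead of every
  # process in the unit's cgroup
  KillMode=process

With KillMode=process, systemd stops signalling the whole control group, so processes started from Teleport sessions survive a restart of the service.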

andrewbanchich commented 6 years ago

I think I've run into this as well. I tried a few times to execute something like systemctl stop teleport && another thing && another thing && systemctl start teleport and it always just disconnected and never came back up for me.

mritd commented 6 years ago

Because of this problem, one unfortunate morning I killed the production Redis server...

butsjoh commented 5 years ago

We experienced the same issue. Setting KillMode seems to be the solution. We would still like to know whether the attached PR is going to be merged, and whether setting KillMode causes any problems for teleport.

butsjoh commented 5 years ago

It would also be useful if somebody could explain why some processes get grouped under the teleport process (probably something to do with control groups and systemd) :)

cove commented 5 years ago

Does this happen with OpenSSH too? Starting a process in your terminal and then logging out should cause that process to get a SIGHUP, so the bug here seems to be that Teleport didn't kill the process on logout, but only when Teleport was restarted.

If you want to start a process in your terminal and have it continue to run, you'd use nohup to start it, or do an exec $SHELL just before, to orphan your process group. Otherwise you should expect the process to be killed on logout.
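
A sketch of that suggestion, using the redis command from this issue (the log path is arbitrary):

  # ignore SIGHUP and detach the job from the current shell
  nohup redis-server redis_example.conf > /tmp/redis.log 2>&1 &
  disown
  # note: as later comments explain, the process still lands in Teleport's cgroup,
  # so this alone does not protect it from a teleport restart under the default KillMode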

butsjoh commented 5 years ago

Most systemd unit files for openssh set KillMode=process, hence the question whether teleport should not have it as well, since it acts in a similar way. I understand that you need to start your processes so they get daemonized somehow. But in our case we use a deploy tool (Capistrano) which boots up the processes for our Rails app. All fine and dandy until we upgrade teleport and do service teleport restart. All of a sudden, all the processes that were started using the deploy tool stop working. After we set KillMode=process, that no longer happens. We have seen this happen mostly with processes that use a socket (god, puma, ...).

What we do not understand (and that is probably not teleport's fault, we are lacking Unix knowledge) is why the processes that we started using the deploy tool fall under the teleport process when we restart it. If we do a service teleport status -l, we see all our processes hanging under the main teleport process. If you or somebody else can enlighten us, that would be great.

But for now we set KillMode=process to avoid the issues mentioned above, and maybe the PR really is valid and should be merged.

cove commented 5 years ago

If you have to do it with OpenSSH too, and folks are modifying those configs as well, then that would suggest this is the traditional behavior across OSs: when you log out, the system needs to clean up your session. Though it sounds like Teleport might not be sending a SIGHUP, or it's possible that's the responsibility of the shell in some cases.

If it doesn't do that, then I suspect it's leaking ttys, and possibly there are zombies or other processes left running that could eventually exhaust some resource. So OSs need to clean up sessions by default to avoid those issues, not just Unix.

I realize that doesn’t address your issue here, but it sounds like it’s a bug in the Capistrano script. All you would need to do is place a nohup before the command it’s running and that should solve it in theory. Have you tried that?

butsjoh commented 5 years ago

Thanks for the extra clarification. I will try the recommendation of putting nohup in front of it. Like I said, I am not an expert on the matter; I am just trying to understand what is going on. If popular tools like Redis get killed after opening a teleport/ssh session, and keep working only as long as you do not restart teleport/ssh, then that is at least something people should understand. Maybe systemd is more the cause of the problem in our situation.

butsjoh commented 5 years ago

Ok, after reading some more I see why you suggest using nohup. But I am still not understanding something. Is my assumption correct that if a teleport session ends (by exit or disconnect) it will clean up all running processes? Then it must be related to how systemd decides which processes get killed when a session ends. I am curious what would happen if I did NOT run the teleport binary through systemd, opened up a session, started some processes and then exited. What would be the expected behaviour of teleport?

mritd commented 4 years ago

Is there a solution to fix this problem?

I don't think we can simply compare teleport with openssh.

To be clear: openssh is a system service that we rarely touch during normal operation... teleport is third-party software that we upgrade and maintain regularly to make sure it keeps working and to follow the official release upgrades.

From a security perspective, the purpose of using teleport is to audit users' remote login activity. In practice, maintaining this audit tool caused users' programs to be terminated abnormally.

webvictim commented 4 years ago

As a workaround for now, you can set KillMode=process in your systemd unit file for Teleport. This should enable sshd-like behaviour where processes started via Teleport sessions will stay running after Teleport exits.

I will look into ways we can fix this more permanently.
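
A sketch of applying that workaround as a systemd drop-in, without editing the packaged unit file (this assumes the unit is named teleport.service):

  sudo mkdir -p /etc/systemd/system/teleport.service.d
  printf '[Service]\nKillMode=process\n' | sudo tee /etc/systemd/system/teleport.service.d/override.conf
  sudo systemctl daemon-reload
  sudo systemctl restart teleport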

mritd commented 4 years ago

Hey man, it's been over a year... If setting KillMode=process is the correct solution, should it be added to the documentation to avoid more people running into the same problem?

webvictim commented 4 years ago

If it is the correct solution, then yes. That’s what we need to determine.

webvictim commented 4 years ago

Does anyone have a working reproduction for this issue on a modern (~4.3) version of Teleport? I've been trying to reproduce this problem, but can't make it happen.

lamhoangtung commented 3 years ago

@webvictim Our node running teleport 6.2.7 with PIDFile=/run/teleport.pid still has this problem.

Setting KillMode=process fixed it for now.

Zaephor commented 1 year ago

Had a similar issue in a production environment with v9 and v10; the environment was Ubuntu 18.04 with our own systemd unit file. We will research KillMode=process to be thorough, but we also had success resolving it by enabling the PAM integration in the configuration on the edge nodes:

ssh_service:
  pam:
    enabled: true
    service_name: "sshd"

Our clue to this was finally realizing that some of our shell environment scripts/variables that run on ssh login were not being applied on teleport login either.

kakaroto commented 1 year ago

> Does anyone have a working reproduction for this issue on a modern (~4.3) version of Teleport? I've been trying to reproduce this problem, but can't make it happen.

I can confirm. I had this issue on production servers just yesterday, using teleport v13.2.3. Due to some maintenance we'd done a few months ago, we had some production-critical filesystems mounted via teleport sessions. Restarting the teleport process (I was reinstalling it on a few servers via ansible) caused the mounts to be unexpectedly unmounted and caused an outage.

If you want a way to reproduce this: open a shell via teleport and use mount to mount a filesystem (in this case it was ceph-fuse, so a user-space process, but one that is entirely separate from its parent, so not a child, and one that remains if the shell is closed). You'll see that the process is killed if the teleport process is killed or the service is stopped. Adding KillMode=process to teleport.service did resolve this, but finding this issue open since 2018, I'm surprised that KillMode hasn't been added to the service file in all this time, especially since many people have had production outages due to essential services being killed this way.
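
A sketch of that reproduction (the mount point and names are placeholders):

  tsh ssh user@node
  sudo ceph-fuse /mnt/cephfs        # user-space mount; a separate process that outlives the shell
  exit                              # the mount is still there after the session ends
  # then, on the node:
  sudo systemctl restart teleport   # without KillMode=process, ceph-fuse is killed and the mount disappears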

mritd commented 1 year ago

This issue has been open for a long time, but the fix has never been merged... I don't know why the maintainers haven't dealt with it yet; this seems like a time bomb for most users...

shannonpeeveyunlv commented 4 months ago

So, KillMode=process is a workaround, but not the best solution. (Note: KillMode=control-group is the default setting).

Recreate the issue:

  1. User logs in via teleport
  2. User starts a process, in our case it was vendor software.
  3. Run systemd-cgls and check to see if the new process spawned in the teleport.service control group (cgroup).

Example: (screenshot of systemd-cgls output showing the spawned process under teleport.service)
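
A sketch of that check from a shell on the node (the unit name is assumed to be teleport.service):

  # show the control-group tree for the Teleport unit; processes started from a
  # Teleport session appear here instead of under user.slice
  systemd-cgls --no-pager /system.slice/teleport.service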

What is happening: Normally, when a user logs in via SSH and starts this process, it is placed in the user.slice cgroup, but when they log in with teleport it is placed in the teleport.service cgroup. Therefore, when teleport is restarted, systemd sends SIGTERM to all processes in the teleport.service cgroup because of the default setting mentioned above.

The better solution would be to start the impacted process in a different cgroup than teleport.service. I have a query open with Teleport support for recommendations or documentation on how best to do this.
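
One possible approach with stock systemd tooling, as a sketch rather than a Teleport-endorsed recommendation (the unit name and command are hypothetical), is to launch the process as a transient unit so it gets its own cgroup:

  # run the long-lived process as its own transient unit, outside teleport.service
  sudo systemd-run --unit=vendor-app.service /opt/vendor/bin/app
  # a teleport restart no longer touches it
  systemd-cgls --no-pager /system.slice/vendor-app.service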

shannonpeeveyunlv commented 3 months ago

Teleport support asked us to enable PAM integration and that fixed the issue for us.

https://goteleport.com/docs/enroll-resources/server-access/guides/ssh-pam/