microsoft / node-pty

Fork pseudoterminals in Node.JS
Other
1.46k stars 235 forks source link

POSIX Spawn 2: Electric Boogaloo #487

Closed Eugeny closed 2 years ago

Eugeny commented 3 years ago

Here's another go at posix_spawn, with a helper handling cwd/uid/gid/CT/setuid and then exec'ing itself into the new process.

jerch commented 3 years ago

Wow, thats like a full rewrite of the whole process setup handling in node-pty, removing the fork stuff. :+1:

The bad news are that posix_spawn() is still taking 300ms here

Hmm thats weird, any idea yet why? From what I've read, the problem with fork is the sandboxing on BigSur, where the memory transfer to the child process takes very long. In theory this problem should be gone with posix_spawn, since the process gets not clone-copied from the parent anymore. Is that actually faster than fork? Is the slowdown related to the helper in between (is the direct spawning much faster)?

Lemme me know if you need some help or when to review, can at least assist for linux and freebsd to some degree.

Eugeny commented 3 years ago

Thanks for the review - I'll go through it tomorrow. It's still 2x faster than fork but I've expected it to be way faster than that. I've profiled it and it's posix_spawn itself taking these 300ms, even with eg. /bin/df as file. I've just really didn't want to parse argv in the helper with my rusty C hence the socket stuff :D It was DGRAM based initially but can be replaced with a pipe now.

jerch commented 3 years ago

Thanks for the review - I'll go through it tomorrow. It's still 2x faster than fork but I've expected it to be way faster than that. I've profiled it and it's posix_spawn itself taking these 300ms, even with eg. /bin/df as file.

I see, well thats unfortunate. Is there a mystic flag or something else, that can be set to speed up things?

I've just really didn't want to parse argv in the helper with my rusty C hence the socket stuff :D It was DGRAM based initially but can be replaced with a pipe now.

Well you'd have to parse the numbers back for gid/uid back, but thats quite easy. You dont need to decompose the whole argv thingy, for exec you can simply pass it on, e.g. execvp(argv[2], &argv[2]) (if the command starts in ~second~ third position).

jerch commented 3 years ago

Btw solaris does not know the TIOCSCTTY ioctl. To properly attach a controlling terminal there, it is needed to do an open call to it, e.g. something like this (error handling skipped):

#if defined(TIOCSCTTY)
  ...
#else
  char *slave_path = ttyname(STDIN_FILENO);
  // open implicit attaches a process to a terminal device if:
  // - process has no controlling terminal yet
  // - O_NOCTTY is not set
  close(open(slave_path, O_RDWR));
#endif
Eugeny commented 3 years ago

I think it's worth considering, as the next step after this, doing the spawn itself in an async worker, to at least avoid locking up the Electron's UI thread.

Eugeny commented 3 years ago

Please take another look at it when you'll have time - I think this is as clean as it gets.

Still needs testing on Linux and other UNIXes that I don't have access to.

Ideally I would love this to get merged first and then work on an async spawn() separately

jerch commented 3 years ago

Looks pretty cleaned up at a first glance, will review and test it later on under Ubuntu 18, FreeBSD 12 and OpenIndiana (solaris clone), which should cover most prominent POSIX systems. Dont have access to OpenBSD, NetBSD or AIX (to name a few more that are less common, not even sure if those have a working nodejs port).

mofux commented 3 years ago

@Eugeny Do you know if the new spawn helper will have to be manually code-signed on OSX when packaging an Electron App? I'm a little bit concerned the extra helper may complicate thinks in that matter.

Eugeny commented 3 years ago

@mofux yes, but existing native modules already need to be signed anyway. Good news is that codesign takes care of this automatically when signing an app bundle

Eugeny commented 3 years ago

@jerch re performance: turns out, this is due to file closing. Removing the fd closer loop gets it down to the same 1-3 ms as fork on my machine.

Eugeny commented 3 years ago

All done, perf is back to normal with the tradeoff of keeping FDs open. Error handling is revamped and passes errno back to the parent. Also added POSIX_SPAWN_USE_VFORK for a slight perf boost when available

Eugeny commented 3 years ago

After loading some large modules and allocating & filling a 100 MB buffer first:

(all of the performance numbers here and above are on Linux)

jerch commented 3 years ago

Oh well, yeah then lets not go with that security measure by default.

I suggest to keep it as configurable option, so ppl can opt-in here under more dangerous situations. Maybe with an additional check whether POSIX_SPAWN_CLOEXEC_DEFAULT was defined (assuming we dont need to loop-close FDs if the system already did it for us).

The danger is real, see https://labs.portcullis.co.uk/blog/exploiting-inherited-file-handles-in-setuid-programs/. And such a dangerous situation can easily be made up - imagine some terminal hub service, that pulls the authentication/authorization from some DB socket in the parent process, then uses node-pty to create ptys with dropped privileges - boom, the DB socket got leaked possibly open for shady manipulations on the auth parts.

Eugeny commented 3 years ago

I'm a dumbass who doesn't know how to benchmark stuff. Only the first spawn() takes 300 ms on macOS, subsequent spawns are all <5 ms.

Looks like an async spawn() won't be necessary after all!

Eugeny commented 3 years ago

Released it in the latest Tabby beta - let's see how it goes.

jerch commented 3 years ago

@Eugeny Sorry could not yet test it on FreeBSD, but I expect no bigger surprises there (the pty impl is very similar to linux). And it works really nice and fast on linux.

At first I was a bit unsure about the VFORK option, since it is blocking and sharing all memory with the parent, but after some reading it is prolly safe for the instant exec done by posix_spawn, and alot faster.

Eugeny commented 3 years ago

No problem. vfork() + exec() seems to be the recommended way to spawn processes in Linux, and although there's only a limited set of calls that are allowed to happen after a vfork, it should be safe to assume that glibc does it right.

jerch commented 3 years ago

@Eugeny FYI - freebsd tested - works. (Well, some tests fail for other reasons, but not from your code...)

Edit: Almost forgot to mention - freebsd does not include errno.h from any of the other headers in spawn_helper, thus needs

#include <errno.h>

explicit in spawn-helper.cc.

Eugeny commented 3 years ago

Done.

Eugeny commented 3 years ago

Let me know if anything else needs to be done to get this merged 🙌 Tabby v155 reports 421 active users on Mac and Linux, no bug reports and nothing coming in on Sentry.

jerch commented 3 years ago

@Eugeny You get a :+1: from my side.

@Tyriar Any chance to look at this soon? Imho it affects quite some peeps...

Tyriar commented 2 years ago

@deepak1556 could you review this when you get a chance?

deepak1556 commented 2 years ago

@Eugeny @jerch Thanks for working on this change! I apologize for my delay in review, will test and add my comments this week.

jerch commented 2 years ago

Isn't this embarrassing? A lot of ppl suffer from this, still this PR is not worth a ~15min review/feedback. Guys if you dont feel like maintaining this lib in a healthy state - just say so. Pretty sure there are enough ppl out there to take ownership in a responsible way, even without that CLA nonsense. @Tyriar Told you more than once, that this lib has serious I/O issues, but nothing ever happened. Taking responsibility for OSS repos is not just about "hey it works for our vscode use case, so all fine" - nope, for infrastructure critical libs like this one it has to fit the broader picture dealing with all interface assumptions the underlying OS does. The fact, that this lib swallows I/O if used wrongly (mind you, that wrong usage means any command that does not enter a REPL loop itself within PIPE_BUF, e.g. not a shell), is just a big fault. Now ppl get a perceivable degradation on BigSur+, still nothing happens. Where is the bar for you to take action for real?

deepak1556 commented 2 years ago

@jerch not sure if you are following the upstream libuv refactor for posix_spawn which was the original inspiration for the suggestion of this change in the module https://github.com/microsoft/node-pty/issues/476. The PR also will suffer from the kernel bug mentioned in https://github.com/libuv/libuv/pull/3257#issuecomment-900707276, I am waiting to follow what libuv does before taking any further action here from my side. Also to clarify, degradation should be perceived on Big Sur+ only if the module gets code signed, it doesn't happen in other cases as outlined in https://github.com/libuv/libuv/issues/3050

I don't have any additional code reviews for this PR at the moment, I will let you and @Tyriar make the decision of merging the PR in its current state.

jerch commented 2 years ago

@deepak1556 Thx for the update on the issues behind, thats at least some information to work with (and no, I dont follow any upstream discussion on foreign repos). Itsn't a pity that I have to get cocky to get that information? Anyway, thx for the quick response.

deepak1556 commented 2 years ago

Sorry about that, I could have done a better communication from my side beforehand. I mostly get tied up with other core work that this kept falling through the cracks but definitely not a valid reason for slow response on this PR that is open for quite sometime.

jerch commented 2 years ago

I want to apologize for my raging outburst above, which came out way too directed at @Tyriar. There is nothing to defend my words, in fact @Tyriar does a great job and is known for having moved things by big leaps in the past. Many thanks for that.