Add prewarm option for additional ssh kitten connections

groves commented 2 years ago

Is your feature request related to a problem? Please describe. It takes about 200 milliseconds to ssh into a host I'm using with the ssh kitten. That's the average value I get running time kitty +kitten ssh jojr true. I get an average of ~30 millis from time ssh jojr true, which also has ControlMaster on. That additional 170 millis is a noticeable pause when I pop open a new window with launch --cwd=current or similar commands.

Describe the solution you'd like When the ssh kitten finishes connecting, it launches another ssh kitten in the background. The next request for an ssh kitten to that host uses that already connected kitten. That should make it appear that connecting is instantaneous. Once that connection is in use, another background kitten could be launched to prewarm for the next request.

I'd think this should be behind an option and off by default. It would mean that any files or configuration that have changed since the previous invocation that would affect the kitty ssh bootstrap would be out of date, and I think users would want to be aware of that before enabling this. It would also need a way to update the cwd of the remote connection right before handing over the prewarmed connection since that will likely have changed since the bootstrap was run.

Describe alternatives you've considered I did some profiling of the ssh kitten to see if I could speed it up enough to make it not an issue. I added time.monotonic calls to the start of execution in ssh/main.py and to right before ssh/main.py executes ssh. Those showed that it took 100 millis of Python execution to get to the start of ssh/main.py's code and another 50 millis to get to executing ssh. Since over half the additional time taken by the ssh kitten over raw ssh was in Python startup and importing, I didn't think I'd be able to optimize the ssh kitten enough to make a difference.

I've looked at using ssh directly myself, but that means reimplementing the cwd support of launch as far as I can tell.

Additional context If this sounds like a reasonable thing to add, I'd be happy to start on a PR for it. I'm also happy to work on other approaches if there's a better way to go about making additional ssh connections fast.

kovidgoyal commented 2 years ago

How are you getting 100ms and 50ms that does not sound correct at all. For instance running

time kitty +kitten ssh -h

gives

real 0.064 user 0.054 sys 0.010

aka 64 ms for running main and executing ssh itself. I doubt my laptop is really twice as fast as your machine. You can try running with PYTHONPROFILEIMPORTTIME=1 to see whats causing the delays.

Adding a return to line 702 so that ssh is not actually called but all processing is done gives me time kitty +kitten ssh localhost real 0.093 user 0.075 sys 0.017

Finally: time kitty +kitten ssh localhost true

real 0.132 user 0.084 sys 0.013

And python 3.11 should reduce these times by about 20%

groves commented 2 years ago

How are you getting 100ms and 50ms that does not sound correct at all.

Indeed it wasn't correct. Thanks for catching that.

I was running with KITTY_DEVELOP_FROM set to let me change the Python code. If I don't have it set, I get more like 80 millis for a no-op ssh invocation:

groves@baladi ~ [255]> time kitty +kitten ssh -V
usage: ssh [-46AaCfGgKkMNnqsTtVvXxYy] [-B bind_interface]
           [-b bind_address] [-c cipher_spec] [-D [bind_address:]port]
           [-E log_file] [-e escape_char] [-F configfile] [-I pkcs11]
           [-i identity_file] [-J [user@]host[:port]] [-L address]
           [-l login_name] [-m mac_spec] [-O ctl_cmd] [-o option] [-p port]
           [-Q query_option] [-R address] [-S ctl_path] [-W host:port]
           [-w local_tun[:remote_tun]] destination [command]

________________________________________________________
Executed in   86.01 millis    fish           external
   usr time   67.78 millis    0.37 millis   67.42 millis
   sys time   14.49 millis    1.08 millis   13.41 millis

I'm guessing the frozen stuff from bypy is making a big difference there?

Without the KITTY_DEVELOP_FROM penalty, I still see a noticeable slowdown from the ssh kitten as opposed to direct ssh. Here's the kitten's timing, which is with the shared session established and is representative from several runs:

groves@baladi ~> time kitty +kitten ssh localhost true

________________________________________________________
Executed in  258.17 millis    fish           external
   usr time   97.41 millis    0.39 millis   97.02 millis
   sys time   18.91 millis    1.16 millis   17.75 millis

And here's direct ssh timing:

groves@baladi ~/.ssh> time ssh localhost true

________________________________________________________
Executed in   42.21 millis    fish           external
   usr time    3.12 millis  365.00 micros    2.76 millis
   sys time    4.12 millis  885.00 micros    3.23 millis

I'm doing things like popping a file fuzzy finder on the remote host from a kitty keybinding, so the slowness is particularly noticeable as it's in my editing flow. I'm still motivated to speed this up as a result, either via the prewarming or some other mechanism.

kovidgoyal commented 2 years ago

Well, it is always going to be slower than plain ssh+controlmaster since it has to do a lot more work both before invoking ssh and after invoking but before the remote command is executed.

You could possibly get rid of the 60ms python startup+import time by keeping a process waiting around but this seems like overkill. 250ms isnt very long when connecting to a remote host and if it is then shaving 60ms of it isnt going to help much. For your use case I would just make a dedicated mapping that runs either plain ssh or a kitten to get the controlmaster port and then run plain ssh.

groves commented 2 years ago

Hrmm, but if we kept an ssh connection waiting around, wouldn't that eliminate the entire 250ms? That's what makes it seem worthwhile to me. I agree that it's overkill for 60ms.

kovidgoyal commented 2 years ago

On Thu, Jun 02, 2022 at 10:02:17AM -0700, Charlie Groves wrote:

Hrmm, but if we kept an ssh connection waiting around, wouldn't that eliminate the entire 250ms? That's what makes it seem worthwhile to me. I agree that it's overkill for 60ms.

You cant keep an ssh connection waiting around unless you are willing to eliminate following cwd. The best you can do is have the kitten wait just before building the tarball to send.

kovidgoyal commented 2 years ago

Thinking about it some more there is a prewarm I would be willing to add to kitty, but one that is generic for all kittens not specific to the ssh kitten. Basically, the idea would be to keep a background process that finishes all python initialization and imports most commonly used modules and then waits. When running kittens kitty will then pass the kitten envvars, cli, cwd and stdin to this waiting process and it will run the kitten by forking. This avoids the entire python initializationa nd import time overhead for all kittens.

The slightly tricky part would be to successfully replace the stdio handles in the already initialized python. EDIT: Thankfully, os.dup2() works fine for this.

If you want to help with this let me know.

groves commented 2 years ago

Sure, I'd be interested. Would you like me to take a swing given your general idea of swapping in a prewarmed python with dup2, or would you like to see more details before I throw some code together?

kovidgoyal commented 2 years ago

No need I have already mostly implemented it in master. Feel free to kick the tires once its done.

kovidgoyal commented 2 years ago

prewarming for all kittens is now fully implemented in the prewarm branch. It works when runnign any kitten via keybindings or remote control. Note the very first time you run a kitten in a given kitty instance there wont be any benefit, after that prewarming will kick in.

kovidgoyal commented 2 years ago

And I have now removed the limitation that the first kitten invocation is not prewarmed. It's now merged into master and will be in the next release.

groves commented 2 years ago

I piped kitty --debug-keyboard --dump-commands into a script that prepends each line with the process time in nanos. In the spawned kitty, I hit the "new window" keyboard shortcut in a window running the ssh kitten. I then grabbed the time for handled as shortcut being printed and the time for the final draw command for my prompt on the remote host.

That took 166 millis on average with the kitty nightly and 202 millis in the released version. 36 millis faster, so seems like the prewarming is working to me. Thanks for implementing it!

Do I understand correctly that for my custom kittens, they'll use a prewarmed process if launched via shortcut or remote control but that it won't have their custom code imported? They seem to be working fine, just making sure that they're actually affected and worth testing as part of this.

Anything else you'd like me to try?

kovidgoyal commented 2 years ago

On Tue, Jun 14, 2022 at 02:25:13PM -0700, Charlie Groves wrote:

I piped kitty --debug-keyboard --dump-commands into a script that prepends each line with the process time in nanos. In the spawned kitty, I hit the "new window" keyboard shortcut in a window running the ssh kitten. I then grabbed the time for handled as shortcut being printed and the time for the final draw command for my prompt on the remote host.

That took 166 millis on average with the kitty nightly and 202 millis in the released version. 36 millis faster, so seems like the prewarming is working to me. Thanks for implementing it!

You're welcome, it was a fun technical challenge.

Do I understand correctly that for my custom kittens, they'll use a prewarmed process if launched via shortcut or remote control but that it won't have their custom code imported? They seem to be working fine, just making sure that they're actually affected and worth testing as part of this.

Yes, that's correct but importing one or two python modules is not slow, should be sub millisecond.

Anything else you'd like me to try?

Well I am working on an implementation of prewarming that also works for kittens launched at the shell inside kitty (think of icat or the diff kitten). If that works out, I will post here.

kovidgoyal commented 2 years ago

The prewarm implementation for kittens at the shell is now in master.

page-down commented 2 years ago

@kovidgoyal I gave this a quick test.

kitty @ ls | less
Press qq and (END)qq is displayed.
The shell displays fish: Unknown command: q after a carriage return, so it looks like less has exited on the first q.

How to fix this?

Would you mind adding KITTY_PREWARM_SOCKET_REAL_TTY to glossary.rst?

kovidgoyal commented 2 years ago

Should be fixed by: https://github.com/kovidgoyal/kitty/commit/f637bf2377b66a836b92d0049ea6d23d46217994

page-down commented 2 years ago

Should be fixed by: f637bf2

... In theory we might need to check STDIN as well but that is very unusual.

I noticed that the beginning of the remote control tutorial mentions ls | kitty @ send-text --stdin, and the following does not work properly.

ls | fzf | kitty @ send-text --stdin --match ...

kovidgoyal commented 2 years ago

I doubt that would work without socket prewarming either. If it does it would be very accidental. In general you cannot run multiple programs in a pipeline that expect to do terminal IO. Here both fzf and kitty @ are expecting to do terminal IO, simultaneously. Indeed, your initial example of kitty @ ls | less only works because less doesn't do much until it starts receiving data and kitty @ ls only sends data after it has finished doing terminal IO. In such cases you can always do

whatever | KITTY_PREWARM_SOCKET= kitty @ send-text

kovidgoyal commented 2 years ago

And with this commit: https://github.com/kovidgoyal/kitty/commit/609d42e2bca742613ccdccbd2152e835cd8a19ee on my linux box at least even using fzf works, but I should emphasize, this is very much a hack, you cant really have multiple programs thinking they own the tty at the same time.

page-down commented 2 years ago

... you cant really have multiple programs thinking they own the tty at the same time.

Thanks for the explanation, these two programs will work simultaneously and will cause problems. In this case, fzf is not aware of the existence of kitty @. If a program that owns tty wants to call another program that does some terminal io, it should be handled properly.

For fzf, I guess the following will work.

ls | fzf --bind 'ctrl-y:execute(printf {} | kitty @ send-text --match recent:1 --stdin)'

page-down commented 2 years ago

@kovidgoyal I noticed that when kitty.app is launched by OS (e.g. Launchpad), kitty +list-fonts crashes. kitty.app is built with debug and sanitize arguments.

Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib              0x7ff81600cdba __abort_with_payload + 10
1   libsystem_kernel.dylib              0x7ff81600e877 abort_with_payload_wrapper_internal + 80
2   libsystem_kernel.dylib              0x7ff81600e827 abort_with_reason + 19
3   libobjc.A.dylib                     0x7ff815ed9aee _objc_fatalv(unsigned long long, unsigned long long, char const*, __va_list_tag*) + 114
4   libobjc.A.dylib                     0x7ff815ed9a7c _objc_fatal(char const*, ...) + 138
5   libobjc.A.dylib                     0x7ff815ecffa0 performForkChildInitialize(objc_class*, objc_class*) + 299
6   libobjc.A.dylib                     0x7ff815ebb177 initializeNonMetaClass + 617
7   libobjc.A.dylib                     0x7ff815ebaf6a initializeNonMetaClass + 92
8   libobjc.A.dylib                     0x7ff815ebac18 initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 232
9   libobjc.A.dylib                     0x7ff815eba995 lookUpImpOrForward + 1087
10  libobjc.A.dylib                     0x7ff815ebd53d object_setClass + 88
11  CoreFoundation                      0x7ff816070d78 _CFRuntimeCreateInstance + 544
12  CoreText                            0x7ff8177a67a0 TCFRef<CTFontDescriptor*> TCFBase_NEW<CTFontDescriptor, __CFDictionary const*&>(__CFDictionary const*&) + 26
13  CoreText                            0x7ff8177a674e CTFontDescriptorCreateWithAttributes + 48
14  CoreText                            0x7ff8177d00d0 CTFontCollectionCreateFromAvailableFonts + 91
15  fast_data_types.so                     0x1125c9675 all_fonts_collection + 21 (core_text.m:159)
16  fast_data_types.so                     0x1125ca026 coretext_all_fonts + 22 (core_text.m:165)
17  Python                                 0x11016f446 cfunction_vectorcall_NOARGS + 101
18  Python                                 0x11021bc75 call_function + 158
...
84  Python                                 0x110289921 pymain_run_module + 212
85  Python                                 0x1102894a9 Py_RunMain + 974
86  kitty                                  0x10fc1a6db run_embedded + 3627 (main.c:201)
87  kitty                                  0x10fc19412 main + 1954 (main.c:292)
88  dyld                                   0x11902352e start + 462

kitty(8137,0x10f439600) malloc: nano zone abandoned due to inability to preallocate reserved vm space.
objc[8139]: +[UIFontDescriptor initialize] may have been in progress in another thread when fork() was called.
objc[8139]: +[UIFontDescriptor initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
zsh: abort      kitty +list-fonts

Also, due to recent changes, make again after kitty has been built will prompt an error.

...
  File ".../kitty/setup.py", line 1257, in create_minimal_macos_bundle
    os.symlink(
FileExistsError: [Errno 17] File exists: 'kitty.app/Contents/MacOS/kitty' -> 'kitty/launcher/kitty'

kovidgoyal commented 2 years ago

That crash implies that the forking of the prewarm process is happening after cocoa starts initializing, but if you see main.py the fork is done before glfw (and therefore any cocoa frameworks) is even loaded. Very weird. Does launch services somehow preload cocoa when running applications?

kovidgoyal commented 2 years ago

This should fix it: https://github.com/kovidgoyal/kitty/commit/5153d332e40d7b6be89542fbc1cb08270176dd87

page-down commented 2 years ago

This should fix it: ...

I compared show_kitty_env_vars and forgot that LANG has been modified under macOS, just as PATH has been. That's why I couldn't find the difference, before digging into main.py.

Running the command kitty/launcher/kitty +kitten/+runpy/... in the development directory shows that the code executed is from the current running kitty terminal. The reason for this is perhaps not so obvious for developers unfamiliar with kitty. Is it necessary to document in the dev page that KITTY_PREWARM_SOCKET should be unset (set to empty)? Hopefully that will reduce the number of issues related to this.

kovidgoyal commented 2 years ago

On Sun, Aug 14, 2022 at 10:49:46PM -0700, page-down wrote:

This should fix it: ...

I compared show_kitty_env_vars and forgot that LANG has been modified under macOS, just as PATH has been. That's why I couldn't find the difference, before digging into main.py.

Running the command kitty/launcher/kitty +kitten/+runpy/... in the development directory shows that the code executed is from the current running kitty terminal. The reason for this is perhaps not so obvious for developers unfamiliar with kitty. Is it necessary to document in the dev page that KITTY_PREWARM_SOCKET should be unset (set to empty)? Hopefully that will reduce the number of issues related to this.

You are welcome to send a PR for this. Though there isnt a dev page as such. Maybe in CONTRIBUTING.md

kovidgoyal / kitty

Add prewarm option for additional ssh kitten connections #5159