sandstorm-io / sandstorm

Sandstorm is a self-hostable web productivity suite. It's implemented as a security-hardened web app package manager.
https://sandstorm.io

Launching grains is slow #2975

Open · zenhack opened this issue 7 years ago

zenhack commented 7 years ago

This was just asked about on IRC, and I've noticed it before too: launching grains is unreasonably slow. For concrete numbers: my janky todo app (https://github.com/zenhack/yata), when run outside of the sandbox, starts instantaneously. In contrast, clicking on it in the grain list from my laptop (on wifi, about 15 feet from the server) takes 2-3 seconds before the UI appears. I remember there being some discussion about this on the mailing list w.r.t. davros way back:

https://groups.google.com/forum/#!searchin/sandstorm-dev/davros$20startup|sort:relevance/sandstorm-dev/-mncsxPR7Rg/o3DHo_ynAgAJ

At the time we were working under the assumption that davros was at fault, but I suspect that is not the case, given that startup times outside the sandbox are dramatically faster. In the case of the app I linked above, it's basically just opening a sqlite database and then listening on a port; this takes almost no time outside the sandbox, and it seems unreasonable that it should take seconds within.

I've also noticed that it does get worse on worse internet connections, I think disproportionately to the decrease in overall network performance (but I'd have to do more careful measurements to be sure).

For reference, here is the discussion on IRC:

https://botbot.me/freenode/sandstorm/2017-08-18/?msg=90006029&page=1

amenonsen commented 7 years ago

I'm seeing the same problem on my new Sandstorm installation. I'd be happy to provide any diagnostic information that could help in understanding it.

kentonv commented 7 years ago

In an ad hoc test I did just now, a new YATA instance starts up in ~1.7 seconds on Oasis and ~0.5 seconds running locally. I can't seem to reproduce the issue @zenhack describes.


That said, it's definitely true that many Sandstorm apps start up pretty slowly. But this isn't because the server is running any slower. Sandstorm's approach to sandboxing has almost zero overhead in terms of server performance.

Typically, the problem is some combination of:

  1. App servers or server frameworks that are not designed with fine-grained containerization in mind, and so historically haven't had any pressure to optimize startup times. Etherpad, for example, is pretty slow to start for no really good reason -- it's just doing a bunch of stuff it shouldn't need to. Rocket.Chat is ridiculously slow, taking some 30-40 seconds before it will respond to a request. The way I'd like to solve this (other than optimizing every app individually) is with a checkpoint-restore approach based on snappy-start. But, that's a major project for which I have not yet had time.
  2. Very large client asset bundles, which have to load fresh for every grain. Currently there's no mechanism by which static assets can be cached and reused across grains, so the client has to reload them every time. Worse, the bundle can't even start downloading until the grain server has started, so this stacks with the first problem. I'd like to develop a mechanism by which apps can specify static assets that are served directly by Sandstorm in a way that can safely be cached across grains and can load in parallel with the grain server. This is a comparatively smaller project than snappy-start on the Sandstorm end, but every app will likely have to be updated to integrate with any such system.
zenhack commented 7 years ago

Quoting Kenton Varda (2017-08-26 18:53:10)

> Typically, the problem is some combination of:

Yeah, that's its own issue -- I referenced YATA because it's a good baseline for removing this kind of overhead from the equation. I did some local measurements of the app's startup time outside of the sandbox:

[isd@rook yata]$ export DB_PATH=my-db.sqlite3
[isd@rook yata]$ cat waitstart.sh
while true; do
        curl http://localhost:8080 > /dev/null 2>/dev/null
        [ "$?" = 0 ] && exit 0
done
[isd@rook yata]$ ./app & time ./waitstart.sh
[1] 2728

real    0m0,184s
user    0m0,095s
sys     0m0,046s
[isd@rook yata]$ fg
./app
^C
[isd@rook yata]$ ./app & time ./waitstart.sh
[1] 2756

real    0m0,022s
user    0m0,009s
sys     0m0,007s
[isd@rook yata]$ fg
./app
^C

The first run has no pre-existing database, so it's a bit slower. But I see the same slowdown on sandstorm regardless of whether it's a first boot or opening an existing grain. At ~20 ms to be ready to serve a page, I shouldn't be able to perceive anything -- it can't be the app.

The above is running locally on my laptop, and the delay running in sandstorm (also on my laptop) is about 1 second (as opposed to 2-3 on my server in the other room). But that's still a 50x slowdown in the sandbox vs. out of it.

I doubt the sandboxing itself is what's causing the problem, but it seems unlikely that everything can be blamed on the apps. I will try to find some time this week to dig in and see what's going on.

kentonv commented 7 years ago

My previous measurements were based on holding a stopwatch, so they included human reaction time.

If I look just at the Chrome devtools network panel, I get a time-to-first-byte (TTFB) of ~230ms for a new YATA grain and ~130ms for an existing grain. The latter seems to be independent of whether the grain was already running.

On Oasis, I'm seeing a TTFB of ~450ms for an existing grain and ~625ms for a new grain (assuming the app is cached on the workers -- pulling from cold storage can add a second or two). About 200-300ms of this is DNS + TCP + TLS for the newly-created subdomain. There are also three other network round trips needed on grain load, and my RTT to Oasis is 60ms, so that's roughly another 180ms. In this case, then, the time is almost entirely explained by network round trips. Conceivably we could find ways to eliminate a round trip or two.

Now I think I have an explanation for your observations on your local server: where is your DNS? My guess is that when you're seeing a multi-second startup time for YATA, it's almost entirely DNS lookup time, and your DNS is remote. I have a local DNS server for my local Sandstorm instance so it's roughly instantaneous for me.

In any case, I think we might be looking at maybe 100ms of Sandstorm bookkeeping overhead (maybe some Mongo queries, etc.), which parallelizes with three network round trips. We could probably reduce either of those numbers a bit with some optimizations. But I don't think this is the real problem with Sandstorm app startup times. If every app started as fast as YATA I think everyone would be very happy. The real problem is the multi-second startups of more bloated apps.

zenhack commented 7 years ago

The DNS issue had occurred to me; setting up a local resolver and comparing is on my todo list. I'm using sandcats for DNS. dig tells me I'm getting response times of < 50 ms, so I'm skeptical, but I'll sit down and test soonish.
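(For a fuller breakdown than dig alone, something like the following curl invocation splits a first request into DNS, TCP, TLS, and time-to-first-byte components, which should show whether the lookup really is the bottleneck. The hostname below is just a placeholder for an actual grain host on the install being tested.)

curl -s -o /dev/null \
    -w 'dns: %{time_namelookup}s  tcp: %{time_connect}s  tls: %{time_appconnect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n' \
    https://grain-host.example.sandcats.io/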

zenhack commented 7 years ago

Okay, yeah, setting up dnsmasq on my machine and having it handle requests for the sandstorm box's domain speeds things up substantially: ~200ms for the local system, around 1 second for the machine in the other room (measured via the Firefox dev tools). The latter still seems longer than it ought to be given that we're talking about wifi to a machine in the next room, but it's at least well within the not-annoying range.
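(For reference, the dnsmasq side of this is tiny. The domain and IP below are placeholders for the actual sandcats domain and the Sandstorm box's LAN address; dnsmasq's address= directive answers for the named domain and all of its subdomains, which is what the per-grain wildcard hosts need.)

# in /etc/dnsmasq.conf -- placeholders: example.sandcats.io / 192.168.1.10
address=/example.sandcats.io/192.168.1.10

# then restart dnsmasq and point the laptop's resolver at it
sudo systemctl restart dnsmasq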

The real motivator for this, though, is my phone on LTE, which takes much longer, even when the signal is good enough that loading times for e.g. zenhack.net are still imperceptible. I can't conveniently set up a custom DNS resolver on my phone to handle things locally, (a) because sandcats DNS is dynamic, so it would break when my IP changed, and (b) because doing that on a phone is just a bit annoying (though I could figure something out if it were critical).

A few seconds on top of sandstorm itself is enough to make me think twice about bothering to open up my phone to jot down a todo item (half the reason I wrote YATA was that simple TODOs were even worse, and it is a big improvement).

It occurs to me that using per-session, per-grain domains is going to defeat DNS caching, at least if we're just responding per-domain. I have heard that there are some significant compatibility problems with wildcard domains, but I don't know how bad they are or how widely they're supported anyway. One thought is to get sandcats to supply a wildcard record to the DNS client.

kentonv commented 7 years ago

AIUI, there's actually no such thing as a "wildcard record" in the DNS protocol. Rather, configuring a wildcard causes the server to respond to all matching requests in the same way. It's entirely up to the server to implement the matching.

There IS such a thing as wildcard TLS certificates, but that's a different matter.

We could "fix" the slow-DNS problem by "pre-allocating" hosts: when you open Sandstorm, it could randomly generate some hostnames client-side and fire off dummy requests to them, to force DNS lookup and even TLS negotiation to complete. Then, upon opening a session, the client could request that the server assign a particular hostname. I think it's fine, security-wise, to allow this -- a client who chooses a non-random hostname would only be hurting themselves.
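(A rough way to see what this would buy, from a shell rather than the browser, so it's only an approximation of the client-side behavior: request the same never-before-used host twice in one curl invocation. The first transfer pays for DNS + TCP + TLS; the second reuses them, so the difference is roughly the setup cost that pre-warming would move off the critical path. The hostname is a made-up random subdomain standing in for one the client would generate, and the response itself will just be an error page, which doesn't matter for the timing.)

# made-up pre-generated hostname; the real client would pick one at random
HOST=https://x7k2qpdmzrw.example.sandcats.io/

# two requests in one invocation: curl reuses the DNS result and the TLS
# connection for the second transfer, so its dns/tls times come out near zero
curl -s -o /dev/null -o /dev/null \
    -w 'dns: %{time_namelookup}s  tls: %{time_appconnect}s  total: %{time_total}s\n' \
    "$HOST" "$HOST"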