optimize app engine frontend instance count

snarfed commented 8 years ago

we average ~2 app engine frontend instances right now, and they cost money (at least above the first), so i've been experimenting with different tuning knobs to see if i can get that average down a bit. this isn't high priority, and i'm also not sure i can have much impact without making drastic changes, so this is mostly just for tracking.

directly related:

90 figure out why requests aren't running concurrently (enough)
110 optimize poll task

loosely related:

8 find an alternative to the resident backend for twitter streaming
53 get backend memory usage below cap
145 cache in instance memory as well as memcache
300 try dedicated memcache

here are a few things i've tried recently, and the results. tl;dr: none of them had a meaningful effect on instance count.

12/17 12:20pm: 30m fast poll frequency, poll concurrency 2
12/18 7:21am: 15m fast poll frequency, concurrency 1
12/18 10:40am: 15m fast poll frequency, poll concurrency 10

snarfed commented 8 years ago

the really annoying part is that we don't actually need more than one instance, but app engine's scheduler doesn't handle us well and thinks we do, so it usually keeps two or three around. from https://console.developers.google.com/appengine/instances?project=brid-gy&moduleId=default&versionId=7&duration=P1D :

this is the norm. one instance serves almost all of the requests; the remaining instance(s) serve almost none.

details on how app engine's request scheduler/auto scaler works in the links below. i've tried lots of different values (including default) for Max Idle Instances and Min Pending Latency, different instance classes, and other settings, but haven't managed to change this behavior. :/

basic and manual scaling are tempting, but only get 8 free instance hours per day, as opposed to automatic scaling's 28, so they're right out.

@kylewm and i talked a bit on IRC just now on profiling. our request volume is mostly poll and propagate (~87% and ~9% each, respectively), both of which i expect spend most of their time blocked on network I/O. good idea to check, though. app stats is great for that; we can turn it on here.

the poll task queue is currently configured to run up to 2 at a time, propagate up to 1 at a time, and we have threadsafe: yes in our app.yaml to allow instances to serve multiple requests concurrently. last time i checked with bridgy's instance info recording, this was indeed happening.

snarfed commented 8 years ago

for reference, here's bridgy's lifetime cost history. it breaks down to roughly 2/3 frontend instances, 1/3 datastore ops (reads/writes).

costs

snarfed commented 8 years ago

tried another experiment: removed max_concurrent_requests: 4 from the poll queue and let it run as many as once as it wanted. part of the motivation was to let backlogs clear faster and avoid telling users "next poll in 1m" for long periods of time, but also a (weak) hypothesis was that running more polls simultaneously would better utilize instances.

in practice, it didn't clearly help or hurt. i flipped the switch feb 10 ~11am. the main difference i see is that our active instance count was more jagged and variable because we weren't throttling polls (and thus load).

i also don't know why feb 9 evening averaged so low, but afaik it wasn't due to anything we did.

snarfed commented 8 years ago

oh, right, polls were unusually slow feb 9 evening, i suspect due to one of the silo APIs having issues and being slow itself. not sure why that would drop our average instance count, but it is the one clear correlation.

snarfed commented 8 years ago

i started experimenting with this again yesterday. coincidentally, poll latency also spiked a bit before i started experimenting, and has stayed pretty high.

the only change i made that could have caused that was deploying #620 and snarfed/granary@74aed5c7025d75e93dd59b959537023f50cae6be, both of which really seem unrelated. maybe i accidentally deployed a different version of some libraries in my virtualenv? (although deploy.sh should have caught that?) maybe some bad new interaction in app engine 1.9.33? i don't know. :(

snarfed commented 8 years ago

anecdotally, it seems like the cause might be slower HTTP fetches during OPD. i looked at a few poll logs and saw a number of >10s requests when fetching different people's web pages. :(

snarfed commented 8 years ago

here are the experiment results, even if they're tainted by the high latency.

all

the magic single-instance period starting at 10am was the first time with min_idle_instances=0, max_idle_instances=1, but the poll queue was caught up when that started, which i suspect mattered. otherwise, unclear.

snarfed commented 8 years ago

latency is slowly maybe getting back to normal.

snarfed commented 8 years ago

well that's a pleasant surprise.

instances

snarfed commented 8 years ago

definitely promising. looks like we may be able to hold sustain ~1.5 instances on average. we'll see.

snarfed commented 8 years ago

seeing substantial poll latency spikes again. anecdotally, it looks like many of the long requests are doing OPD refetch. filed #624 to optimize.

kylewm commented 8 years ago

huh, there's weird big gaps between RPCs. Not sure if that's the process being asleep, or if it's actually doing something processor intensive in that time.

pollling-profile

here's another one

another-profile

snarfed commented 8 years ago

whoa, those are big gaps indeed.15-25s?! i mean, I'm sure some of our compute is inefficient, but not that inefficient.

i guess we could try to repro it in dev_appserver, refetching a local web server, and comment out parts of the code in the gap and binary search down until we isolate it. maybe tedious, but could work.

snarfed commented 8 years ago

maybe it's context switching to other requests in other threads? hard to believe though. i'd think it should be doing that during the API calls, and i'd also hope no self respecting scheduler would starve a thread for 20s at a time.

maybe worth searching/asking on SO or https://groups.google.com/forum/m/#!forum/google-appengine ?

kylewm commented 8 years ago

I found one where the spacing is much more regular screenshot from 2016-02-26 07-55-32

After adding some more debug output, I'm finding that mf2 parsing regularly takes longer than the fetch itself

D 2016-02-26 16:24:34.896 fetching post permalink https://cmal.info/2016/online-legal-publishers-squabble-over-the-right-to-copyright-the
D 2016-02-26 16:24:35.248 done fetching. now parsing post for mf2 https://cmal.info/2016/online-legal-publishers-squabble-over-the-right-to-copyright-the
D 2016-02-26 16:24:36.749 done parsing https://cmal.info/2016/online-legal-publishers-squabble-over-the-right-to-copyright-the

That's 350ms for fetching, 1500ms for parsing!

snarfed commented 8 years ago

hoooooly crap, really?!? that's...not OK.

also profiling++ obviously. thanks for doing this!

i wonder if switching HTML parsers would help? i guess it's time to cpu profile mf2py.

kylewm commented 8 years ago

Just doing some basic dumb testing now locally. Lxml doesn't seem significantly faster :(

On February 26, 2016 8:56:26 AM PST, Ryan Barrett notifications@github.com wrote:

hoooooly crap, really?!? that's...not OK.

also profiling++ obviously. thanks for doing this!

i wonder if switching HTML parsers would help? i guess it's time to cpu profile mf2py.

Reply to this email directly or view it on GitHub: https://github.com/snarfed/bridgy/issues/578#issuecomment-189369978

Sent from my Android device with K-9 Mail. Please excuse my brevity.

kylewm commented 8 years ago

Huh, actually html5lib uses lxml if lxml is installed. And it's a hell of a lot faster with it installed (but still slow). Parsing my homepage and all permalinks from there:

total time in requests.get: 0.322495
total time in BeautifulSoup ctor: 0.999292
total time in mf2py.parse: 2.284201

kylewm commented 8 years ago

I'm at a loss for speeding up mf2. And making the fetches asynchronous is moot when so much more time is spent processing the result. I'll continue this on #624

snarfed commented 8 years ago

whoa, they added a "Billed instances estimate" (green) to the instances graph, and it's noticeably below the total instance count (blue), which is what i thought they were charging for. nice!

snarfed commented 8 years ago

...hmm. oddly the billed instances estimate is sometimes higher than the total instance count. confusing. 🤔

snarfed commented 8 years ago

we get Twitter favorites by fetching and scraping html for each tweet, in serial. i thought parallelizing these with async utlfetch could be a big win, but i looked at ~10 Twitter polls anecdotally, and most didn't fetch any favorites HTML pages, and the rest only fetched one. looks like our caching logic is working. yay. measure first then optimize, and all that!

snarfed commented 8 years ago

we've been back on min_idle_instances: 0, max_idle_instances: 1 for a few days now, and it definitely does cut down instance count, but anecdotally it also hurts interactive latency pretty badly. :/ (cc @tantek @aaronpk.)

as an experiment, i dropped max_idle_instances: 1 and kept min_idle_instances: 0 to see if that helps latency while still keeping some of the instance count improvement. we'll see.

snarfed commented 8 years ago

ok so https://cloud.google.com/trace/docs/analysis-reports is pretty great. aggregated latency reports for any subset of URL, HTTP method, etc.

here are some comparing default auto scaling to min_indle_instances: 0. default is better latency, as suspected, but it doesn't look like a huge difference. maybe i'm reading the graphs wrong? not sure.

/ (home page):

/twitter/* (user pages):

/like/* (mf2 handlers):

poll and propagate are also interesting but less meaningful. we only really care about fine grained latency for interactive pages and mf2 handlers.

poll:

propagate:

snarfed commented 8 years ago

interesting data point: anecdotally, when i've glanced at the poll queue over the last week or so, it's often idle or only running one poll. i think it used to consistently be running 4-8 at any given time. so, that implies #624 is largely done.

...however, with the default auto-scaling options, we still hang out at two instances most of the time. so, either app engine's auto scaling just still doesn't handle our workload well (very possible), or there's a lot of load we're missing.

here's some back of the envelope math: we avg .3qps right now, ie ~26k requests per day. over the last 24h, we've had ~16k polls, ~2k propagates, and a handful of front page, retry, publish, propagate-blogpost, etc...but nowhere near the remaining 8k. i'm going to look at logs now to see what those were. they might be bots, in which case i can add user pages to robots.txt...

snarfed commented 8 years ago

look at the last week, the remaining 8k qpd include:

1200 WordPress (every fetch is repeated 3x :( )
624 http://airbornesurfer.com crawling a bunch of his twitter and FB mf2 handlers. :/
525 Domain Re-Animator Bot (and ignores robots.txt. will blacklist.)
500 static
400 Known (conveniently exactly 1/3 of WordPress :P)
304 X-ray
82 Bridgy (what the?)
60 Googlebot (of which 2/3 are the home page)
53 Yahoo
40 bingbot

searching for all bots (ie \*[Bb]ot\*) shows ~930 qpd, which is totally reasonable (if very inexact).

kylewm commented 8 years ago

hmm, I know airbornesurfer IRL, sort of doubt he knows that's happening (he should just be on the vanilla wordpress plugin).

kylewm commented 8 years ago

1200 WordPress (every fetch is repeated 3x :( )

micro-caching might help?

snarfed commented 8 years ago

yeah i guess that'd be #15. (that issue number tho! :P)

also i pinged airbornesurfer: https://twitter.com/schnarfed/status/709519485795840001

snarfed commented 8 years ago

more importantly, 1200 qpd is <1qpm, and those requests average 1-2s, so i don't think that's really our problem. same with all the other requests in that breakdown.

i think app engine's auto scaling just really doesn't like our workload. if we can convince ourselves that min_idle_instances: 0 doesn't really hurt latency too much - which i'm not at all sure of - i vote to leave that on and call it a day.

snarfed commented 8 years ago

go home billed instance estimate ur drunk

snarfed commented 8 years ago

so after tons of work and soul searching and waffling, i think all the optimization here was nice and appreciated, but really the one thing that did the trick all by itself (and didn't drive our latency through the roof) was max_idle_instances: 1. whee. :P

tentatively closing for now.

snarfed commented 8 years ago

incidentally, wow, billed instance estimate really does just get totally drunk sometimes. no clue why. funny.

snarfed commented 8 years ago

weird new instance scaling pattern recently. we had a good month or two where our instance count mostly hung out around 1, but now it's jumping up to 2 for periods of ~12h (and right now >24h). i haven't found that it corresponds to anything yet (request volume, cron jobs, backups, etc).

for reference, we got billed for 38 instance hrs on 5/6, 39 on 5/5, and 40 on 5/4, ie much closer to the blue line than the yellow Billed instance estimate. boo.

snarfed commented 8 years ago

the weird pattern continues. silly. at least they're billing us for close to the estimate (yellow) instead of the total (blue).

snarfed commented 7 years ago

weird pattern is gone, more or less.

snarfed commented 6 years ago

the granary/o-d python 3 migration bumped up our instance count for some reason, not sure why except that maybe python-future adds meaningfully more overhead?

this moves task queue and cron processing into a separate 'background' service. this decouples quick, light user-facing requests (front page, user pages, etc) from heavier background tasks (poll, propagate, etc), which app engine's scheduler understands and handles better. it now runs the frontend 'default' service on ~1.5 instances avg, and lets me control the 'background' service's instances more directly with manual or basic scaling.

(i'd include instance count graphs, but they're buggy right now and the console isn't showing them. :/ )

background: https://cloud.google.com/appengine/docs/standard/python/an-overview-of-app-engine https://cloud.google.com/appengine/docs/standard/python/modules/converting https://cloud.google.com/appengine/docs/standard/python/config/appref#basic_scaling https://cloud.google.com/appengine/docs/standard/python/config/appref#manual_scaling

snarfed commented 6 years ago

now that facebook is dead (#826 #817), i tried an experiment to see if a single frontend instance could serve both interactive requests and poll/propagate tasks. i disabled the dedicated background service and moved task queues back to targeting the frontend (default) service.

turns out...no. even if a single frontend instance may theoretically be enough to handle the load, app engine's scheduler still just doesn't like serving two very differently shaped workloads from the same service. it served on 3-4 instances on average, and billed us for 2-3.

i flipped the switch 8/16 1pm PST. background service requests: