mozilla / ensemble

The platform that powers the Firefox Public Data Report :violin: :trumpet: :musical_keyboard:
https://data.firefox.com/
Mozilla Public License 2.0

Load testing #63

Closed. openjck closed this issue 5 years ago.

openjck commented 6 years ago

We can generate some dummy data and run it through the whole pipeline to simulate real network activity. That is, we can host a file with fake numbers (but a similar file size) on Mozilla's data servers, have that go through ensemble-transposer, and ultimately have it rendered by ensemble.
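
For illustration, generating a fake file with random numbers but a realistic size could be as simple as the sketch below. The structure and field names here are made up; the real schema is whatever the data engineers' weekly export uses.

```python
# Hypothetical sketch: field names and structure are invented for illustration;
# the real schema is defined by the data engineers' weekly export.
import json
import random
from datetime import date, timedelta

weeks = [date(2017, 1, 2) + timedelta(weeks=i) for i in range(52)]
payload = {
    "populations": {
        channel: [
            {"date": str(week), "value": round(random.uniform(0, 1), 6)}
            for week in weeks
        ]
        for channel in ("release", "beta", "nightly", "esr")
    }
}

# Scale the number of weeks/populations until the file size roughly matches the real data.
with open("fake-user-activity.json", "w") as fp:
    json.dump(payload, fp)
print(f"wrote {len(json.dumps(payload)) / 1024:.1f} KB")
```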

pdehaan commented 6 years ago

Just to get my brain started on this, do we have any ideas about the type of load testing we want to do? Is this just seeing how many concurrent users can view the dashboard? Or were you thinking of having 10-3000 computers/processes generate dummy data to push through the ensemble-transposer pipeline for more backend load testing? Load testing will also be a lot easier if we can host a server somewhere (w/ specs similar to the production environment) that isn't behind an LDAP fence, so I don't need to put my LDAP creds in a config file somewhere.

openjck commented 6 years ago

Good news! We have a demo site up right now, using the same infrastructure that will ultimately be used for the final site, and it's not behind LDAP.

https://moz-ensemble.herokuapp.com/

This site uses fake data, but that fake data goes through the whole pipeline: the fake data is hosted on S3, it is fetched and processed by ensemble-transposer, and the output of ensemble-transposer is ultimately fetched and displayed by ensemble.

So in other words, the only real difference between this setup and the final setup is that ensemble-transposer is fetching fake data from S3. To publish the real data, all we really have to do is point ensemble-transposer to the real data.

So we could in theory start load testing the site now. We might continue to make performance improvements in response to what you find. If we do, it would be great to load test the site again shortly before the final launch.

We will have to talk to Rob Miller about expected traffic. But in general, it would be good to know how many concurrent users the site can handle without slowing down to a crawl. We also want to be sure that our Redis server can handle the load, that we don't rack up too many expenses with Heroku, and so on. Finally, if it's possible, it would be interesting to know how load times compare in different geographic regions. We're not currently using a CDN but we could if there's a compelling reason to do so.

Both ensemble and ensemble-transposer are currently using one web dyno with Heroku. We could increase that based on our expected audience or enable auto-scaling. Auto-scaling comes with a fee, however.

pdehaan commented 6 years ago

Cool. Were you more interested in seeing: (a) how many concurrent users the front end can handle (basically using Selenium or something to load the homepage at scale), or (b) how much fake data we can push through to S3 to see how well the backend transposer+S3+Redis pipeline performs?

openjck commented 6 years ago

Some combination of those two.

I'm not sure if I understand what you mean by pushing data through to S3. The data engineers have a process which automatically uploads new data directly to S3 once per week. The data is then pulled from there.

For example:

User ---visits---> User Activity dashboard ---downloads---> processed user-activity JSON from ensemble-transposer ---downloads---> raw user-activity JSON from S3

or

User ---visits---> Usage Behavior dashboard ---downloads---> processed usage-behavior JSON from ensemble-transposer ---downloads---> raw usage-behavior JSON from S3

ensemble-transposer is needed because it adds metadata to the raw JSON from S3. Does that make sense?

openjck commented 6 years ago

So in other words, any user who loads a dashboard page (User Activity, Usage Behavior, or Hardware) is also loading data from ensemble-transposer and ultimately from S3 under the hood. (Although ensemble-transposer caches the data it gets from S3 for 24 hours.)
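
To make that flow concrete, here is a rough sketch of what ensemble-transposer does conceptually. This is not the actual transposer code; the S3 locations and metadata shape are placeholders.

```python
# Conceptual sketch only -- not the actual ensemble-transposer implementation.
# It fetches raw JSON from S3, adds metadata, and caches the result so S3 is
# hit at most once per TTL (the real service caches for 24 hours).
import time
import requests

CACHE_TTL = 24 * 60 * 60  # 24 hours
_cache = {}  # {dashboard: (fetched_at, processed_payload)}

# Placeholder S3 locations; the real bucket and keys are not public.
RAW_URLS = {
    "user-activity": "https://s3.example.com/ensemble/user-activity.json",
    "usage-behavior": "https://s3.example.com/ensemble/usage-behavior.json",
    "hardware": "https://s3.example.com/ensemble/hardware.json",
}

def get_processed(dashboard):
    """Return processed JSON for a dashboard, refetching from S3 only when the cache expires."""
    now = time.time()
    hit = _cache.get(dashboard)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]
    raw = requests.get(RAW_URLS[dashboard], timeout=30).json()
    processed = {"metadata": {"dashboard": dashboard, "fetchedAt": now}, "data": raw}
    _cache[dashboard] = (now, processed)
    return processed
```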

So we should load test all of the dashboard pages. I guess it wouldn't hurt to also load test the homepage.

rafrombrc commented 6 years ago

@pdehaan Thanks for looking at this. We're interested in how the site performs with a large number of users trying to connect concurrently. We don't need to do any testing of the back-end infra, except to ensure that it performs adequately when the front end site is under load.

pdehaan commented 6 years ago

Awesome. I have time this week to take a stab at this. But first, a few more questions:

  1. Are we planning on hosting this on Heroku and not AWS? (Not sure if we can get datadog working w/ Heroku, so our monitoring may be somewhat limited).
  2. Is there an assigned OPs person, or are you self-managing this?
  3. Are you planning on auto-scaling using Dynos? Or is that disabled and we're only using a single dyno? (For load testing, turning off auto-scaling is usually preferred so we know how much a single host can handle before tipping over).
  4. Define "a large number of users". Are we expecting dozens of concurrent users? Millions? Not sure if you have a target we should aim for, or if we're just trying to get the general capacity of a single Heroku Dyno instance.
  5. Is https://moz-ensemble.herokuapp.com/dashboard/user-activity the latest code? (It isn't behind LDAP, but it also seems to be broken at the moment.) It seems to be back now, w/o background images or formatting, so maybe I'm just viewing it mid-deploy. Although it does have somewhat of a lorem ipsum feel to it.
  6. Can we add the typical OPs routes of https://moz-ensemble.herokuapp.com/__version__? It helps us know what SHA is currently deployed so we know if we're testing a stale build or not. Currently I don't know how fresh the code is. Not sure if we want /__heartbeat__ and /__lbheartbeat__ endpoints too, per https://mana.mozilla.org/wiki/display/SVCOPS/New+Service+Guide#NewServiceGuide-StandardEndpoints
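
For reference, those endpoints are tiny. A hypothetical sketch of their general shape (not ensemble code, just the pattern described in that guide) looks like:

```python
# Hypothetical sketch of the standard endpoints, not taken from ensemble itself.
import json
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/__version__")
def version():
    # version.json is typically written at build/deploy time and includes the
    # git SHA, which is what lets load tests confirm they're hitting a fresh build.
    with open("version.json") as fp:
        return jsonify(json.load(fp))

@app.route("/__lbheartbeat__")
def lbheartbeat():
    # Load-balancer check: the process is up and accepting connections.
    return "", 200

@app.route("/__heartbeat__")
def heartbeat():
    # Deeper check: could also verify that backing services (e.g. ensemble-transposer) respond.
    return jsonify({"status": "ok"})
```
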
openjck commented 6 years ago

I spoke with Rob about this and we have some answers:

  1. For now we are planning to use Heroku. I think push-dev-dashboard used both Heroku and datadog.
  2. For now we are self-managing this.
  3. We may enable auto-scaling when we ultimately launch the site. For now, let's load test one dyno.
  4. Unfortunately the range of possibilities here is quite large. The site will probably be interesting to geeks like us: reddit, Hacker News, etc. It's possible that this information will be interesting to a wider audience and the site will be picked up by more mainstream outlets, but we don't know for sure. If it makes sense to you, we can start load testing one dyno, get some rough numbers for that, and then set up the autoscaling range accordingly.
  5. I think you caught it mid-deploy. The site is very Lorem Ipsum at the moment, but that's just because we want to keep the data private for now. We're sourcing two JSON files with fake data, but everything else (the code itself, the infrastructure, etc.) is identical to what we will use in production.
  6. Yes. I opened #87 for that. Do you also want these on ensemble-transposer?
openjck commented 6 years ago

If we can get by without using autoscaling, that would be ideal. It's 10x more expensive than Standard 1X dynos, which we are using now. Ouch.

openjck commented 6 years ago

@pdehaan, in your experience, is Heroku scaling generally linear? That is, can 10 dynos generally handle 10x the capacity of 1 dyno?

If not, after we load test 1 dyno, I can try manually enabling a different number of dynos, say 10, so that we can get a sense of what that capacity would be.

pdehaan commented 6 years ago

> @pdehaan, in your experience, is Heroku scaling generally linear? That is, can 10 dynos generally handle 10x the capacity of 1 dyno?
>
> If not, after we load test 1 dyno, I can try manually enabling a different number of dynos, say 10, so that we can get a sense of what that capacity would be.

I've only used Heroku for small, personal projects which never needed scaling (mostly just scrapers and Twitter bots). And in QA we generally use OPs-supported AWS machines, so my experience is somewhat limited. Not sure how efficient it'd be to load test autoscaling w/ Heroku, unless we have lots of options to control when and how it should scale. If Heroku handles that logic for you, the load test results probably wouldn't be too useful, and we'd only really be testing your credit card's throughput at that point.

openjck commented 6 years ago

Heh.

We're probably not going to enable autoscaling at all because it's too costly. So it might be good to get a sense of how much traffic 10 dynos, for example, can handle. Then we would have the option of manually enabling that many when needed.

openjck commented 6 years ago

To be clear, it would be good to also separately test how much traffic 1 dyno can handle. That'll give us a range of dynos that we can manually enable.

openjck commented 6 years ago

I think we're ready for this if you are. NB: We'll want to test the Heroku site, not the metrics.mozilla.com site. The Heroku site uses the same infrastructure we'll use in production, even though it currently shows fake data.

pdehaan commented 6 years ago

Yeah, I've been writing a couple of Selenium tests locally which we're hoping to run at scale with our load testing tool. Most of the load testing work we've done has been around API testing, so the Selenium parts are a bit newer and may have a few kinks left to work out.

I was going to say that https://github.com/mozilla/ensemble/issues/102 is somewhat of a blocker for us, since it may be hard to tell if the page is still functioning when we throw oodles of traffic at it, or if the page will always return a status code of 200 and our tools will never know if the server is starting to crack and fail.

Although, I still strongly suggest that you guys consider getting OPs involved and have them host this on our battle-proven AWS stacks, and then they can handle the auto-scaling and notifications and everything else that OPs excels at. Self-managing your own stuff can get very risky, and having to be on call when something stops working at 2am on a Sunday is never fun. And reinventing scaling and performance alerts probably isn't a fun task either.

openjck commented 6 years ago

> I was going to say that #102 is somewhat of a blocker for us, since it may be hard to tell if the page is still functioning when we throw oodles of traffic at it, or if the page will always return a status code of 200 and our tools will never know if the server is starting to crack and fail.

Individual pages won't 404, but the application itself could 404 under heavy load. As a single-page app, it's all or nothing. Either the whole application will 200 or none of it will. That's one of the things I don't love about create-react-app; using a server-side rendering framework like next.js would probably address that.

> Although, I still strongly suggest that you guys consider getting OPs involved and have them host this on our battle-proven AWS stacks, and then they can handle the auto-scaling and notifications and everything else that OPs excels at. Self-managing your own stuff can get very risky, and having to be on call when something stops working at 2am on a Sunday is never fun. And reinventing scaling and performance alerts probably isn't a fun task either.

Good to know. I've opened #108 for that. I doubt we'll be able to do that before launch, but I think it's a good goal nonetheless.

openjck commented 6 years ago

Hi Peter. Has this started yet? I saw a lot of hits on Google Analytics but maybe that was just a result of me testing the site.

pdehaan commented 6 years ago

@openjck I was starting some load testing last week, but running into some unexpected errors (https://github.com/mozilla/ensemble/issues/153#issuecomment-386817077).

I'm not seeing the 404s currently as of https://github.com/mozilla/ensemble/commit/11d278e074ff8780c9fb09783bcfabd85b8ebe0a on the public Heroku site, so I'll retry my load tests and see if I see the errors again.

pdehaan commented 6 years ago

Here's what I see when load testing (followed by errors because it can't find/wait for the #dashboard-title element).

[Screenshot: "404 Not Found" error on the Heroku site]

I can also reproduce this w/ Firefox 60 Beta and a clean profile.

[Screenshot: the same "404 Not Found" error reproduced in Firefox 60 Beta with a clean profile]

Oddly, I can only seem to get that error when going to that example dashboard page directly, and not if I go to the homepage and then click the "Example Dash 1" link in the header. So I may be able to work around the error by clicking header links and having a slight delay in the test as it redirects.

The only ideas I can think of are:

  1. Something is broken due to SSR.
  2. Something is broken due to dummy data.
openjck commented 6 years ago

Something is definitely not working as expected. We didn't land SSR so it's not that. I think I know what's causing it. Lemme try to fix it...

openjck commented 6 years ago

Fixed. I'm sorry about that. It was an oversight on my part.

Unfortunately, we have to show the nginx 404 page instead of our prettier version. I explain why in this commit message, but basically, using our pretty 404 page on production risks making some errors invisible. In fact, if it hadn't been for that commit, that URL would have returned a 404 despite rendering properly. That would be weird and would mask important problems. I don't know a way around this without switching to SSR, unfortunately.

pdehaan commented 6 years ago

Awesome, thanks. I'll take another poke at some load testing tomorrow and see if I'm still getting unexpected 404s on a new profile and report back.

openjck commented 6 years ago

Hey Peter. Any luck with this so far?

pdehaan commented 6 years ago

@openjck Yeah, @rpappalax and I were hacking on this for a few hours today (2-6:30pm) and made some progress.

Not sure how to interpret the results yet. Does Heroku give you any perf data (memory/CPU usage)? We're flying a bit blind.

But based on our local testing w/ molotov and molosonic we were able to get 3.1-4.7 requests per second (RPS) on my laptop for the homepage. "Example Dashboard 1" was worse at 1.6-3.3 RPS, and "Example Dashboard 2" was worse still at 0.6-1.4 RPS.

The tests are at https://github.com/pdehaan/ensemble-loadtests and our early results are in Google docs. Increasing processes and workers helps to a point, but eventually we're bound by CPU on my machine. Each of the tests basically just loads the page, waits for the JSON payloads to land, and waits for an element on the page to be visible (either the introduction text on the homepage, or #dashboard-title on the two dashboard pages).
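
Stripped of the molotov/molosonic plumbing, the per-page check is roughly the Selenium sketch below (simplified; the real tests live in the ensemble-loadtests repo).

```python
# Simplified sketch of the per-page check described above; the real tests are
# in the (private) ensemble-loadtests repo and are driven at scale by molotov/molosonic.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("https://moz-ensemble.herokuapp.com/dashboard/user-activity")
    # Wait for the dashboard title to become visible, i.e. the JSON payload
    # has landed and the page has actually rendered.
    WebDriverWait(driver, 30).until(
        EC.visibility_of_element_located((By.ID, "dashboard-title"))
    )
finally:
    driver.quit()
```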

openjck commented 6 years ago

Apologies for getting behind on this. Heroku does provide some performance data. I added you to the app and will email you a link to their metrics.

openjck commented 6 years ago

Also, I can't see the ensemble-loadtests repo. I may need to be granted read access.

pdehaan commented 6 years ago

Some more promising results from load testing. We threw load at the production JSON transposer endpoints and then did some manual testing of the site (while the transposer was under load). We were finding that the performance "edge" was somewhere around 70 RPS before the transposer server reached a singularity...

I'll keep poking it with a stick to see if I can push more errors. But if you have some logging on the Heroku Transposer instance, it may be interesting to see what the CPU/Memory usage was like.

I also added you, @openjck, as a collaborator on the ensemble-loadtests repo (which is private to match this repo), and I'll push up my transposer test changes to the repo shortly.
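
For context, the scenario in loadtest_json.py boils down to something like the following. This is an approximate shape only, since the repo is private; the transposer hostname and endpoint path below are placeholders. molotov then drives the scenario with the invocations shown below, where --workers and --processes control concurrency.

```python
# loadtest_json.py (approximate shape; the real file lives in the private
# ensemble-loadtests repo, and the transposer hostname here is a placeholder).
import os
from molotov import scenario

BASE_URL = os.environ.get("TRANSPOSER_URL", "https://transposer.example.herokuapp.com")

@scenario(weight=100)
async def fetch_hardware_json(session):
    # Hit one of the processed JSON endpoints and make sure it returns a payload.
    async with session.get(BASE_URL + "/hardware/") as resp:
        assert resp.status == 200
        body = await resp.read()
        assert len(body) > 0
```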

$ molotov --duration 90 loadtest_json.py
**** Molotov v1.6. Happy breaking! ****
Preparing 1 worker...
OK
SUCCESSES: 264 | FAILURES: 0 | WORKERS: 1

2.933333333 RPS (1 worker)

---

$ molotov --duration 180 --workers 10 loadtest_json.py
**** Molotov v1.6. Happy breaking! ****
Preparing 10 workers...
OK
SUCCESSES: 5329 | FAILURES: 0 | WORKERS: 10

29.605555556 RPS (10 workers)

---

$ molotov --duration 90 --workers 25 loadtest_json.py
**** Molotov v1.6. Happy breaking! ****
Preparing 25 workers...
OK
SUCCESSES: 5443 | FAILURES: 0 | WORKERS: 25

60.477777778 RPS (25 workers)

---

$ molotov --duration 60 --workers 50 loadtest_json.py
**** Molotov v1.6. Happy breaking! ****
Preparing 50 workers...
OK
SUCCESSES: 4299 | FAILURES: 0 | WORKERS: 50

71.65 RPS (50 workers)

---

$ molotov --duration 60 --workers 100 loadtest_json.py
**** Molotov v1.6. Happy breaking! ****
Preparing 100 workers...
OK
SUCCESSES: 4497 | FAILURES: 1 | WORKERS: 100

74.95 RPS (100 workers)
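
For anyone following along, the RPS figures above are just SUCCESSES divided by the test duration in seconds:

```python
# RPS = successes / duration
print(4497 / 60)    # 74.95 RPS for the 100-worker, 60-second run
print(5329 / 180)   # ~29.61 RPS for the 10-worker, 180-second run
```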

TODO:

  1. TRY 50 WORKERS FOR 10 minutes and watch for errors.
  2. TRY 150 WORKERS for 10 minutes and watch for errors.
openjck commented 6 years ago

This is great. Thank you so much for this report. I have a few questions. I'm new to load testing, so these are out of genuine curiosity rather than judgement. I trust you're doing the right thing.

> We threw load at the production JSON transposer endpoints and then did some manual testing of the site (while the transposer was under load).

Great! I'm glad you tested the site itself. We want to be sure that holds up.

When you say the transposer was under load, do you mean you directly hit both ensemble and ensemble-transposer simultaneously? Is there an advantage of hitting ensemble-transposer directly when ensemble hits it indirectly?

> We were finding that the performance "edge" was somewhere around 70 RPS before the transposer server reached a singularity...

Can you define singularity? I assume you mean 70 requests per second before ensemble-transposer is unable to respond to requests in a reasonable amount of time. Is that correct?

> I'll keep poking it with a stick to see if I can push more errors. But if you have some logging on the Heroku Transposer instance, it may be interesting to see what the CPU/Memory usage was like.

Great! I'll take a look and make sure you have access to see this, too.

> TODO:
>
> 1. TRY 50 WORKERS FOR 10 minutes and watch for errors.
> 2. TRY 150 WORKERS for 10 minutes and watch for errors.

In this context, is a worker a client that's hitting the site? What can be learned by adding more?

Finally, once we finish with this round of tests, we'd love to scale up the Heroku dynos on both services and see if load scales linearly. For example, if we have 10 dynos do we get 700 RPS from ensemble-transposer? That'll help us know how much to scale up if we ever need to.

openjck commented 6 years ago

I opened https://bugzilla.mozilla.org/show_bug.cgi?id=1469026 to make the metrics visible to you and Richard.

pdehaan commented 6 years ago

> When you say the transposer was under load, do you mean you directly hit both ensemble and ensemble-transposer simultaneously? Is there an advantage of hitting ensemble-transposer directly when ensemble hits it indirectly?

Yeah, so in running $ molotov --duration 180 --workers 250 loadtest_json.py, we were trying to put as much traffic/stress on the transposer endpoint as we could, and then testing the front-end performance of the site, to see if performance was degraded in any way. Since this is a 2-step React solution, the page itself loaded quickly, and then we just had to watch the page's preloader for 4-10 seconds. I think given the general size of the JSON payload, it's realistic to say that we'll never have great site performance since the JSON blobs are ~195KB-709KB (and will only grow, unless we clip them at X weeks).

I don't know that we had great reasons for this approach, apart from it being much easier than trying to wait for full page loads w/ Selenium and using molosonic, which was trying to launch 20 browsers on my laptop. And moving those tests into The Cloud(tm) generally involves Dockerhub, AWS, Ardere, and lots of blood and tears. So this was more of a cheap way of visually validating site performance as an end user, while the system was under a general "high" load. Although without looking into the transposer server metrics on Heroku, we were a bit blind. But I'll check again and see if you gave me access to everything, or just the stage front-end logging on Heroku.

> GET /user-activity/ HTTP/1.1
> User-Agent: HTTPie/0.9.6
> HTTP/1.1 200 OK
> Content-Length: 200203 (195 KB)
> Date: Fri, 15 Jun 2018 19:17:54 GMT
> GET /usage-behavior/ HTTP/1.1
> User-Agent: HTTPie/0.9.6
> HTTP/1.1 200 OK
> Content-Length: 726973 (709 KB)
> Date: Fri, 15 Jun 2018 19:18:18 GMT
> GET /hardware/ HTTP/1.1
> User-Agent: HTTPie/0.9.6
> HTTP/1.1 200 OK
> Content-Length: 301786 (294 KB)
> Date: Fri, 15 Jun 2018 19:18:48 GMT
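
Those sizes can be re-checked with a few lines of Python; the transposer hostname here is a placeholder, since the real one isn't spelled out in this thread. (Sizes will drift as the data grows.)

```python
# Re-check the payload sizes above; the transposer host is a placeholder.
import requests

BASE_URL = "https://transposer.example.herokuapp.com"
for name in ("user-activity", "usage-behavior", "hardware"):
    resp = requests.get(f"{BASE_URL}/{name}/", timeout=30)
    print(name, resp.status_code, f"{len(resp.content) / 1024:.0f} KB")
```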

> Can you define singularity? I assume you mean 70 requests per second before ensemble-transposer is unable to respond to requests in a reasonable amount of time. Is that correct?

Oh, it was just some seemingly artificial ceiling that we assumed we were hitting, since increasing threads and workers wasn't scaling equally with the RPS that we were seeing. So we assumed that ~80 RPS for the /hardware endpoint was as good as we'd probably get (and we're either hitting a server limit, or a Marriott basement wifi limit, or Peter's 5-year-old laptop limit). But we only ever saw at most 1 failure per short load test, and response times rarely exceeded 10s.

> In this context, is a worker a client that's hitting the site? What can be learned by adding more?

Sorry, the TODO comments were for me, and I have TODONE them. A worker is generally just one agent (a single browser instance), so running a load test with 1 worker/agent and 1 process would basically be like using a single laptop. Increasing workers and processes launches more threads on the machine and simulates multiple people concurrently hitting the server. We've been gradually increasing those numbers, putting the results in spreadsheets and word docs, and tracking when throughput stops increasing as expected.
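
So the effective concurrency of a run is workers multiplied by processes, which matches the WORKERS count molotov reports in its summary line:

```python
# Effective concurrency is workers * processes: a run with
#   molotov --workers 25 --processes 10 ...
# simulates 250 concurrent users, which molotov reports as WORKERS: 250.
workers, processes = 25, 10
print(workers * processes)  # 250
```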

> Finally, once we finish with this round of tests, we'd love to scale up the Heroku dynos on both services and see if load scales linearly. For example, if we have 10 dynos do we get 700 RPS from ensemble-transposer? That'll help us know how much to scale up if we ever need to.

Yeah. I have a few other projects I'm juggling, but I should have time for more testing once we've scaled to where you think you want to be. Or... we turn on auto-scaling and let Heroku manage adding/removing instances in step with the current demand. There's no point having 20 idle servers/dynos sitting around 24/7 if we only get traffic a few hours a day, or a huge traffic spike at launch and then a slow decline over time. Counter-proposal: hand off the site to OPs and they'll manage it all on AWS, do all the autoscaling for you, handle any spikes and outages and server restarts and deployments, give you stage+production environments, and take all the stress and 24/7 monitoring off your hands. Because if this goes down in production at 3pm on a Friday during an all-hands and stays offline until Sunday/Monday, that'd be pretty bad. Plus, OPs rule.

pdehaan commented 6 years ago

Oops, looks like I forgot to submit this...

A few more results... I'll try to do some longer-running tests today (although on hotel wifi, which can be spotty) and see if I can generate more errors on the "prod" guarded-plains transposer endpoint.

✗ molotov --duration 60 --workers 150 loadtest_json.py
**** Molotov v1.6. Happy breaking! ****
Preparing 150 workers...
SUCCESSES: 4892 | FAILURES: 0 | WORKERS: 150

81.533333333 RPS (150 workers)

---

✗ molotov --duration 90 --workers 250 loadtest_json.py
**** Molotov v1.6. Happy breaking! ****
Preparing 250 workers...
SUCCESSES: 6906 | FAILURES: 1 | WORKERS: 250

76.733333333 RPS (250 workers)

---

✗ molotov --duration 90 --workers 25 --processes 10 loadtest_json.py
**** Molotov v1.6. Happy breaking! ****
Forking 10 processes
[57559] Preparing 25 workers...
[57560] Preparing 25 workers...
[57563] Preparing 25 workers...
[57561] Preparing 25 workers...
[57564] Preparing 25 workers...
[57565] Preparing 25 workers...
[57562] Preparing 25 workers...
[57566] Preparing 25 workers...
[57567] Preparing 25 workers...
[57568] Preparing 25 workers...
SUCCESSES: 7898 | FAILURES: 1 | WORKERS: 250

87.755555556 RPS (250 workers)
openjck commented 6 years ago

Thank you for all this testing, Peter.

Peter thought it would be best if we pause on further testing until we move to AWS (#108). That works for me.

openjck commented 5 years ago

DataOps is now running this site behind a CDN. I'll address ensemble-transposer load in that repo.