Test if Saucelabs' iPhone tests are all broken now

nolanlawson commented 8 years ago

Saucelabs has been failing for iPhone/iPad in a lot of my repos recently. Want to test to see if it's happening here as well.

nolanlawson commented 8 years ago

Seems like more than just iOS is failing. IE as well.

nolanlawson commented 8 years ago

Hmm, Zuul output is typically like this:

- restarting: <iphone 7.0 on Mac 10.9>
- restarting: <ipad 7.0 on Mac 10.9>
- starting: <iphone 9.1 on Mac 10.10>
- restarting: <iphone 7.0 on Mac 10.9>
- starting: <iphone 8.4 on Mac 10.10>
- restarting: <ipad 7.0 on Mac 10.9>
- waiting: <iphone 9.1 on Mac 10.10>
- waiting: <iphone 8.4 on Mac 10.10>
- waiting: <iphone 9.1 on Mac 10.10>

filmaj commented 8 years ago

OK, so your tests rely on Sauce Connect (a tunnel). Those look to be starting and stopping fine - that's good.

If I could get your Sauce username, then I can look at the details of your account as well as individual jobs relating to each of the failures, and can give you a better idea of what is going on behind the scenes.

Couple things in this build:

It looks like there's some kind of module failure happening?

Error: Cannot find module 'node_modules/is-buffer/index.js' from '/home/travis/build/nolanlawson/fruitdown'

The start-of-build output implies there are a bunch of tests being queued up:

- testing: iphone @ Mac 10.10: 8.0 8.1 8.2 8.3 8.4 9.0 9.1 9.2
- queuing: <iphone 8.0 on Mac 10.10>
- queuing: <iphone 8.3 on Mac 10.10>
- queuing: <iphone 8.4 on Mac 10.10>
- queuing: <iphone 8.1 on Mac 10.10>
- queuing: <iphone 8.2 on Mac 10.10>

Depending on what kind of account you are under, these tests may be getting throttled. Free accounts have a certain maximum concurrency. If you can provide your Sauce username, I can look at the details and I'm sure we can work it out. Sauce runs a couple thousand jobs concurrently at any given time, across all these platforms fruitdown tests against, so I think there's just a missing link here somewhere.

I'm happy to help to get to the bottom of it and get you testing / CI'ing it up again!

nolanlawson commented 8 years ago

Thanks for the tip! My sauce username is nolan_lawson, and I'm pretty sure that is-buffer is unrelated, since the tests passed fine on Travis using their Firefox instance. (I think it might be a Browserify warning, but I'm not sure.)

If I'm getting throttled, then maybe this is really a bug in Zuul that it allows so many browsers to get queued up at once? Or maybe it's my fault because I should be trimming down my .zuul.yml?

filmaj commented 8 years ago

I'm not familiar with zuul, but let me dig into your account details and come back with more info.

Any chance we can get Travis / Zuul to output the timing of when it "queues" up these tests? That might be helpful.

I'll report back shortly.

filmaj commented 8 years ago

If I'm parsing out your user account details properly, I don't think you actually have a concurrency limit, but I'm getting someone at Sauce with more familiarity with this side of the business to verify that.

In the meantime, your tests dashboard shows a bunch of potential problems:

[Screenshot: tests dashboard, 2016-03-06 11:55 am]

Let's go over the errors one by one and I'll lay down theories as to what is going on:

- Test did not see a command for 90s. I see these for promise-worker tests. If a VM starts in our cloud, and its Selenium server establishes a connection with the client, it will wait for a maximum of 90s without receiving a command before it shuts down. Seeing this error would imply that your test runner is not sending commands to Sauce.
- New session request was cancelled before a VM could be found. All of your repos seem to be getting this error. For context, the very first command issued by a Selenium test is an HTTP POST to /session, which sets the groundwork for the kind of environment the test needs. It's in this phase that Sauce figures out what kind of VM to boot up for you. That said, Sauce pre-boots VMs of different environments, to try to ensure that the job-to-VM assignment time is as low as possible. This may or may not be "user error" - I will dig in for more details at some point and get back to you on this one.
- Test exceeded maximum duration after 1800 seconds. This only seems to be happening for fruitdown tests. I see these for tests that are running for 30+ minutes. Looking at the test details, indeed, these are singular tests running for that long, issuing commands the whole time. I think you'll need to re-evaluate how you are doing testing here. Ideally, a single test covers a focussed test case. If you are mixing assertions / scenarios together into a single test run, then a failure in an earlier assertion necessarily will fail a later assertion. Splitting these huge tests up into smaller ones helps you more easily identify the specific failures, as well as parallelize your testing more easily.
- The Sauce VMs failed to start the browser or device - this is almost always a legit Sauce problem. I will take a look!

Hopefully this can get us started in the right direction. It looks like it's a combination of both test runner problems on your end as well as Sauce problems on my end. I'll do what I can to investigate and help out! I have a busy day today (Sunday) but I'm happy to look into it as we head into the week.

Please don't be shy about pinging me on twitter / github to bring my attention back to here if I lose track after a couple of days. I will try not to let it get to that but, y'know, work's busy and stuff :P

filmaj commented 8 years ago

OK, more info on your account. You are under an open source "medium" account, which gives you a max concurrency of 10 VMs. That likely has something to do with the failures / long wait times for some of your tests.

nolanlawson commented 8 years ago

Hm, yeah, the long tests are probably not something I can easily fix. The root of those issues is that FruitDOWN is testing Apple's implementation of IndexedDB, which is really crazily slow. If Saucelabs recently lowered the limit or something, then that could explain these failures.

BTW in case it's not clear, these tests were passing a few months ago, and this build just demonstrates that nothing changed on my part and yet the tests are all failing. You can see the list of Travis builds here.

It's also possible that something changed in my npm dependencies (e.g. Zuul), although it would have to be something across all of my repos (see related PRs, which are fairly unrelated repos and yet are all failing for more-or-less the same reasons). So if it's not something on the Saucelabs side, then it's most likely Zuul.

nolanlawson commented 8 years ago

OK, it turns out this is indeed a Zuul issue: https://github.com/defunctzombie/zuul/issues/270. Sorry for blaming SauceLabs.

filmaj commented 8 years ago

Eh I actually think there is still a problem on Sauce's end - the "could not start VM" is more often than not a problem that deserves digging deeper. I'll pull up logs from those VMs, at least, so we get a better idea what's going on.

By the way, how do you manage the concurrency limit across multiple projects? Say if fruitdown uses 10 VMs concurrently, and so do your other projects, isn't there a chance you could be throttling yourself into timeouts?

nolanlawson commented 8 years ago

Yeah I basically just cross my fingers and hope I don't have many repos running their tests at once. :sweat_smile:

nolanlawson commented 8 years ago

Opened #6 to try to fix this issue by downgrading Zuul.

filmaj commented 8 years ago

Confirmed the concurrency limits for a medium account have not changed since at least before August 2013. I'll take a peek at the test failures under your account now to see if there's anything in there worth investigating more.

I might also take a look at the tests you have in these repos. 30-minute-long test runs usually imply you are testing multiple different assertions spanning several test cases in a single test environment.

Finally, we should see what kind of options the Sauce REST API offers to check your available concurrency and build logic around that into the test runner, especially if you use one account for multiple projects with different testing infrastructure needs.

filmaj commented 8 years ago

I've dug in a little bit.

First, I'm using this passing test running on Chrome on Windows XP as a baseline. It takes about 20 seconds to run the test suite. The flow looks something like this: a URL is loaded, zuul starts executing the tests, and the test runner continually executes a bit of JavaScript (the POST to the /execute route in Selenium-speak) to check if the tests are complete (the JavaScript in question is return (window.zuul_msg_bus ? window.zuul_msg_bus.splice(0, 1000) : []);;). In the success case, when the tests are complete, that JavaScript will return [{"stats":{"failed":0,"passed":187,"pending":0},"type":"done","passed":true}]. Sweet.
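For illustration, the polling flow described above can be sketched roughly as follows. This is only a sketch, not zuul's actual source: drainMessages and findDoneMessage are hypothetical names, the bus array stands in for window.zuul_msg_bus, and the message shape is taken from the success response quoted above.

```javascript
// Hypothetical helper mirroring the polled snippet: remove and return up to
// 1000 queued messages, or an empty array when no bus exists yet.
function drainMessages(bus) {
  return bus ? bus.splice(0, 1000) : [];
}

// Hypothetical helper: return the first "done" message in a batch, or null.
function findDoneMessage(messages) {
  return messages.find((msg) => msg.type === 'done') || null;
}

// Stand-in for window.zuul_msg_bus, holding the success message quoted above.
const bus = [
  { stats: { failed: 0, passed: 187, pending: 0 }, type: 'done', passed: true },
];

const done = findDoneMessage(drainMessages(bus));
console.log(done ? `done, passed=${done.passed}` : 'still running');
```

If no message of type "done" ever arrives on the bus, a loop built on this check would keep reporting "still running" until some outer timeout stops it.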

The only failures that I see where something is really wrong are the tests running on iOS simulators. These are the environments where the tests run too long, or a command timeout happens.

Here's a command-timeout failure. This test runs for 20 minutes! Something is up. Digging in a little, you can see that the execute-javascript-to-check-for-test-completion polling is messed up. About a minute and 36 seconds into the test, the test page header turns green (which I think signals that the tests are done?). However, the javascript-execution that is supposed to return a stats object with a passed property of true never shows up. What makes this harder for me to diagnose is that I can't find in your account's history a case where the fruitdown tests passed on iOS. I can try to run these tests locally on my laptop (I have the iOS environment set up and am running Mac 10.10 myself) to see if this is a problem with either zuul, the fruitdown tests in particular, appium (the selenium server for mobile), or something in Sauce.

The other failures related to timeout are identical: the zuul-JS-are-the-tests-passed loop fails to return an expected result. This test, which was reported as exceeding the 1800 second test limit, spins on the same issue. Similarly, the header of the test page turns green about 1:40 into the test, but the test continues on for a full 32 minutes.

I think if we can get to the bottom of this, your test reliability will go way up, and your build times will go way down. I'm happy to help!

Finally, I found a bunch of tests that errored out, straight up, due to concurrency limits being hit: "You've exceeded your Sauce Labs concurrency limit. This test was throttled and ultimately timed out waiting for a free slot to run". I suggest we open up a new issue (where? in this project? Zuul? where is the appropriate place?) to try to figure out how we can have the test runner be smarter about that. Let me know if you have ideas there.

filmaj commented 8 years ago

Regarding throttling based on concurrency, I think we can leverage Sauce's concurrency REST API to programmatically get concurrency limits as well as current-active concurrency in use. Here's an example of the response I got from that API about my own personal account, when I was running a single test:

$ curl -u filmaj:$SAK https://saucelabs.com/rest/v1.1/users/filmaj/concurrency
{"timestamp": 1457332998.466679, "concurrency": {"self": {"username": "filmaj", "current": {"overall": 1, "mac": 1, "manual": 0}, "allowed": {"manual": 100, "mac": 100, "overall": 100, "real_device": 30}}, "ancestor": {"username": "filmaj", "current": {"overall": 1, "mac": 1, "manual": 0}, "allowed": {"manual": 100, "mac": 100, "overall": 100, "real_device": 30}}}}

The key bits there we could leverage are the "current" and "allowed" concurrency information. If we can build that into the test runner for all of your projects, then we can have the runner wait patiently for available concurrency before queueing up tests (and having them potentially fail and time out).
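As a rough sketch of that idea, a runner could work out how many free slots it has from a response shaped like the one above. availableSlots is a hypothetical helper, not part of zuul or any Sauce client, and it assumes the "self" entry is the one that matters; fetching the JSON is left out.

```javascript
// Hypothetical helper: how many more VMs could start right now, based on the
// "self" entry of a concurrency response shaped like the sample above.
function availableSlots(response) {
  const { current, allowed } = response.concurrency.self;
  return Math.max(0, allowed.overall - current.overall);
}

// Trimmed copy of the sample response quoted above.
const sampleResponse = {
  timestamp: 1457332998.466679,
  concurrency: {
    self: {
      username: 'filmaj',
      current: { overall: 1, mac: 1, manual: 0 },
      allowed: { manual: 100, mac: 100, overall: 100, real_device: 30 },
    },
  },
};

console.log(availableSlots(sampleResponse)); // 99
```

A runner could call something like this before each batch and only queue as many browsers as there are free slots, waiting and re-checking when the result is 0.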

nolanlawson commented 8 years ago

Thanks for the analysis; seems this most definitely could be either a Zuul bug or a Saucelabs bug. BTW you may have better results if you test one of my other projects instead, since this one has unusually lengthy tests (blob-util is probably a good candidate).

Also there is still some more discussion going on in https://github.com/defunctzombie/zuul/issues/270; you might want to check out what people are posting there, because there are more test cases to reproduce this.

filmaj commented 8 years ago

Cheers, will do.

nolanlawson commented 8 years ago

fixed in latest zuul
