TC Build times out on testGeckoDebugUnitTest

nojunpark commented 5 years ago

Steps to reproduce

Currently, two latest master branch TC runs have been failing with same reason. https://tools.taskcluster.net/groups/IZyLi8tlS4She-_BKkg7_w/tasks/IZyLi8tlS4She-_BKkg7_w/runs/0/logs/public%2Flogs%2Flive.log

> Task :app:compileGeckoDebugUnitTestJavaWithJavac
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

> Task :app:mergeGeckoDebugShaders
> Task :app:compileGeckoDebugShaders
> Task :app:generateGeckoDebugAssets
> Task :app:mergeGeckoDebugAssets
> Task :app:packageGeckoDebugUnitTestForUnitTest
> Task :app:generateGeckoDebugUnitTestConfig
> Task :app:processGeckoDebugJavaRes NO-SOURCE
> Task :app:processGeckoDebugUnitTestJavaRes
> Task :app:testGeckoDebugUnitTest

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2019-06-05 19:39:56.339Z] === Task Finished ===

nojunpark commented 5 years ago

This causes subsequent tests, including bitbar UI tests to not run.

dnarcese commented 5 years ago

It passes locally for me.

dnarcese commented 5 years ago

Changing the value for a property with a final value has been deprecated. This will fail with an error in Gradle 6.0.. Maybe this is the reason for failure

mcomella commented 5 years ago

I also saw this when doing the latest version bump in https://github.com/mozilla-mobile/firefox-tv/pull/2377

Changing the value for a property with a final value has been deprecated. This will fail with an error in Gradle 6.0.. Maybe this is the reason for failure

I think this is unrelated: it's a deprecation notice in the gradle APIs but theoretically should not indicate broken functionality.

mcomella commented 5 years ago

I added --info to the TC job in https://github.com/mozilla-mobile/firefox-tv/pull/2415 – we should let the PR fail and identify which test it's failing on.

Note that there are two possibilities:

testGeckoDebugUnitTest is timing out
(Severin pointed this out) Something before testGeckoDebugUnitTest is taking a long time, which causes testGeckoDebugUnitTest to time out

mcomella commented 5 years ago

I wonder if this is related to the a-c or dependency upgrade in #2347.

mcomella commented 5 years ago

I've retrigger 4+ times and I haven't been able to get #2415 to fail – I wonder if there's a race condition.

mcomella commented 5 years ago

I've retrigger 4+ times and I haven't been able to get #2415 to fail – I wonder if there's a race condition.

I still can't get it to fail. Is it still a problem in other PRs?

As for other options for debugging, I wonder if we can reproduce locally using our Docker image.

dnarcese commented 5 years ago

I removed the tasks testGeckoDebugUnitTest and testGeckoReleaseUnitTest and it timed out on testSystemReleaseUnitTest.

mcomella commented 5 years ago

My PR in #2415 failed:

The failure happened on:

org.mozilla.tv.firefox.pocket.PocketVideoRepoTest > GIVEN pocket key is not valid WHEN getting a new instance THEN feed state starts no api key STANDARD_OUT
    [Robolectric] org.mozilla.tv.firefox.pocket.PocketVideoRepoTest.GIVEN pocket key is not valid WHEN getting a new instance THEN feed state starts no api key: sdk=28; resources=binary

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
[taskcluster 2019-06-12 20:17:05.911Z] === Task Finished ===

We could investigate this test or see if there is another task that is taking a long time, making this task run so late this fails.

In this test, specifically:

assertEquals(FeedState.NoAPIKey, repo.feedState.blockingFirst())

Maybe blockingFirst() is blocking forever? That being said, I'd expect junit to kill the test after some timeout: is that not true?

nojunpark commented 5 years ago

@mcomella to my knowledge, you have to specify the timeout rule with either the test annotation, or use the Timeout class. I don't think it comes with default timeout for unit tests (I believe UI tests do have timeout value though)

This link might help: https://github.com/junit-team/junit4/wiki/timeout-for-tests

mcomella commented 5 years ago

@dnarcese Did you say you tracked down a solution that, by increasing the heap size of our gradle daemon, this is no longer issue? If so, should we write and land a patch? In any case, increasing the size of the gradle daemon generally seems like a good thing to do, given the increasing size of our project.

dnarcese commented 5 years ago

@mcomella Yes, it seems that increasing gradle daemon size is fixing the issue. I have been testing in this PR https://github.com/mozilla-mobile/firefox-tv/pull/2431. I will clean up those commits and request review.

mcomella commented 5 years ago

Note: dnarcese has a lot of notes on what was tried in #2431.

liuche commented 5 years ago

Next steps: @dnarcese will pair with @mcomella on this.

There is a workaround, but the goal here is to find the underlying cause. Sizing L because it can't be reproed locally, not consistent on TC, takes a long time to run/test. Will also need to communicate w/ other teams (releng, QA) to solve this. Build process is also difficult.

mcomella commented 5 years ago

We've been working with bstack to determine if this is a hardware performance issue. bstack's theory is:

fire-tv builds are more resource intensive and/or more frequent than other tasks we've had before such that enough of them are piling up on single workers at once to cause OOMing

He opened a PR to "use more recent instance types and get you on instances where a single build takes the whole machine (probably means your own instance type)": https://phabricator.services.mozilla.com/D36660

We're waiting for that to land and seeing if the intermittent is still a problem. In parallel, we've been trying a few other approaches to see if we can fix the bug in other ways.

mcomella commented 5 years ago

In my test PR https://github.com/mozilla-mobile/firefox-tv/pull/2505, I discovered that some tests actually timeout:

org.mozilla.tv.firefox.pocket.PocketVideoRepoTest > GIVEN pocket key is not valid WHEN getting a new instance THEN feed state starts no api key FAILED
    org.junit.runners.model.TestTimedOutException at PocketVideoRepoTest.kt:106

org.mozilla.tv.firefox.pocket.PocketVideoRepoTest > GIVEN pocket key is valid WHEN getting a new instance THEN feed state starts inactive FAILED
    org.junit.runners.model.TestTimedOutException at PocketVideoRepoTest.kt:100

We should do a quick root cause analysis and, if it'd take too long to fix or find the root cause, consider @Ignore the tests.

nojunpark commented 5 years ago

@mcomella Just curious, does it still time out if you increase the heap size too? (I know we have the better aws worker type, wondering heap size is changed as well)

In my test PR #2505, I discovered that some tests actually timeout:

mcomella commented 5 years ago

I'm temporarily disabling the intermittently failing tests because it's blocking the current release (4.0.0? v3.12?): https://github.com/mozilla-mobile/firefox-tv/pull/2517

liuche commented 5 years ago

@mcomella will investigate newly ignored tests + root cause and file, and land some mitigation with this bug.

mcomella commented 5 years ago

On Fri, Jul 12, 2019 at 11:27 AM liuche notifications@github.com wrote:

@mcomella https://github.com/mcomella will investigate newly ignored tests + root cause and file, and land some mitigation with this bug.

This is https://github.com/mozilla-mobile/firefox-tv/issues/2539 now.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mozilla-mobile/firefox-tv/issues/2378?email_source=notifications&email_token=AAFZMTBOADLRIA5B2VXHT5LP7DEKFA5CNFSM4HVCLSC2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ2QU6Q#issuecomment-510986874, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFZMTEN43GGNFTGRBWTRN3P7DEKFANCNFSM4HVCLSCQ .

mozilla-mobile / firefox-tv

TC Build times out on testGeckoDebugUnitTest #2378

Steps to reproduce