joyeecheung opened this issue 6 years ago
BTW, on why I chose to include the full URL in the flakes.draft.json: the Microsoft acquisition thing (it's unlikely that we will want to move, but anyway).
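(The exact schema of flakes.draft.json isn't shown in this thread; the hypothetical entry below only illustrates the point that an absolute URL keeps working even if the issues ever move off github.com, whereas a bare "nodejs/node#27611"-style reference would not.)

```json
{
  "test": "sequential/test-cpu-prof",
  "reference": "https://github.com/nodejs/node/issues/27611"
}
```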
@joyeecheung @nodejs/tsc Should this/could this be a strategic initiative?
Also, what is the current state? Flakes seem to be tracked as issues in nodejs/node, and I'd never seen this repo until moments ago!
FYI, I've contacted some folks who have offered to look at Windows platform issues in the past, and got promises that some of them will look into the Windows flakiness.
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-master/ is discouraging, and helps explain why I've had so much trouble getting trivial PRs green in CI recently.
I've started adding issues and PRing flaky status to tests. We "should" fix them, but while they are not fixed, it's not helpful to have unrelated test failures on PR builds.
> https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-master/ is discouraging, and helps explain why I've had so much trouble getting trivial PRs green in CI recently.
Currently @Trott and myself get emails when "Daily Master" fails. I wanted to set up a mailing list or an RSS/ATOM feed for that so anyone can easily track its status.
> I've started adding issues and PRing flaky status to tests. We "should" fix them, but while they are not fixed, it's not helpful to have unrelated test failures on PR builds.
IMHO that's currently our best way to structure the known-flaky DB as test configuration. But without a champion those can turn from a temporary measure into permanent "solutions", as we do have some tests that have been flaky for a long time, e.g. https://github.com/nodejs/node/issues/23207, https://github.com/nodejs/node/issues/26401, https://github.com/nodejs/node/issues/20750.
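For readers unfamiliar with how "flaky status" is expressed: the known-flaky markings live in the .status files that the test runner reads, e.g. test/parallel/parallel.status. A minimal sketch, with a hypothetical test name:

```
prefix parallel

[true] # This section applies to all platforms

[$system==win32]
# Hypothetical entry: keep running the test on Windows, but don't fail CI on it.
test-some-flaky-case: PASS,FLAKY
```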
If it's a strategic initiative, the current count of ignored tests will get reported at every TSC meeting. No guarantee they will get fixed, of course, but at least they will be visible.
I'd like to find a place to point contributors at the flakes issues as good first contributions, but couldn't find one. I was hoping to find somewhere in our docs that points to the good first issue label, but the only reference I found was in doc/onboarding-extras.md, which seems a bit obscure. Is there no "so, you want to help out with Node.js, great, look here" section of our docs? Perhaps I should simply label all the CI flakes as good first issue?
I think they are good first contributions because they often require no knowledge of Node internals. If the tests are written entirely in terms of the public Node.js API, then someone with sufficient knowledge of the API alone could take a crack at any of them. For example, some expert user of workers might not have the C++ ability to contribute to the worker implementation, but be perfectly capable of reasoning carefully about the worker tests and making them more robust. Fixing a flake also adds appreciable value to the project, so they make great drive-by contributions for those so inclined.
I'd use help wanted, and maybe even mentor available, for the flaky tests, but definitely not good first issue. They often require very specific knowledge and/or access to a particular platform, often at a particular version. There's also a lot of what you might call "institutional knowledge" about how to reliably reproduce the flakiness, since a straightforward run of node-stress-single-test often doesn't do it.
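For anyone attempting a local reproduction anyway, the rough equivalent of a "straightforward run" (a sketch, assuming tools/test.py in a nodejs/node checkout still supports the -j and --repeat flags, and using a hypothetical test name):

```sh
# Run a single test many times, in parallel, to try to trigger the flake locally.
# Assumed flags: -j sets worker parallelism, --repeat sets the number of iterations.
python tools/test.py -j 8 --repeat 1000 parallel/test-some-flaky-case
```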
Addressing CI failures requires organized and continued teamwork. Individual efforts are not sustainable, and people get worn out easily.
I agree with @Trott, these are definitely not good-first-issues.
Often, the problem determination of these failures takes 2 phases: first reproducing the failure reliably, then isolating and fixing the root cause.
The second part is really hard.
I am +1 on running this through the TSC.
The idea proposed by the OP has never been used in practice. Sometimes I use ncu-ci walk pr to gather suspicious failures and their patterns and open a thread here, but I have not been able to do so for some time now, and ncu-ci now sometimes has trouble understanding Windows failures (somehow neither does the Jenkins CI understand those). The source of truth about the flakes is still the nodejs/node issue tracker.
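For anyone who wants to try that workflow, ncu-ci ships with node-core-utils; a minimal sketch (the exact flags and output have changed over time, so treat this as illustrative):

```sh
# Install node-core-utils, which provides the ncu-ci command.
npm install -g node-core-utils

# Walk recent PR CI runs on ci.nodejs.org and summarize suspicious failures.
ncu-ci walk pr
```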
I think it's worth being a strategic initiative, but we'll need a champion (or more) who has enough time to keep an eye on this.
Arguably, this could be a part of the existing strategic initiative for Build WG resources.
I'm definitely +1 as a strategic initiative if we can find a champion or if @Trott is volunteering by integrating it into the existing Build strategic initiative. I agree raising the awareness of the flakes by reviewing the numbers in the TSC meeting would be a great first step.
I'm happy to highlight a particularly problematic test during the announcements section of each meeting when I'm in attendance. Maybe sam-github or someone else can do it when I'm not around. This week, it would have to be the way sequential/test-cpu-prof is now timing out at or near 100% of the time on ubuntu1604_sharedlibs_debug_x64, making it nearly impossible to land anything that isn't a doc change. This started about 48 hours ago, so only stuff that had CI runs before then tends to be landing right now. :-(
I'll think about what I have time to do (vs what I wish I could do with more time).
I looked at the build strategic initiative, and I'm not sure it is a fit for this. It is related, in that build infrastructure failures can manifest as CI failures: problems with host memory, disk, Jenkins disconnections, tap2xml failures, etc. Those are within scope of build resources.
However, the flaky tests in particular seem to be solidly non-build; they need developer effort. The build team doesn't seem to me like it should take on the extra scope of fixing unreliable github.com/nodejs/node unit tests; it's strained already.
> This week, it would have to be the way sequential/test-cpu-prof is now timing out at or near 100% of the time on ubuntu1604_sharedlibs_debug_x64, making it nearly impossible to land anything that isn't a doc change.
Refs: https://github.com/nodejs/node/issues/27611
Also, Joyee has a proposed fix as of 7 hours ago: https://github.com/nodejs/node/pull/28170
@Trott while highlighting an urgent problem will have some benefit, I think reporting on the trend in terms of CI stability would possibly be more useful, i.e. are we slowly degrading or getting better, and how quickly? That is something I think the TSC should be on top of and work to try to influence, as opposed to simply informing the TSC that some issue needs more help right now and hoping the TSC members have time to help out on that specific issue. Just my 2 cents though.
Specifically
After we settle on those, I can start implementing the actions in ncu-ci to automate the CI status reporting and flake-tracking process.
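As a very rough sketch of what that automation could start from (not the actual ncu-ci implementation), the TypeScript below counts open flaky-test issues via the GitHub search API. It assumes Node.js 18+ for the global fetch and that flaky tests are tagged with a flaky-test label in nodejs/node; adjust the query if the label differs.

```ts
// Sketch: count open flaky-test issues in nodejs/node via the GitHub search API.
// Assumptions: Node.js 18+ (global fetch) and that `flaky-test` is the label
// actually used for these issues.

interface SearchResult {
  total_count: number;
  items: { title: string; html_url: string }[];
}

async function reportFlakes(): Promise<void> {
  const query = encodeURIComponent('repo:nodejs/node is:issue is:open label:flaky-test');
  const res = await fetch(`https://api.github.com/search/issues?q=${query}`, {
    headers: { accept: 'application/vnd.github+json' },
  });
  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);

  const data = (await res.json()) as SearchResult;
  console.log(`Open flaky-test issues: ${data.total_count}`);
  for (const issue of data.items) {
    console.log(`- ${issue.title} (${issue.html_url})`);
  }
}

reportFlakes().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Something like this could also feed the per-meeting count of ignored tests mentioned earlier in the thread.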