rust-lang / rustc-perf

Website for graphing performance of rustc
https://perf.rust-lang.org

Decide how to handle "probably" relevant comparisons #983

Open Mark-Simulacrum opened 3 years ago

Mark-Simulacrum commented 3 years ago

This comparison was categorized as probably relevant - https://perf.rust-lang.org/compare.html?start=7611fe438dae91084d17022e705bf64374d5ba4b&end=bcfd3f7e88084850f87b8e34b4dcb9fceb872d00&stat=instructions:u - but I think it is definitely so, due to the wide range of -doc benchmarks affected.

rylev commented 3 years ago

> wide range of -doc benchmarks affected

Do you want to try to come up with a threshold for how many test case results with the same profile but small magnitudes should be considered "definitely" relevant?

Mark-Simulacrum commented 3 years ago

I think that probably makes sense; it'll probably be worth iterating on but maybe we can start by saying something like 10 benchmarks pointing in the same direction from the same profile is enough to bump us to definitely? Not sure if 10 will feel right in the long run, but it feels like a good start.
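The proposed heuristic can be sketched roughly as follows. This is an illustrative sketch only, not rustc-perf's actual code; the type and function names are assumptions, and the threshold of 10 is the starting point suggested above.

```rust
use std::collections::HashMap;

/// Direction of a significant change for one test case result.
/// (Hypothetical type for illustration.)
#[derive(Clone, Copy, PartialEq)]
enum Direction {
    Regression,
    Improvement,
}

/// Starting threshold from the discussion: 10 results pointing in the
/// same direction within a single profile bumps "probably" to "definitely".
const SAME_DIRECTION_THRESHOLD: usize = 10;

/// `results` pairs a profile name (e.g. "doc", "check", "opt") with the
/// direction each significant test case moved.
fn is_definitely_relevant(results: &[(&str, Direction)]) -> bool {
    let mut counts: HashMap<(&str, Direction2), usize> = HashMap::new();
    // Count same-direction results per profile.
    for &(profile, dir) in results {
        let key = (profile, if dir == Direction::Regression { Direction2::Reg } else { Direction2::Imp });
        *counts.entry(key).or_insert(0) += 1;
    }
    counts.values().any(|&n| n >= SAME_DIRECTION_THRESHOLD)
}

// Hashable direction key (Direction itself only derives PartialEq here).
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Direction2 { Reg, Imp }

fn main() {
    // Twelve -doc benchmarks all regressing: clears the threshold.
    let doc: Vec<_> = (0..12).map(|_| ("doc", Direction::Regression)).collect();
    assert!(is_definitely_relevant(&doc));

    // A few scattered results in different profiles: stays "probably".
    let few = [
        ("check", Direction::Regression),
        ("opt", Direction::Improvement),
        ("doc", Direction::Regression),
    ];
    assert!(!is_definitely_relevant(&few));
    println!("ok");
}
```

The per-profile grouping matters: 10 regressions spread thinly across many profiles would not trip this check, matching the "same profile" wording above.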

Mark-Simulacrum commented 3 years ago

https://perf.rust-lang.org/compare.html?start=b6057bf7b7ee7c58e6a39ead02eaa13b75f908c2&end=c02371c442f811878ab3a0f5a813402b6dfd45d2&stat=instructions:u is a similar case where the regression consistently shows up across the many-assoc-items benchmark -- all of check,debug,opt are regressed.

It was judged "probably" rather than definitely, but seems like it should have just been in the mixed category.

Mark-Simulacrum commented 3 years ago

https://perf.rust-lang.org/compare.html?start=a8387aef8c378a771686878062e544af4d5e2245&end=b27661eb33c74cb514dba059b47d86b6582ac1c2&stat=instructions:u is another example -- await-call-tree shows several profiles regressing.

Mark-Simulacrum commented 3 years ago

https://perf.rust-lang.org/compare.html?start=d14731cb3ced8318d7fc83cbe838f0e7f2fb3b40&end=f03eb6bef8ced8a243858b819e013b9caf83d757&stat=instructions:u

rylev commented 3 years ago

> https://perf.rust-lang.org/compare.html?start=d14731cb3ced8318d7fc83cbe838f0e7f2fb3b40&end=f03eb6bef8ced8a243858b819e013b9caf83d757&stat=instructions:u

Did this catch your eye because all of the improvements are in incremental?

> https://perf.rust-lang.org/compare.html?start=b6057bf7b7ee7c58e6a39ead02eaa13b75f908c2&end=c02371c442f811878ab3a0f5a813402b6dfd45d2&stat=instructions:u is a similar case where the regression consistently shows up across the many-assoc-items benchmark -- all of check,debug,opt are regressed.
>
> It was judged "probably" rather than definitely, but seems like it should have just been in the mixed category.

Hmm... this seems pretty borderline to me. I agree it's probably worth a look, but the categorization of probably relevant seems correct over definitely relevant.

Mark-Simulacrum commented 3 years ago

> https://perf.rust-lang.org/compare.html?start=d14731cb3ced8318d7fc83cbe838f0e7f2fb3b40&end=f03eb6bef8ced8a243858b819e013b9caf83d757&stat=instructions:u
>
> Did this catch your eye because all of the improvements are in incremental?

No, await-call-tree across 3 profiles. The incremental improvements are less 'interesting' typically, particularly when looking at incr-unchanged benchmarks -- those just have less going on so smaller changes are more likely to show up as significant.

Mark-Simulacrum commented 3 years ago

> Hmm... this seems pretty borderline to me. I agree it's probably worth a look, but the categorization of probably relevant seems correct over definitely relevant.

Hm -- so maybe our understanding is different, but the way I see it, if we expect a human to go investigate, then it should be autocategorized as definitely. If that investigation is short, fine, but right now we would have missed this result without the perf triage reports (if we limited ourselves to perf-regression labels), which seems unfortunate.

rylev commented 3 years ago

Our understanding of these categories is definitely different. It seems you have a "when in doubt, it should be in triage" perspective. The original purpose of these categories was to distinguish comparisons where we were sure that a performance change had happened (and we just need a human to investigate the cause and determine whether the code change in question is worth accepting the performance penalty) from changes where we're unsure (although fairly confident) whether an actual performance change happened or it's just noise (and we need a human to first determine that the change is real and not noise). This is why the names are "definitely relevant" and "probably relevant" -- they describe this distinction pretty well.

In particular, originally I purposefully created a higher bar for triage than for perf-regression labels. This is why anything that is labeled as "probably relevant" or "definitely relevant" gets a perf-regression label but only "definitely relevant" changes get included in triage. Triage was to be reserved for changes where we are sure that there is an actual performance change.

It sounds like perhaps you're advocating for getting rid of the "probably relevant" distinction (since we've gotten relatively good at filtering out noise), so that we only distinguish between "definitely relevant" and "maybe relevant" -- the former we label with perf-regression AND note in the triage report, and the latter we simply ignore (as we currently do with such cases). Thoughts?
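The current scheme versus the proposed two-bucket scheme can be summarized as a small mapping. This is an illustrative sketch; the type names and function signatures are assumptions, not rustc-perf's actual API.

```rust
/// Relevance categories discussed in this thread.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Category {
    Definitely,
    Probably,
    Maybe,
}

/// Returns (gets perf-regression label, included in triage report)
/// under the current scheme: "probably" is labeled but kept out of triage.
fn current_scheme(c: Category) -> (bool, bool) {
    match c {
        Category::Definitely => (true, true),
        Category::Probably => (true, false),
        Category::Maybe => (false, false),
    }
}

/// Under the proposed scheme, "probably" disappears: anything worth
/// acting on is "definitely" (label + triage), and everything else
/// is "maybe" and is simply ignored.
fn proposed_scheme(c: Category) -> (bool, bool) {
    match c {
        Category::Definitely => (true, true),
        _ => (false, false),
    }
}

fn main() {
    assert_eq!(current_scheme(Category::Probably), (true, false));
    assert_eq!(proposed_scheme(Category::Probably), (false, false));
    assert_eq!(proposed_scheme(Category::Definitely), (true, true));
    println!("ok");
}
```

The only behavioral difference between the two schemes is what happens to "probably": today it gets a label without triage, while the proposal drops it into the ignored bucket.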

rylev commented 3 years ago

Based on a conversation with @Mark-Simulacrum it sounds like we might want to do the following (please correct me if I'm misremembering something):