rust-lang / rustc-perf

Website for graphing performance of rustc
https://perf.rust-lang.org

Decide how to handle "probably" relevant comparisons #983

Open Mark-Simulacrum opened 3 years ago

Mark-Simulacrum commented 3 years ago

This comparison was categorized as probably relevant - https://perf.rust-lang.org/compare.html?start=7611fe438dae91084d17022e705bf64374d5ba4b&end=bcfd3f7e88084850f87b8e34b4dcb9fceb872d00&stat=instructions:u - but I think it is definitely so, due to the wide range of -doc benchmarks affected.

rylev commented 3 years ago

> wide range of -doc benchmarks affected

Do you want to try to come up with a threshold for how many test case results with the same profile but small magnitudes should be considered "definitely" relevant?

Mark-Simulacrum commented 3 years ago

I think that probably makes sense; it'll probably be worth iterating on but maybe we can start by saying something like 10 benchmarks pointing in the same direction from the same profile is enough to bump us to definitely? Not sure if 10 will feel right in the long run, but it feels like a good start.
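The proposed heuristic can be sketched roughly as follows. This is an illustrative sketch only, not rustc-perf's actual code; the type and function names are assumptions, and the threshold of 10 is the starting point suggested above.

```rust
use std::collections::HashMap;

/// Direction of a significant change for one test case result.
/// (Hypothetical type for illustration.)
#[derive(Clone, Copy, PartialEq)]
enum Direction {
    Regression,
    Improvement,
}

/// Starting threshold from the discussion: 10 results pointing in the
/// same direction within a single profile bumps "probably" to "definitely".
const SAME_DIRECTION_THRESHOLD: usize = 10;

/// `results` pairs a profile name (e.g. "doc", "check", "opt") with the
/// direction each significant test case moved.
fn is_definitely_relevant(results: &[(&str, Direction)]) -> bool {
    let mut counts: HashMap<(&str, Direction2), usize> = HashMap::new();
    // Count same-direction results per profile.
    for &(profile, dir) in results {
        let key = (profile, if dir == Direction::Regression { Direction2::Reg } else { Direction2::Imp });
        *counts.entry(key).or_insert(0) += 1;
    }
    counts.values().any(|&n| n >= SAME_DIRECTION_THRESHOLD)
}

// Hashable direction key (Direction itself only derives PartialEq here).
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Direction2 { Reg, Imp }

fn main() {
    // Twelve -doc benchmarks all regressing: clears the threshold.
    let doc: Vec<_> = (0..12).map(|_| ("doc", Direction::Regression)).collect();
    assert!(is_definitely_relevant(&doc));

    // A few scattered results in different profiles: stays "probably".
    let few = [
        ("check", Direction::Regression),
        ("opt", Direction::Improvement),
        ("doc", Direction::Regression),
    ];
    assert!(!is_definitely_relevant(&few));
    println!("ok");
}
```

The per-profile grouping matters: 10 regressions spread thinly across many profiles would not trip this check, matching the "same profile" wording above.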

Mark-Simulacrum commented 3 years ago

https://perf.rust-lang.org/compare.html?start=b6057bf7b7ee7c58e6a39ead02eaa13b75f908c2&end=c02371c442f811878ab3a0f5a813402b6dfd45d2&stat=instructions:u is a similar case where the regression consistently shows up across the many-assoc-items benchmark -- all of check,debug,opt are regressed.

It was judged "probably" rather than definitely, but seems like it should have just been in the mixed category.

Mark-Simulacrum commented 3 years ago

https://perf.rust-lang.org/compare.html?start=a8387aef8c378a771686878062e544af4d5e2245&end=b27661eb33c74cb514dba059b47d86b6582ac1c2&stat=instructions:u is another example -- await-call-tree shows several profiles regressing.

Mark-Simulacrum commented 3 years ago

https://perf.rust-lang.org/compare.html?start=d14731cb3ced8318d7fc83cbe838f0e7f2fb3b40&end=f03eb6bef8ced8a243858b819e013b9caf83d757&stat=instructions:u

rylev commented 3 years ago

> https://perf.rust-lang.org/compare.html?start=d14731cb3ced8318d7fc83cbe838f0e7f2fb3b40&end=f03eb6bef8ced8a243858b819e013b9caf83d757&stat=instructions:u

Did this catch your eye because all of the improvements are in incremental?

> https://perf.rust-lang.org/compare.html?start=b6057bf7b7ee7c58e6a39ead02eaa13b75f908c2&end=c02371c442f811878ab3a0f5a813402b6dfd45d2&stat=instructions:u is a similar case where the regression consistently shows up across the many-assoc-items benchmark -- all of check,debug,opt are regressed.
>
> It was judged "probably" rather than definitely, but seems like it should have just been in the mixed category.

Hmm... this seems pretty borderline to me. I agree it's probably worth a look, but the categorization of probably relevant seems correct over definitely relevant.

Mark-Simulacrum commented 3 years ago

> https://perf.rust-lang.org/compare.html?start=d14731cb3ced8318d7fc83cbe838f0e7f2fb3b40&end=f03eb6bef8ced8a243858b819e013b9caf83d757&stat=instructions:u
>
> Did this catch your eye because all of the improvements are in incremental?

No, await-call-tree across 3 profiles. The incremental improvements are less 'interesting' typically, particularly when looking at incr-unchanged benchmarks -- those just have less going on so smaller changes are more likely to show up as significant.

Mark-Simulacrum commented 3 years ago

> Hmm... this seems pretty borderline to me. I agree it's probably worth a look, but the categorization of probably relevant seems correct over definitely relevant.

Hm -- so maybe our understanding is different, but the way I see it, if we expect a human to go investigate, then it should be autocategorized as definitely. If that investigation is short, fine, but right now we would have missed this result without the perf triage reports (if we limited ourselves to perf-regression labels), which seems unfortunate.

rylev commented 3 years ago

Our understanding of these categories is definitely different. It seems you have a "when in doubt, it should be in triage" perspective. The original purpose of these categories was to distinguish comparisons where we were sure that a performance change had happened (and we just need a human to investigate the cause and determine whether the code change in question is worth accepting the performance penalty) from changes where we're unsure (although fairly confident) whether an actual performance change happened or it's just noise (and we need a human to first determine that the change is real and not noise). This is why the names are "definitely relevant" and "probably relevant" -- they describe this distinction pretty well.

In particular, originally I purposefully created a higher bar for triage than for perf-regression labels. This is why anything that is labeled as "probably relevant" or "definitely relevant" gets a perf-regression label but only "definitely relevant" changes get included in triage. Triage was to be reserved for changes where we are sure that there is an actual performance change.

It sounds like perhaps you're advocating for getting rid of the "probably relevant" distinction (since we've gotten relatively good at filtering out noise), so that we only distinguish between "definitely relevant" and "maybe relevant" -- the former we label with perf-regression AND note in the triage report, and the latter we simply ignore (as we currently do with such cases). Thoughts?
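The current scheme versus the proposed two-bucket scheme can be summarized as a small mapping. This is an illustrative sketch; the type names and function signatures are assumptions, not rustc-perf's actual API.

```rust
/// Relevance categories discussed in this thread.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Category {
    Definitely,
    Probably,
    Maybe,
}

/// Returns (gets perf-regression label, included in triage report)
/// under the current scheme: "probably" is labeled but kept out of triage.
fn current_scheme(c: Category) -> (bool, bool) {
    match c {
        Category::Definitely => (true, true),
        Category::Probably => (true, false),
        Category::Maybe => (false, false),
    }
}

/// Under the proposed scheme, "probably" disappears: anything worth
/// acting on is "definitely" (label + triage), and everything else
/// is "maybe" and is simply ignored.
fn proposed_scheme(c: Category) -> (bool, bool) {
    match c {
        Category::Definitely => (true, true),
        _ => (false, false),
    }
}

fn main() {
    assert_eq!(current_scheme(Category::Probably), (true, false));
    assert_eq!(proposed_scheme(Category::Probably), (false, false));
    assert_eq!(proposed_scheme(Category::Definitely), (true, true));
    println!("ok");
}
```

The only behavioral difference between the two schemes is what happens to "probably": today it gets a label without triage, while the proposal drops it into the ignored bucket.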

rylev commented 3 years ago

Based on a conversation with @Mark-Simulacrum it sounds like we might want to do the following (please correct me if I'm misremembering something):