QoR failures due to migration to Ubuntu 22.04

tangxifan commented 1 year ago

Expected Behaviour

QoR checks should always pass on any PR to be merged and on the master branch.

Current Behaviour

However, because of the emergency migration to Ubuntu 22.04 (see details in #2257), some QoR failures are seen on the master branch:

vtr_reg_nightly_test1_odin
vtr_reg_nightly_test2_odin
vtr_reg_nightly_test1
vtr_reg_nightly_test2
vtr_reg_strong
vtr_reg_strong_odin

All the QoR failures occur on the custom CI runners; none occur on the GitHub-hosted runners.

Possible Solution

We should resolve the QoR failures. Here are my suggestions:

Review each failure and log any QoR degradation.
For any QoR degradation, we should discuss and see if it can be waived or not.
For any QoR improvement, we can simply update the related golden_results.txt so that the QoR check passes.
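
The triage suggested above could be sketched roughly as follows. This is only an illustration: the metric names, the assumption that every metric is "lower is better", and the 5% tolerance are hypothetical, and the sketch does not reproduce the checks that VTR's own QoR scripts perform.

```python
def classify_metric(golden, actual, tolerance=0.05):
    """Classify one non-negative, lower-is-better metric against its golden value.

    The 5% default tolerance is an assumption made for this sketch.
    """
    if golden == 0:
        # Avoid dividing by zero; any nonzero value counts as a degradation here.
        return "unchanged" if actual == 0 else "degradation"
    relative_change = (actual - golden) / golden
    if relative_change > tolerance:
        return "degradation"
    if relative_change < -tolerance:
        return "improvement"
    return "unchanged"


def triage(golden_results, actual_results, tolerance=0.05):
    """Group metric names by verdict so degradations can be discussed separately."""
    report = {"degradation": [], "improvement": [], "unchanged": []}
    for name, golden in golden_results.items():
        verdict = classify_metric(golden, actual_results[name], tolerance)
        report[verdict].append(name)
    return report
```

Metrics that land in "degradation" would be the ones to discuss and possibly waive; those in "improvement" are candidates for a golden_results.txt update.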

Steps to Reproduce

Git clone current master
Run regression tests:

./run_reg_test.py <TEST_NAME>

vtr_reg_nightly_test1_odin
vtr_reg_nightly_test2_odin
vtr_reg_nightly_test1
vtr_reg_nightly_test2
vtr_reg_strong
vtr_reg_strong_odin
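
To run all six suites in one go, a small wrapper along these lines might help. It is a sketch under two assumptions that this issue does not confirm: that it is launched from the directory containing run_reg_test.py, and that the script exits with a non-zero status when a suite fails.

```python
import subprocess

# Suite names copied from the list above.
FAILING_SUITES = [
    "vtr_reg_nightly_test1_odin",
    "vtr_reg_nightly_test2_odin",
    "vtr_reg_nightly_test1",
    "vtr_reg_nightly_test2",
    "vtr_reg_strong",
    "vtr_reg_strong_odin",
]


def build_commands(suites=FAILING_SUITES, script="./run_reg_test.py"):
    """Return one argument list per suite, in the form './run_reg_test.py <TEST_NAME>'."""
    return [[script, suite] for suite in suites]


def run_all(commands):
    """Run each command in turn and return the suites that exited non-zero.

    Treating a non-zero exit status as a failure is an assumption of this sketch.
    """
    failed = []
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            failed.append(command[1])
    return failed
```

Calling `run_all(build_commands())` would then return the names of the suites that failed.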

Context

Your Environment

VTR revision used: ab2e17f
Operating System and version: Ubuntu 22.04
Compiler version: GCC 11

tangxifan commented 1 year ago

@Tulong4Dev Can you provide a summary of all the QoR failures? We would like to identify how many of them are actual performance degradations.

Tulong4Dev commented 1 year ago

@tangxifan Here is a summary of the latest log: https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/4646382044

vtr_reg_nightly_test1_odin.csv vtr_reg_nightly_test1.csv vtr_reg_strong_odin1.csv vtr_reg_nightly_test2_odin.csv vtr_reg_nightly_test2.csv vtr_reg_strong.csv

Tulong4Dev commented 1 year ago

If TBB is disabled on the Google runners, all regression tests pass without QoR differences: https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/4695120147

Also, after setting the right tbblib on the CI machines, we can see that TBB is running and all tests are green.

Conclusion: TBB is the root cause of all the QoR failures. @vaughnbetz Please suggest which direction we should go: a. Disable TBB on the Google runners; b. Enable TBB and analyze/update the QoR results.

vaughnbetz commented 1 year ago

Short answer: I think turning off TBB for now is a good solution then.

Long answer: This is a bit strange. The CPU time differences could be due to TBB having some overhead to start up threads, which is a net loss on very small designs. But the fact that a few failures due to a different routing channel width etc. disappeared implies there is more than that -- we're getting different results with TBB on vs. off. That shouldn't be the case (the timing analyzer should get the same answer with and without parallelism, and it's the only parallel part of VPR right now). @duck2: you've been working with TBB -- do we get different results with and without TBB? I think we need an issue to track that, and you're a natural person to own it.
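
One way to separate the two kinds of difference described here is to compare a TBB-on run against a TBB-off run and split the mismatches into runtime metrics (expected to vary) and everything else (expected to match). The sketch below assumes each run has already been parsed into a dictionary of metric values; the metric names are hypothetical and are not taken from VTR's result files.

```python
# Hypothetical names for metrics that are allowed to differ between runs.
RUNTIME_METRICS = {"pack_time", "place_time", "route_time"}


def compare_runs(tbb_on, tbb_off, runtime_metrics=RUNTIME_METRICS):
    """Split mismatching metrics into runtime differences and QoR differences.

    Returns two dicts mapping metric name -> (value with TBB, value without TBB).
    Only metrics present in both runs are compared.
    """
    runtime_diffs, qor_diffs = {}, {}
    for name in sorted(tbb_on.keys() & tbb_off.keys()):
        if tbb_on[name] == tbb_off[name]:
            continue
        target = runtime_diffs if name in runtime_metrics else qor_diffs
        target[name] = (tbb_on[name], tbb_off[name])
    return runtime_diffs, qor_diffs
```

If parallelism only affects CPU time, the second dictionary should come back empty; any entry in it, such as a different channel width, is the kind of result that needs investigating.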

vaughnbetz commented 1 year ago

Adding @duck2 to this one as well. Fahri, it seems TBB is causing QoR changes in some cases. Are we getting different results with TBB on vs. off? We should get the same answer, just with a different CPU time.

tangxifan commented 1 year ago

Thanks @Tulong4Dev for the research. @vaughnbetz Should we turn off TBB for now, since green CI is our short-term, immediate goal? We can study the impact of TBB in other PRs. Waiting for your decision. Thanks!

vaughnbetz commented 1 year ago

Yes, let's turn off TBB for now and figure out the TBB problem later. This issue can track the longer term investigation.
