Open tangxifan opened 1 year ago
@Tulong4Dev Can you provide a summary of all the QoR failures? We would like to identify how many of them are actual performance degradations.
With TBB disabled on the Google runners, all regression tests pass without QoR differences: https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/4695120147
Also, after setting the right tbblib on the CI machines, we can see that TBB is running and all tests are green.
Conclusion: TBB is the root cause of all QoR failures. @vaughnbetz Please suggest which direction we should take: (a) disable TBB on the Google runners; or (b) enable TBB and analyze/update the QoR results.
Short answer: I think turning off TBB for now is a good solution then.
Long answer: This is a bit strange. The CPU time differences could be due to TBB having some overhead to start up threads, which is a net loss on very small designs. But the fact that a few failures caused by other differences (a different routing channel width, etc.) also disappeared implies there is more to it than that -- we're getting different results with TBB on vs. off. That shouldn't be the case (the timing analyzer should get the same answer with and without parallelism, and it's the only parallel part of VPR right now). @duck2 : you've been working with TBB -- do we get different results with and without TBB? I think we need an issue to track that and you're a natural person to own it.
Adding @duck2 to this one as well. Fahri, it seems TBB is causing QoR changes in some cases. Are we getting different results with TBB on vs. off? We should get the same answer, just with a different CPU time.
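To make the "same answer, just with a different CPU time" check concrete, here is a minimal sketch of how two QoR result sets (TBB on vs. off) could be compared. It is not VTR's actual QoR checker: the function `diff_qor` and the metric names (`cpu_time_s`, `min_chan_width`, `critical_path_ns`) are hypothetical and chosen only for illustration. The idea is to separate runtime-only differences, which are expected, from result differences, which should not occur.

```python
from typing import Dict, List, Tuple

# Hypothetical metric names, for illustration only; real QoR files may
# use different keys. Metrics listed here are allowed to differ.
RUNTIME_METRICS = {"cpu_time_s"}

Diff = Tuple[str, float, float]


def diff_qor(
    tbb_on: Dict[str, float], tbb_off: Dict[str, float]
) -> Tuple[List[Diff], List[Diff]]:
    """Split metric differences into (runtime-only, result) lists.

    Only metrics present in both runs are compared.
    """
    runtime_diffs: List[Diff] = []
    result_diffs: List[Diff] = []
    for name in sorted(tbb_on.keys() & tbb_off.keys()):
        on_value, off_value = tbb_on[name], tbb_off[name]
        if on_value == off_value:
            continue
        entry = (name, on_value, off_value)
        if name in RUNTIME_METRICS:
            runtime_diffs.append(entry)   # expected: threading overhead
        else:
            result_diffs.append(entry)    # unexpected: the answer changed
    return runtime_diffs, result_diffs


if __name__ == "__main__":
    # Made-up numbers, purely to show the classification.
    on = {"cpu_time_s": 12.5, "min_chan_width": 42, "critical_path_ns": 3.1}
    off = {"cpu_time_s": 10.0, "min_chan_width": 40, "critical_path_ns": 3.1}
    runtime_diffs, result_diffs = diff_qor(on, off)
    print("runtime-only differences:", runtime_diffs)
    print("result differences:", result_diffs)
```

Any entry in the second list would be a case where TBB changes the result itself rather than only the CPU time, which is what the longer-term investigation needs to explain.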
Thanks @Tulong4Dev for the research. @vaughnbetz Should we turn off TBB for now, since a green CI is our short-term and immediate goal? We can study the impact of TBB in other PRs. Waiting for your decision. Thanks!
Yes, let's turn off TBB for now and figure out the TBB problem later. This issue can track the longer term investigation.
Expected Behaviour
QoR checks should always pass on any PR to be merged and on the master branch.
Current Behaviour
However, due to an emergency migration to Ubuntu 22.04 (see details in #2257), some QoR failures are seen on the master branch:
All the QoR failures are seen on custom CI runners, while none of them are on GitHub-hosted runners.
Possible Solution
We should resolve the QoR failures. Here is my suggestion:
Steps to Reproduce
Context
Your Environment