microsoft / MSBuildSdks

MSBuild project SDKs
MIT License
459 stars 82 forks source link

Tests are not truly parallelized when run with Traversal project #488

Open batkaevruslan opened 1 year ago

batkaevruslan commented 1 year ago

Hi! For the same set of test projects, tests that are run from .sln file are executed faster than tests, that are run from traversal project file.

How it looks in Process Explorer: For a solution with dozens of test projects, when tests are run from traversal project, they are split among nodes, but when a node finishes its tests the node does not "steal" work from other nodes. As a result, one node does the most part of the work. image

How to reproduce: See instructions in https://github.com/batkaevruslan/dotCoverParallelizationIssue (ignore repo name, as at first we thought it's dotCover fault). We tested it on several machines\OS with sdk 7.0.401 and got similar results:

~/Repos/batkaevruslan/dotCoverParallelizationIssue (main*) » /usr/bin/time -h dotnet test dotCoverParallelizationIssue.sln -maxcpucount:2
Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 1 s - Tests1.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 1 s - Tests3.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 1 s - Tests4.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 1 s - Tests5.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 10 s - Tests6.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 30 s - Tests2.dll (net7.0) 31.71s real 6.47s user 1.51s sys

~/Repos/batkaevruslan/dotCoverParallelizationIssue (main*) » /usr/bin/time -h dotnet test affected.proj -maxcpucount:2 -v:q
Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 1 s - Tests1.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 1 s - Tests4.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 10 s - Tests6.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 30 s - Tests2.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 1 s - Tests3.dll (net7.0) Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: 1 s - Tests5.dll (net7.0) 35.04s real 6.28s user 1.70s sys

Pay attention to the second run: Tests3.dll and Tests5.dll could have been executed in parallel with Tests2.dll, but they haven't.

jeffkl commented 1 year ago

I believe what's happening is that when you run dotnet test against a solution file, it schedules the build differently. It appears that the VSTest target for dotnet test doesn't depend on Build so the build of each project happens as each test run is scheduled. In some scenarios that might be better and other scenarios it would be slower.

I cloned the repo and ran dotnet test /m:2 against the solution file and the traversal project:

Solution File Traversal project
image image

Technically both executions are doing the same amount of work, just scheduled differently. To get traversal to match the behavior, you can build first with dotnet build and then run dotnet test --no-build:

image

Now the test execution is identical. That will work if you are okay with building first and executing tests separately.

The VSTestDependsOn property is invoking Build first here: https://github.com/microsoft/MSBuildSdks/blob/486c3e01da8bb224d8c9baeea818b88f72b829a7/src/Traversal/Sdk/Traversal.targets#L80-L82

Currently, the only way to override it is by doing so in a custom after Traversal target:

Directory.Build.props

<Project>
  <PropertyGroup>
    <CustomAfterTraversalTargets>$(MSBuildThisFileDirectory)\Custom.After.Traversal.targets</CustomAfterTraversalTargets>
  </PropertyGroup>
</Project>

Custom.After.Traversal.targets:

<Project>
  <PropertyGroup>
    <VSTestDependsOn Condition="'$(UsingMicrosoftTraversalSdk)' == 'true'"></VSTestDependsOn>
  </PropertyGroup>
</Project>

We could add a property to set so that Traversal doesn't call Build first if necessary.

batkaevruslan commented 1 year ago

@jeffkl , thanks for quick response! After reading your comment, I realized that my repro project is not complete. A month ago, we noticed problems while running tests with dotCover. At first, we thought that it is dotCover problem and create issue for JB team, and then we thought that it is traversal project problem and created current thread issue.

Please see attached to https://youtrack.jetbrains.com/issue/DCVR-12167/dotCover-and-parallel-multiple-test-assemblies-execution video: first 10 secs and last 60 is enough to see that we do specify --no-build and that only one node is working at the end.

Today I was not able to reproduce the same problem behavior with --no-build parameter on my repro project and even for our product solution. The last fact makes me think that it is some kind of flaky behavior.

jeffkl commented 1 year ago

The Traversal logic is doing simple MSBuild scheduling by calling the <MSBuild /> task with the specified projects and indicating they should be built in parallel. MSBuild then takes over and schedules project builds on an available node. I highly doubt that there's something wrong with Traversal and instead would recommend looking around https://github.com/dotnet/msbuild/issues for an answer. If something is found to be wrong with Traversal, we'll of course be happy to dig deeper but since Traversal is just calling into MSBuild I think you'll find more answer there.