Number of syntax trees in project sometimes zero

duncanawoods commented 4 years ago

I am often and unpredictably finding no syntax trees in a project. The solution has 10 projects and on around half the executions, 2 or 3 of the projects (different each time) will return no classes.

public static async Task<string> ParseInfo(string solutionFile)
{
    var manager = new AnalyzerManager(solutionFile);
    var workspace = manager.GetWorkspace();

    foreach (var project in workspace
        .CurrentSolution
        .Projects)
    {
        var compilation = await project.GetCompilationAsync();
        var count = 0;

        foreach (var tree in compilation.SyntaxTrees)
        {
            count +=
                tree.GetRoot()
                    .DescendantNodes()
                    .OfType<ClassDeclarationSyntax>()
                    .Count();
        }

        Console.WriteLine($"Project: {project.Name} Class Count: {count}");
    }
}

An example sequence of analyses might find this many total classes depending on which projects failed to produce any syntax trees:

Run	Class count
0	854
1	854
2	854
3	854
4	727
5	831
6	693
7	854
8	854
9	853
10	840
11	854
12	854
13	854
14	854
15	831
16	854
17	854
18	854
19	854

BuildAlyzer 2.6.0
Ubuntu 18.04
.NET Core SDK (3.1.202)

duncanawoods commented 4 years ago

Ok looks like the same thread safety issue as #134. It's fixed by disabling parallel builds.

I have identified one issue so far. AnalyzerResults is updated from multiple threads and has no concurrency support. I have updated it from Dictionary to ConcurrentDictionary but this has not solved all issues.

duncanawoods commented 4 years ago

@daveaglick

The cause is that AnalyzerResult.ProcessCscCommandLine is never called because EventProcessor.MessageRaised is never called.

I believe there is a race condition between the AnonymousPipeLoggerServer reading from the buffer and it's CancellationTokenSource being triggered when the build process exits. I think that the buffer reading is cancelled before it has completed.

It appears that it isn't actually necessary to trigger CancellationTokenSource at all. The buffer reading will terminate on it's own when the pipe is closed so the Process.Exited event can just be removed.

private IAnalyzerResults BuildTargets(BuildEnvironment buildEnvironment, string targetFramework, string[] targetsToBuild, AnalyzerResults results)
{
    using (CancellationTokenSource cancellation = new CancellationTokenSource())
    {
        using (AnonymousPipeLoggerServer pipeLogger = new AnonymousPipeLoggerServer(cancellation.Token))
        {
            using (EventProcessor eventProcessor = new EventProcessor(Manager, this, BuildLoggers, pipeLogger, results != null))
            {
                // Run MSBuild
                int exitCode;
                string fileName = GetCommand(buildEnvironment, targetFramework, targetsToBuild, pipeLogger.GetClientHandle(), out string arguments);
                using (ProcessRunner processRunner = new ProcessRunner(fileName, arguments, Path.GetDirectoryName(ProjectFile.Path), GetEffectiveEnvironmentVariables(buildEnvironment), Manager.LoggerFactory))
                {
                    processRunner.Exited = () => cancellation.Cancel(); // <--- RACE CONDITION
                    processRunner.Start();
                    pipeLogger.ReadAll();                               // <----
                    processRunner.WaitForExit();
                    exitCode = processRunner.ExitCode;
                }

                // Collect the results
                results?.Add(eventProcessor.Results, exitCode == 0 && eventProcessor.OverallSuccess);
            }
        }
    }
    return results;
}

When I comment out the cancellation, I now see reliable results.

daveaglick commented 4 years ago

When I comment out the cancellation, I now see reliable results.

Awesome! Thanks for tracking this down to a root cause - when I initially tried to reproduce I couldn't seem to get the same results so I put it on hold while working on other stuff and intended to come back later. Glad you beat me to it :)

I'll try my best to apply this as a fix and get out a new version this weekend or early next week.

duncanawoods commented 4 years ago

Thanks @daveaglick

i) also see #141 for a fix for that one.

ii) Changing AnalyzerResults to use a ConcurrentDictionary might fix the missing projects seen in #134.

I initially tried to reproduce I couldn't seem to get the same results

iii) To help replicate this, I generated a liltte script to create big solutions with lots of tiny projects and I was able to see lots of failures. I only need a dozen or so projects:

var sln = $"bigtest";

Console.WriteLine($"dotnet new sln -n {sln}");

for (int i = 0; i < 50; i++)
{    
    Console.WriteLine($"dotnet new console -n Proj{i}");
    Console.WriteLine($"dotnet sln ./{sln}.sln add ./Proj{i}/Proj{i}.csproj");
}

daveaglick commented 4 years ago

I'm going to add a test solution using your script to the test suite and it will be awesome :)

daveaglick commented 4 years ago

This test is now consistently passing for me with a 50 project solution:

Hopefully that's an indication this is resolved. Thanks again for the fantastic work narrowing down the multiple concurrency issues here @duncanawoods.

daveaglick commented 4 years ago

Buildalyzer 3.0.1 is now rolling out to NuGet and should (hopefully) resolve this issue. When you get a chance can you please verify that this problem is resolved? Thanks!

duncanawoods commented 4 years ago

Hi @daveaglick

I have had a chance to try it out. When noodling around, I was seeing it around 5% of the time. I have also seen some cases where a typically 15s operation took 10 minutes to complete.

I created a batch test with better data collection to share some hard stats with you but this didn't even appear (!) and I saw some different errors.

Test setup

a process that ran an analysis of a 14 project solution with 863 classes
Ubuntu 18.04
run 20 processes at a time
repeat 20 times

Result	Number	%
Success	385	96.25%
Could not find build environment	15	3.75%
Missing classes	0	0

Observations

I don't know what environmental conditions might increase/decrease the likelihood of this specific issue
Even though they are separate processes, the tests within a batch do not seem entirely independent. The "Could not find build environment" appeared in groups of 5 or so.
Trying to do 100 process batches led to lots of errors about "closed pipes". Must be some system resource thing but I am surprised on a 12C/24T, 64GB RAM

Conclusions

I think that while we have improved things, there are still gremlins in the concurrency code and given their rarity we are going to have a terrible time replicating let alone fixing them.

In this sort of situation my approach would be to take a clean-slate approach the concurrency design. Right now pretty much every form of parallelism is used: sub-processes, threads, async/await, TPL loops and concurrent data-structures. One concern is that peppering code with concurrent data-structures is insufficient for operations that touch multiple data-structures. What needs to happen is ensure that entire operations are isolated from each other instead of just fine-grained access to specific data. To make concurrency comprehensible and checkable for errors, I would aim to have the majority of code unaware of concurrency concerns and push all the orchestration up to the top level with crystal clear guarantees about who is accessing what, when things are started/terminated.

phmonte / Buildalyzer