github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License
7.3k stars 1.46k forks source link

Using binary logs to significantly increase CodeQL analysis performance for C# #16346

Open jaredpar opened 2 months ago

jaredpar commented 2 months ago

The nature of the C# driver means that it needs to see all csc invocations for a build. Today that is achieved by disabling shared compilation during build. That unfortunately creates a significant performance penalty when building C# code. The larger the repo the more significant the penalty. The rule of thumb is that for solutions of at least medium size this will cause somewhere from a 3-4X build slow down. For larger repositories this can be even bigger.

Consider as a concrete example the dotnet/runtime repository. When their CodeQL pipeline runs and shared compilation is disabled it increases their build time by 1 hour and 38 minutes. Or alternatively building the dotnet/roslyn repository. Building that locally results in a ~4X perf penalty when shared compilation is disabled.

I understand from previous conversations that this is done because the extractor needs to see every C# compiler call in the build and potentially process it. An alternative approach to doing this is to have the build produce a binary log. That is a non-invasive change to builds and there is ample tooling for reading C# compiler calls from those once the build completes. The code for reading compiler calls is at the core very straight forward:

using var fileStream = new FileStream(binlogPath, FileMode.Open, FileAccess.Read, FileShare.Read);
List<CompilerCall> compilerCalls = BinaryLogUtil.ReadAllCompilerCalls(fileStream);

I have a sample here of integrating this approach into the existing extractor tool. On my machine I was able to use this sample to run over a full build of roslyn without any obvious issues. And most importantly, no changes to how I built roslyn :smile:

Using libraries like this have other advantages because you could also offload the work of creating CSharpCompilation objects. There is code in that library that will very faithfully recreate a CSharpCompilation for every CompilerCall instance. Furthermore, think there are other optimization opportunities for the extractor once you are running the analysis in bulk: caching MetadataReferences, caching SourceText, etc ...

More than happy to chat about this, the different ways binary logs can be used, moving analysis off machine, etc ... My big end goal is to get back to us using shared compilation so we can keep our build times down.

rainersigwald commented 2 months ago

The MSBuild team is also happy to support here.

tamasvajk commented 1 month ago

Thank you for putting together the sample. I think your suggestion makes sense, I'm exploring it a bit further to understand what else would need to be changed.

One point that is not immediately clear to me is how to get hold of the syntax trees produced by source generators. Is there a Roslyn API that would return these trees?

jaredpar commented 1 month ago

One point that is not immediately clear to me is how to get hold of the syntax trees produced by source generators. Is there a Roslyn API that would return these trees?

The default experience is that generated files are embedded into the PDB produced by the build. There is no official API for this but it's in a well known part of the PDB. The code for reading that is here.

If the goal is to just read these files for a given compilation on demand then it's a fairly simple change to expose an API directly that lets you access them. It would be something along the lines of

List<SyntaxTree> ReadAllGeneratedFiles(CompilerCall compilerCall);

There is one caveat for that approach though: it only works if the build is using portable PDBs. Native PDBs don't have a facility for that. Using native PDbs with source generators though is decently rare. Seems reasonable that for a CodeQL run you could ask that portable PDBs be generated.

There is code in that library that will very faithfully recreate a CSharpCompilation for every CompilerCall instance.

Curious if you're asking about generated files in this context: essentially how can CSharpCompilation be faithfully generated if there are generators involved?

If so the underlying library takes a couple of different approaches:

  1. It just loads the analyzers / generators into memory and runs them exactly as the build did. In that case the generated files are produced when they're asked for.
  2. It just reads the generated files from the PDB and puts them into the correct place in the Compilation. In this mode no analyzer or generator is run.
tamasvajk commented 1 month ago

Thanks for the explanation.

If the goal is to just read these files for a given compilation on demand then it's a fairly simple change to expose an API directly that lets you access them.

Yes, this is the goal. BTW, what is the reason that Roslyn doesn't expose an API on CSharpCompilation to run these source generators and access the trees?

Curious if you're asking about generated files in this context: ...

We're injecting the following properties into the build command:

We're adding MvcBuildViews=true to precompile ASP.NET MVC views. Our current solution intercepts csc calls, and as a result we also inspect these precompilation calls. I'm not sure what information we'd have in the binlog about these. (But this is probably not critical to us.) We inject EmitCompilerGeneratedFiles=true to be able to inspect source generator produced files. We're mostly interested in the files generated from .cshtml views. This is the main reason I was asking about source generators.

jaredpar commented 1 month ago

Yes, this is the goal.

Okay will add an API for that

https://github.com/jaredpar/complog/issues/125

BTW, what is the reason that Roslyn doesn't expose an API on CSharpCompilation to run these source generators and access the trees?

The underlying APIs for running the generators are available. Can see how compiler logs approaches that problem here. That provides the Compilation after generators execute and from there you can see the generated SyntaxTree values. They are always added to the end of the SyntaxTree list so they can be seen by doing effectively the following:

Compilation compilation = ...;
var driver = CreateGeneratorDriver();
driver.RunGeneratorsAndUpdateCompilation(compilation, out var compilation2, out var diagnostics, cancellationToken);
var generatedSyntaxTrees = compilation2.SyntaxTrees.Skip(compilation.SyntaxTrees.Count());

That does require the host to write a bit of code though to load the analyzers, run the driver, etc ... The reason there isn't a higher level API that just takes care of all of this for you is that it's hard to find a single solution that fits all hosts. In particular analyzer loading is very complex and as a result different hosts tend to make different decisions on how to approach it. Consider that even within the Roslyn compiler analyzers are loaded with three different strategies depending on the scenario.

The reason the NuPkg I used in the sample has easy APIs for creating Compilations, running generators, etc ... is that it's geared for a specific usage: analyzing compilations. That makes it easier to have an opinionated API on how analyzers should be loaded. It also provides the option of just not running them at all and throwing the generated source directly into the compilation.

We're adding MvcBuildViews=true to precompile ASP.NET MVC views. Our current solution intercepts csc calls, and as a result we also inspect these precompilation calls

I'm not as familiar with this scenario. I attempted to recreate it with a .NET Framework ASPNet application. Setting that property I do see it go into a MvcBuildViews target but that runs the aspnet_compiler vs. csc.

image

Are you all intercepting that call? Or is there a subtly I missed where this also generates a separate csc call. There is a csc invocation but I see that with our without the -p:MvcBuildViews

We're mostly interested in the files generated from .cshtml views. This is the main reason I was asking about source generators.

Yep those are just a normal source generator at the end of the day and should behave like other generators in terms of finding the files.

tamasvajk commented 1 month ago

I'm not that familiar with aspnet_compiler.exe either, but I'm under the impression that it's a thin wrapper on top of some codedom based code generator, and it uses csc to compile the generated files. I think aspnet_compiler.exe is calling bin\roslyn\csc.exe in the project directory. These are the arguments that I see for a specific call:

# Arguments to Roslyn: /shared /keepalive:10 /noconfig /fullpaths /warnaserror-
/t:library /utf8output /nostdlib+ /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\mscorlib.dll" /R:"C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.Runtime\v4.0_4.0.0.0__b03f5f7f11d50a3a\System.Runtime.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Configuration.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ServiceModel.Activation.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Xml.Linq.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\12594037\00c8d184_3aaece01\Antlr3.Runtime.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.Services.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Activities.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\25149f25\124d5d3d_5b7fda01\WebApplication2.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\Microsoft.CSharp.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\46885272\00b3fcae_3a18cf01\WebGrease.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.Extensions.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\167affe1\00a1802f_db0bd501\AspNet.ScriptManager.jQuery.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.Entity.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\c1742244\00a6d731_3527cf01\System.Web.Optimization.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ComponentModel.DataAnnotations.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_sxfooglg.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\9af866f8\008c3534_9ff8d401\Newtonsoft.Json.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\544178f8\0096b2ab_eaaace01\Microsoft.ScriptManager.MSAjax.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Data.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_global.asax.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\251878ba\0077a8b1_eaaace01\Microsoft.ScriptManager.WebForms.dll" /R:"C:\Windows\Microsoft.Net\assembly\GAC_MSIL\Microsoft.VisualStudio.Web.PageInspector.Loader\v4.0_1.0.0.0__b03f5f7f11d50a3a\Microsoft.VisualStudio.Web.PageInspector.Loader.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.WorkflowServices.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\6c53e78c\003be294_1a72d801\Microsoft.Web.Infrastructure.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\51b5a119\00f96fa1_2a45d401\Microsoft.CodeDom.Providers.DotNetCompilerPlatform.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.EnterpriseServices.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Drawing.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Runtime.Serialization.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Xml.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ServiceModel.Web.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\73ffb031\00d1c97e_3d8dce01\Microsoft.AspNet.FriendlyUrls.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.ApplicationServices.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\aeb29fab\00a6d731_3527cf01\Microsoft.AspNet.Web.Optimization.WebForms.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ServiceModel.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.IdentityModel.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Data.DataSetExtensions.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ServiceModel.Activities.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Core.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.DynamicData.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.dll" /out:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.dll" /debug- /optimize+ /win32res:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\fip03p34.res" /warnaserror- /w:4 /nowarn:1659;1699;1701;612;618 /langversion:default /nowarn:1659;1699;1701  "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.0.cs" "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.1.cs" "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.2.cs" "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.3.cs" "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.4.cs"
tamasvajk commented 3 weeks ago

@jaredpar I started experimenting with binlog support. Here's a variant that supports source generators and another variant that shamelessly takes your sample. The latter requires fewer changes in the extractor and it's more straightforward to understand. Also, I think for your usecase with dotnet/roslyn, there's no need for source generator support, so we might go with this option in the beginning.

This is how a codeql database can be created from the binlog:

dotnet build /t:rebuild /bl:xyz.binlog
codeql database create DB_xyz --language=csharp --build-mode=none -Obinlog=xyz.binlog

I'm testing this variant on some larger repos and comparing the alerts reported by the tracing extractor to the binlog based ones. The binlog file needs to be manually passed to codeql, which makes large scale testing difficult.

What is the long term plan for https://github.com/jaredpar/complog/? Is it going to be moved to the dotnet or microsoft org? Is it going to become an officially maintained product? Are there any VS features that depend on it?

jaredpar commented 2 weeks ago

Here's a https://github.com/github/codeql/pull/16581 that supports source generators and another https://github.com/github/codeql/pull/16672 that shamelessly takes your sample.

Walked through both of those changes and they look good to me :smile:

I'm testing this variant on some larger repos and comparing the alerts reported by the tracing extractor to the binlog based ones.

Very interested to hear how this works out.

What is the long term plan for https://github.com/jaredpar/complog/? Is it going to be moved to the dotnet or microsoft org? Is it going to become an officially maintained product? Are there any VS features that depend on it?

At the moment it's a tool that the C# compiler team maintains. It started as a personal project which is why it exists in my personal GitHub repo. The dotnet org already consumes the tool in a few places. But we'd hadn't moved it fully into the dotnet org because it wasn't on any production path yet. If you all ended up taking a dependency on this in codeql then we'd very likely move this into the dotnet org.

tamasvajk commented 1 week ago

Also, I think for your usecase with dotnet/roslyn, there's no need for source generator support, so we might go with this option in the beginning.

My assumption was wrong. Roslyn uses source generators in Microsoft.CodeAnalysis.CSharp. So, I decided to continue with the variant that runs source generators. Here's the latest iteration: https://github.com/github/codeql/pull/16747.

KirillOsenkov commented 1 week ago

@tamasvajk your injecting EmitCompilerGeneratedFiles into build pipelines absolutely breaks production builds (at least for us in Azure Data)

c:\program files\microsoft visual studio\2022\enterprise\msbuild\current\bin\Roslyn\Microsoft.Managed.Core.targets(336,5): error : EmitCompilerGeneratedFiles was true, but no CompilerGeneratedFilesOutputPath was provided. CompilerGeneratedFilesOutputPath must be set in order to emit generated files. [D:\a\_work\1\s\Platform\...\MyProj.sqlproj]
KirillOsenkov commented 1 week ago

You guys should really be injecting /p:CompilerGeneratedFilesOutputPath=Generated as well

jaredpar commented 5 days ago

So, I decided to continue with the variant that runs source generators

Is there a reason why you have to run generators again vs. just looking at the generated files from the build that created the binary log? The compiler log code makes those easy to extract, assuming portable PDB.

jaredpar commented 5 days ago

You guys should really be injecting /p:CompilerGeneratedFilesOutputPath=Generated as well

Curious: why do you all need these files? If you're re-running generators then the generated files will be in the Compilation object. Trying to understand why also having them on disk is beneficial.

tamasvajk commented 5 days ago

@tamasvajk your injecting EmitCompilerGeneratedFiles into build pipelines absolutely breaks production builds (at least for us in Azure Data)

You guys should really be injecting /p:CompilerGeneratedFilesOutputPath=Generated as well

Yes, in certain cases, injecting EmitCompilerGeneratedFiles into the build command can break the build. Using the binary logs for extraction would/will solve this problem. At the same time, I don't think this issue should be used to discuss this problem. As far as I see, it's already causing some confusion. (I think we should continue discussing the EmitCompilerGeneratedFiles problem in the already started teams chat, ICM issue, or a new issue here in github/codeql.)

So, I decided to continue with the variant that runs source generators

Is there a reason why you have to run generators again vs. just looking at the generated files from the build that created the binary log? The compiler log code makes those easy to extract, assuming portable PDB.

I continued with the variant in https://github.com/github/codeql/pull/16747 that also processes the output of source generators (in contrast to https://github.com/github/codeql/pull/16672, which doesn't). We get the generated ASTs from compilationData.GetCompilationAfterGenerators(). As far as I understand, GetCompilationAfterGenerators is actually running the source generators in this case. Is there a better/another way of getting to the generated ASTs from an input binlog file?

You guys should really be injecting /p:CompilerGeneratedFilesOutputPath=Generated as well

Curious: why do you all need these files? If you're re-running generators then the generated files will be in the Compilation object. Trying to understand why also having them on disk is beneficial.

I think there's a mix of contexts here. @KirillOsenkov is suggesting that we should be injecting the /p:CompilerGeneratedFilesOutputPath=Generated property into the build command with our normal autobuild/tracing solution. The tracing solution is not running any of the source generators, instead it adds the generated files to the compiled sources, see here. In this context we do need to force the /p:EmitCompilerGeneratedFiles=true so that the files are written to the disk. @KirillOsenkov found that in certain cases /p:EmitCompilerGeneratedFiles=true is not enough, and we should be also adding /p:CompilerGeneratedFilesOutputPath=Generated too.

In the context of binary log extraction, we don't need any of the injected properties. Actually, with the binary log based extractor, we're not even starting our tracer as there's no build command that we'd want to trace.

jaredpar commented 4 days ago

I think we should continue discussing the EmitCompilerGeneratedFiles problem in the already started teams chat,

Is this a teams chat that I can join? Always interested in how adding properties to the build impact the results unexpectedly.

Is there a better/another way of getting to the generated ASTs from an input binlog file?

More recent versions of Basic.CompilerLog.Util have a method BinaryLogReader.ReadAllGeneratedFiles. That method will just look into the portable PDB and read out the generated files without re-running them.

It only works when portable PDBs are used as native PDBs don't support this function. That is how the vast majority of compilations work though. Could possibly save a bit of time by having that as the fast path and falling back to the GetCompilationAfterGenerators if it's not a portable PDB.

tamasvajk commented 3 days ago

@jaredpar I briefly looked into using BinaryLogReader.ReadAllGeneratedFiles. I see that I could use it instead of looking for the generated files in compilerArguments.GeneratedFilesOutputDirectory. This would play nicely with the approach in https://github.com/github/codeql/pull/16672. But is there a way to incorporate this in https://github.com/github/codeql/pull/16747, which relies on Basic.CompilerLog.Util to retrieve the CSharpCompilation objects? Is there a Basic.CompilerLog.Util API that takes the output of ReadAllGeneratedFiles and produces an "updated" compilation from it?