Open jaredpar opened 2 months ago
The MSBuild team is also happy to support here.
Thank you for putting together the sample. I think your suggestion makes sense, I'm exploring it a bit further to understand what else would need to be changed.
One point that is not immediately clear to me is how to get hold of the syntax trees produced by source generators. Is there a Roslyn API that would return these trees?
One point that is not immediately clear to me is how to get hold of the syntax trees produced by source generators. Is there a Roslyn API that would return these trees?
The default experience is that generated files are embedded into the PDB produced by the build. There is no official API for this but it's in a well known part of the PDB. The code for reading that is here.
If the goal is to just read these files for a given compilation on demand then it's a fairly simple change to expose an API directly that lets you access them. It would be something along the lines of
List<SyntaxTree> ReadAllGeneratedFiles(CompilerCall compilerCall);
There is one caveat for that approach though: it only works if the build is using portable PDBs. Native PDBs don't have a facility for that. Using native PDbs with source generators though is decently rare. Seems reasonable that for a CodeQL run you could ask that portable PDBs be generated.
There is code in that library that will very faithfully recreate a CSharpCompilation for every CompilerCall instance.
Curious if you're asking about generated files in this context: essentially how can CSharpCompilation
be faithfully generated if there are generators involved?
If so the underlying library takes a couple of different approaches:
Compilation
. In this mode no analyzer or generator is run.Thanks for the explanation.
If the goal is to just read these files for a given compilation on demand then it's a fairly simple change to expose an API directly that lets you access them.
Yes, this is the goal. BTW, what is the reason that Roslyn doesn't expose an API on CSharpCompilation
to run these source generators and access the trees?
Curious if you're asking about generated files in this context: ...
We're injecting the following properties into the build command:
/p:UseSharedCompilation=false
/p:MvcBuildViews=true
/p:EmitCompilerGeneratedFiles=true
We're adding MvcBuildViews=true
to precompile ASP.NET MVC views. Our current solution intercepts csc calls, and as a result we also inspect these precompilation calls. I'm not sure what information we'd have in the binlog about these. (But this is probably not critical to us.) We inject EmitCompilerGeneratedFiles=true
to be able to inspect source generator produced files. We're mostly interested in the files generated from .cshtml
views. This is the main reason I was asking about source generators.
Yes, this is the goal.
Okay will add an API for that
https://github.com/jaredpar/complog/issues/125
BTW, what is the reason that Roslyn doesn't expose an API on CSharpCompilation to run these source generators and access the trees?
The underlying APIs for running the generators are available. Can see how compiler logs approaches that problem here. That provides the Compilation
after generators execute and from there you can see the generated SyntaxTree
values. They are always added to the end of the SyntaxTree
list so they can be seen by doing effectively the following:
Compilation compilation = ...;
var driver = CreateGeneratorDriver();
driver.RunGeneratorsAndUpdateCompilation(compilation, out var compilation2, out var diagnostics, cancellationToken);
var generatedSyntaxTrees = compilation2.SyntaxTrees.Skip(compilation.SyntaxTrees.Count());
That does require the host to write a bit of code though to load the analyzers, run the driver, etc ... The reason there isn't a higher level API that just takes care of all of this for you is that it's hard to find a single solution that fits all hosts. In particular analyzer loading is very complex and as a result different hosts tend to make different decisions on how to approach it. Consider that even within the Roslyn compiler analyzers are loaded with three different strategies depending on the scenario.
The reason the NuPkg I used in the sample has easy APIs for creating Compilations
, running generators, etc ... is that it's geared for a specific usage: analyzing compilations. That makes it easier to have an opinionated API on how analyzers should be loaded. It also provides the option of just not running them at all and throwing the generated source directly into the compilation.
We're adding MvcBuildViews=true to precompile ASP.NET MVC views. Our current solution intercepts csc calls, and as a result we also inspect these precompilation calls
I'm not as familiar with this scenario. I attempted to recreate it with a .NET Framework ASPNet application. Setting that property I do see it go into a MvcBuildViews
target but that runs the aspnet_compiler vs. csc.
Are you all intercepting that call? Or is there a subtly I missed where this also generates a separate csc call. There is a csc invocation but I see that with our without the -p:MvcBuildViews
We're mostly interested in the files generated from .cshtml views. This is the main reason I was asking about source generators.
Yep those are just a normal source generator at the end of the day and should behave like other generators in terms of finding the files.
I'm not that familiar with aspnet_compiler.exe
either, but I'm under the impression that it's a thin wrapper on top of some codedom based code generator, and it uses csc
to compile the generated files. I think aspnet_compiler.exe
is calling bin\roslyn\csc.exe
in the project directory. These are the arguments that I see for a specific call:
# Arguments to Roslyn: /shared /keepalive:10 /noconfig /fullpaths /warnaserror-
/t:library /utf8output /nostdlib+ /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\mscorlib.dll" /R:"C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System.Runtime\v4.0_4.0.0.0__b03f5f7f11d50a3a\System.Runtime.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Configuration.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ServiceModel.Activation.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Xml.Linq.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\12594037\00c8d184_3aaece01\Antlr3.Runtime.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.Services.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Activities.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\25149f25\124d5d3d_5b7fda01\WebApplication2.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\Microsoft.CSharp.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\46885272\00b3fcae_3a18cf01\WebGrease.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.Extensions.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\167affe1\00a1802f_db0bd501\AspNet.ScriptManager.jQuery.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.Entity.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\c1742244\00a6d731_3527cf01\System.Web.Optimization.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ComponentModel.DataAnnotations.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_sxfooglg.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\9af866f8\008c3534_9ff8d401\Newtonsoft.Json.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\544178f8\0096b2ab_eaaace01\Microsoft.ScriptManager.MSAjax.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Data.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_global.asax.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\251878ba\0077a8b1_eaaace01\Microsoft.ScriptManager.WebForms.dll" /R:"C:\Windows\Microsoft.Net\assembly\GAC_MSIL\Microsoft.VisualStudio.Web.PageInspector.Loader\v4.0_1.0.0.0__b03f5f7f11d50a3a\Microsoft.VisualStudio.Web.PageInspector.Loader.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.WorkflowServices.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\6c53e78c\003be294_1a72d801\Microsoft.Web.Infrastructure.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\51b5a119\00f96fa1_2a45d401\Microsoft.CodeDom.Providers.DotNetCompilerPlatform.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.EnterpriseServices.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Drawing.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Runtime.Serialization.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Xml.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ServiceModel.Web.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\73ffb031\00d1c97e_3d8dce01\Microsoft.AspNet.FriendlyUrls.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.ApplicationServices.dll" /R:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\assembly\dl3\aeb29fab\00a6d731_3527cf01\Microsoft.AspNet.Web.Optimization.WebForms.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ServiceModel.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.IdentityModel.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Data.DataSetExtensions.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.ServiceModel.Activities.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Core.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.DynamicData.dll" /R:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\System.Web.dll" /out:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.dll" /debug- /optimize+ /win32res:"C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\fip03p34.res" /warnaserror- /w:4 /nowarn:1659;1699;1701;612;618 /langversion:default /nowarn:1659;1699;1701 "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.0.cs" "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.1.cs" "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.2.cs" "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.3.cs" "C:\Users\TAMASVAJK\AppData\Local\Temp\Temporary ASP.NET Files\root\a76ce4fc\25014e45\App_Web_fip03p34.4.cs"
@jaredpar I started experimenting with binlog support. Here's a variant that supports source generators and another variant that shamelessly takes your sample. The latter requires fewer changes in the extractor and it's more straightforward to understand. Also, I think for your usecase with dotnet/roslyn
, there's no need for source generator support, so we might go with this option in the beginning.
This is how a codeql database can be created from the binlog:
dotnet build /t:rebuild /bl:xyz.binlog
codeql database create DB_xyz --language=csharp --build-mode=none -Obinlog=xyz.binlog
I'm testing this variant on some larger repos and comparing the alerts reported by the tracing extractor to the binlog based ones. The binlog file needs to be manually passed to codeql, which makes large scale testing difficult.
What is the long term plan for https://github.com/jaredpar/complog/? Is it going to be moved to the dotnet
or microsoft
org? Is it going to become an officially maintained product? Are there any VS features that depend on it?
Here's a https://github.com/github/codeql/pull/16581 that supports source generators and another https://github.com/github/codeql/pull/16672 that shamelessly takes your sample.
Walked through both of those changes and they look good to me :smile:
I'm testing this variant on some larger repos and comparing the alerts reported by the tracing extractor to the binlog based ones.
Very interested to hear how this works out.
What is the long term plan for https://github.com/jaredpar/complog/? Is it going to be moved to the dotnet or microsoft org? Is it going to become an officially maintained product? Are there any VS features that depend on it?
At the moment it's a tool that the C# compiler team maintains. It started as a personal project which is why it exists in my personal GitHub repo. The dotnet org already consumes the tool in a few places. But we'd hadn't moved it fully into the dotnet org because it wasn't on any production path yet. If you all ended up taking a dependency on this in codeql then we'd very likely move this into the dotnet org.
Also, I think for your usecase with dotnet/roslyn, there's no need for source generator support, so we might go with this option in the beginning.
My assumption was wrong. Roslyn uses source generators in Microsoft.CodeAnalysis.CSharp
. So, I decided to continue with the variant that runs source generators. Here's the latest iteration: https://github.com/github/codeql/pull/16747.
@tamasvajk your injecting EmitCompilerGeneratedFiles
into build pipelines absolutely breaks production builds (at least for us in Azure Data)
c:\program files\microsoft visual studio\2022\enterprise\msbuild\current\bin\Roslyn\Microsoft.Managed.Core.targets(336,5): error : EmitCompilerGeneratedFiles was true, but no CompilerGeneratedFilesOutputPath was provided. CompilerGeneratedFilesOutputPath must be set in order to emit generated files. [D:\a\_work\1\s\Platform\...\MyProj.sqlproj]
You guys should really be injecting /p:CompilerGeneratedFilesOutputPath=Generated
as well
So, I decided to continue with the variant that runs source generators
Is there a reason why you have to run generators again vs. just looking at the generated files from the build that created the binary log? The compiler log code makes those easy to extract, assuming portable PDB.
You guys should really be injecting /p:CompilerGeneratedFilesOutputPath=Generated as well
Curious: why do you all need these files? If you're re-running generators then the generated files will be in the Compilation
object. Trying to understand why also having them on disk is beneficial.
@tamasvajk your injecting EmitCompilerGeneratedFiles into build pipelines absolutely breaks production builds (at least for us in Azure Data)
You guys should really be injecting /p:CompilerGeneratedFilesOutputPath=Generated as well
Yes, in certain cases, injecting EmitCompilerGeneratedFiles
into the build command can break the build. Using the binary logs for extraction would/will solve this problem. At the same time, I don't think this issue should be used to discuss this problem. As far as I see, it's already causing some confusion. (I think we should continue discussing the EmitCompilerGeneratedFiles
problem in the already started teams chat, ICM issue, or a new issue here in github/codeql
.)
So, I decided to continue with the variant that runs source generators
Is there a reason why you have to run generators again vs. just looking at the generated files from the build that created the binary log? The compiler log code makes those easy to extract, assuming portable PDB.
I continued with the variant in https://github.com/github/codeql/pull/16747 that also processes the output of source generators (in contrast to https://github.com/github/codeql/pull/16672, which doesn't). We get the generated ASTs from compilationData.GetCompilationAfterGenerators()
. As far as I understand, GetCompilationAfterGenerators
is actually running the source generators in this case. Is there a better/another way of getting to the generated ASTs from an input binlog file?
You guys should really be injecting /p:CompilerGeneratedFilesOutputPath=Generated as well
Curious: why do you all need these files? If you're re-running generators then the generated files will be in the Compilation object. Trying to understand why also having them on disk is beneficial.
I think there's a mix of contexts here. @KirillOsenkov is suggesting that we should be injecting the /p:CompilerGeneratedFilesOutputPath=Generated
property into the build command with our normal autobuild/tracing solution. The tracing solution is not running any of the source generators, instead it adds the generated files to the compiled sources, see here. In this context we do need to force the /p:EmitCompilerGeneratedFiles=true
so that the files are written to the disk. @KirillOsenkov found that in certain cases /p:EmitCompilerGeneratedFiles=true
is not enough, and we should be also adding /p:CompilerGeneratedFilesOutputPath=Generated
too.
In the context of binary log extraction, we don't need any of the injected properties. Actually, with the binary log based extractor, we're not even starting our tracer as there's no build command that we'd want to trace.
I think we should continue discussing the EmitCompilerGeneratedFiles problem in the already started teams chat,
Is this a teams chat that I can join? Always interested in how adding properties to the build impact the results unexpectedly.
Is there a better/another way of getting to the generated ASTs from an input binlog file?
More recent versions of Basic.CompilerLog.Util
have a method BinaryLogReader.ReadAllGeneratedFiles. That method will just look into the portable PDB and read out the generated files without re-running them.
It only works when portable PDBs are used as native PDBs don't support this function. That is how the vast majority of compilations work though. Could possibly save a bit of time by having that as the fast path and falling back to the GetCompilationAfterGenerators
if it's not a portable PDB.
@jaredpar I briefly looked into using BinaryLogReader.ReadAllGeneratedFiles
. I see that I could use it instead of looking for the generated files in compilerArguments.GeneratedFilesOutputDirectory
. This would play nicely with the approach in https://github.com/github/codeql/pull/16672. But is there a way to incorporate this in https://github.com/github/codeql/pull/16747, which relies on Basic.CompilerLog.Util
to retrieve the CSharpCompilation
objects? Is there a Basic.CompilerLog.Util
API that takes the output of ReadAllGeneratedFiles
and produces an "updated" compilation from it?
The nature of the C# driver means that it needs to see all csc invocations for a build. Today that is achieved by disabling shared compilation during build. That unfortunately creates a significant performance penalty when building C# code. The larger the repo the more significant the penalty. The rule of thumb is that for solutions of at least medium size this will cause somewhere from a 3-4X build slow down. For larger repositories this can be even bigger.
Consider as a concrete example the dotnet/runtime repository. When their CodeQL pipeline runs and shared compilation is disabled it increases their build time by 1 hour and 38 minutes. Or alternatively building the dotnet/roslyn repository. Building that locally results in a ~4X perf penalty when shared compilation is disabled.
I understand from previous conversations that this is done because the extractor needs to see every C# compiler call in the build and potentially process it. An alternative approach to doing this is to have the build produce a binary log. That is a non-invasive change to builds and there is ample tooling for reading C# compiler calls from those once the build completes. The code for reading compiler calls is at the core very straight forward:
I have a sample here of integrating this approach into the existing extractor tool. On my machine I was able to use this sample to run over a full build of roslyn without any obvious issues. And most importantly, no changes to how I built roslyn :smile:
Using libraries like this have other advantages because you could also offload the work of creating
CSharpCompilation
objects. There is code in that library that will very faithfully recreate aCSharpCompilation
for everyCompilerCall
instance. Furthermore, think there are other optimization opportunities for the extractor once you are running the analysis in bulk: cachingMetadataReferences
, cachingSourceText
, etc ...More than happy to chat about this, the different ways binary logs can be used, moving analysis off machine, etc ... My big end goal is to get back to us using shared compilation so we can keep our build times down.