Why is this change being made?
SizeBench has had perennial performance problems for various customers, especially those with large binaries. Recently a binary was found with a pathologically bad case that took 17 hours to process, which is obviously absurd - so I'm taking the time to invest in some significant core performance-focused changes to improve performance for most use cases.
Briefly summarize what changed
There's a variety of changes in this PR, found as I kept profiling and trying things until the performance trade-offs eventually seemed right:
Reducing allocations by using [ThreadStatic] on some StringBuilder instances so we can re-use their internal char[] buffers over and over as we continually conjure up names for symbols and things like that.
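The pattern described above can be sketched roughly like this - a minimal illustration, with made-up names (`NameFormatter`, `BuildName` are not the actual SizeBench identifiers):

```csharp
using System;
using System.Text;

internal static class NameFormatter
{
    // One StringBuilder per thread; its internal char[] buffer is reused
    // across calls instead of allocating a new builder for every name.
    [ThreadStatic]
    private static StringBuilder? tls_nameBuilder;

    public static string BuildName(string typeName, string memberName)
    {
        // [ThreadStatic] fields start null on each thread, so initialize lazily.
        var sb = tls_nameBuilder ??= new StringBuilder(capacity: 100);
        sb.Clear(); // keeps the underlying buffer, just resets the length
        sb.Append(typeName).Append("::").Append(memberName);
        return sb.ToString();
    }
}
```

Note that field initializers don't run for `[ThreadStatic]` fields on any thread but the first, which is why the lazy `??=` initialization matters.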
When we go through various DIA enumerators, we have been using the built-in IEnumVARIANT marshaler, but this is very chatty between managed and native code. DIA's enumerators expose "batch" methods that need a little custom coding to take advantage of - if we pin an array of IntPtr, we can pass that down to have it collect an entire "chunk" of things at once, then much more quickly cast these to COM interfaces up in the managed layer. This significantly reduces the overhead of large enumerations.
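A hedged sketch of that chunked-enumeration shape (not SizeBench's exact code - the delegate stands in for the batch `Next` method's native signature, and all names here are illustrative):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class DiaEnumHelper
{
    // Assumed shape of a DIA batch-fetch call, roughly:
    //   HRESULT Next(ULONG celt, IDiaSymbol** rgelt, ULONG* pceltFetched);
    public delegate int BatchNext(uint celt, IntPtr rgelt, out uint celtFetched);

    public static void EnumerateInChunks(BatchNext next, Action<object> onItem, int chunkSize = 100)
    {
        var pointers = new IntPtr[chunkSize];
        // Pin the array so native code can write interface pointers directly into it.
        var handle = GCHandle.Alloc(pointers, GCHandleType.Pinned);
        try
        {
            while (true)
            {
                // One managed/native transition fills up to chunkSize entries.
                var hr = next((uint)chunkSize, handle.AddrOfPinnedObject(), out var fetched);
                if (hr < 0 || fetched == 0)
                    break;

                for (var i = 0; i < fetched; i++)
                {
                    // Wrapping the raw pointer in an RCW is cheap compared to
                    // a per-element interop round-trip through IEnumVARIANT.
                    onItem(Marshal.GetObjectForIUnknown(pointers[i]));
                    Marshal.Release(pointers[i]); // release the ref the native call handed us
                }
            }
        }
        finally
        {
            handle.Free();
        }
    }
}
```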
SessionDataCache changed many sorted collections to unsorted ones so we don't repeatedly pay the insertion cost of sorted collections. The value of keeping them sorted wasn't worth it.
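The tradeoff in miniature (illustrative names, not the actual cache fields): a `SortedList<K,V>` pays an array-shift on every out-of-order insert, while a `Dictionary<K,V>` inserts in amortized O(1), and a sorted view can still be produced once on demand if a consumer ever needs it:

```csharp
using System;
using System.Collections.Generic;

public static class CacheSketch
{
    public static List<uint> BuildAndSortOnDemand()
    {
        // Build phase: cheap unsorted inserts, arbitrary key order.
        var cache = new Dictionary<uint, string>(capacity: 4);
        cache[0x3000] = "c";
        cache[0x1000] = "a";
        cache[0x2000] = "b";

        // Only if ordered data is actually needed, sort once at the end -
        // one O(N log N) pass instead of paying O(N) shifts per insert.
        var sortedKeys = new List<uint>(cache.Keys);
        sortedKeys.Sort();
        return sortedKeys;
    }
}
```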
When looking up a symbol's placement in the binary, we now provide options to let the caller skip the very expensive source file load if they don't need it (e.g. BinaryBytes doesn't need this for the error it wishes to show).
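The general shape of such an opt-out is a flags parameter on the lookup - this is a hypothetical sketch, and `SymbolPlacementOptions`, `SkipSourceFileLoad`, and `Describe` are illustrative names, not SizeBench's actual API:

```csharp
using System;

[Flags]
public enum SymbolPlacementOptions
{
    Default = 0,
    SkipSourceFileLoad = 1 << 0, // caller doesn't need source file info
}

public static class PlacementLookup
{
    public static string Describe(uint rva, SymbolPlacementOptions options)
    {
        // Only take the expensive path when the caller didn't opt out of it.
        var sourceFile = options.HasFlag(SymbolPlacementOptions.SkipSourceFileLoad)
            ? "<not loaded>"
            : LoadSourceFileFor(rva); // the expensive source file load
        return $"RVA 0x{rva:X}: {sourceFile}";
    }

    private static string LoadSourceFileFor(uint rva) => "example.cpp"; // stand-in
}
```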
RVARangeSet can be made faster during coalescing (which happens a lot in large binaries) because, for example, we can pre-size our allocations.
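A minimal sketch of coalescing with a pre-sized output list (simplified; the real RVARangeSet differs - this just shows the pre-sizing idea):

```csharp
using System;
using System.Collections.Generic;

public readonly record struct RVARange(uint Start, uint End);

public static class CoalesceSketch
{
    public static List<RVARange> Coalesce(List<RVARange> ranges)
    {
        // Pre-size to the worst case (nothing coalesces) so the output
        // list never has to grow and re-copy mid-loop.
        var result = new List<RVARange>(ranges.Count);
        ranges.Sort((a, b) => a.Start.CompareTo(b.Start));
        foreach (var r in ranges)
        {
            // Merge ranges that overlap or are exactly adjacent.
            if (result.Count > 0 && r.Start <= result[^1].End + 1)
                result[^1] = result[^1] with { End = Math.Max(result[^1].End, r.End) };
            else
                result.Add(r);
        }
        return result;
    }
}
```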
The .NET SDK has updated and has newer Roslyn rules enforced so I also just cleaned up some things to ensure the repo remains warning/error/message-free from code analysis.
Also just fixing a couple extremely stupid bugs in SKUCrawler - I was using &= instead of |= to add supported symbol sources, so it often ended up with a SymbolSourcesSupported of None, which is wrong. Also making sure to properly DisposeAsync each Session to be more deterministic about perf.
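The `&=` vs. `|=` bug in one picture (enum members beyond `None` are illustrative): `&=` intersects with the current set, so starting from `None` the result is always `None`, while `|=` unions the new source in:

```csharp
using System;

[Flags]
public enum SymbolSourcesSupported { None = 0, Code = 1, DataSymbols = 2 }

public static class FlagsDemo
{
    // The bug: & intersects, so None & anything == None.
    public static SymbolSourcesSupported AddBuggy(SymbolSourcesSupported set, SymbolSourcesSupported src)
        => set & src;

    // The fix: | unions the new source into the set.
    public static SymbolSourcesSupported AddFixed(SymbolSourcesSupported set, SymbolSourcesSupported src)
        => set | src;
}
```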
And then the biggest change by far which is that we now make an intentional tradeoff to spend more time in session "open" to do a "pre-process" of symbol information and collect up important things that we'll want many, many times later - in particular, for each RVA we record each SymIndexID that is present (there can be multiple when folding happens). This means we do one giant walk on startup instead of many tiny walks for each RVA range as we do various operations. This causes opening to take longer and take more memory upfront, but makes so many subsequent operations faster that this is a net win for almost all use cases.
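The shape of that pre-processed index can be sketched as follows (illustrative names - `RvaToSymIndexMap`, `Record`, `SymIndicesAt` are not the actual SizeBench identifiers): one upfront walk records every SymIndexID at each RVA, so later operations become cheap dictionary lookups instead of fresh DIA walks per RVA range:

```csharp
using System;
using System.Collections.Generic;

public sealed class RvaToSymIndexMap
{
    private readonly Dictionary<uint, List<uint>> _map = new();

    public void Record(uint rva, uint symIndexId)
    {
        if (!_map.TryGetValue(rva, out var list))
            _map[rva] = list = new List<uint>(capacity: 1);
        // Multiple SymIndexIDs can land on one RVA when COMDAT folding occurred.
        list.Add(symIndexId);
    }

    public IReadOnlyList<uint> SymIndicesAt(uint rva) =>
        _map.TryGetValue(rva, out var list) ? list : Array.Empty<uint>();
}
```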
How was the change tested?
All existing tests pass (and they caught many almost-regressions as this change was under development, hooray!).
Added some new tests for a couple things like TryFindSymIndicesByRVARange that are now very important.
Manually tested several large binaries inside Microsoft.
For one large binary inside Microsoft, BinaryBytes went from taking 24.5 minutes to 4.75 minutes.
For another large binary in the SizeBench GUI, here are some timings on my machine:

| | Before | After |
| --- | --- | --- |
| Open session | 16s | 48s |
| Find all .text symbols | 37.4s | 1.6s |
So it does take significantly longer to open the session for giant binaries that do heavy inlining and COMDAT folding and such, but this is "paid back" to the user the first time they look at something interesting. Then once you do the second thing you're really winning on total time.
In theory this could be worse if the user opens the GUI (now takes longer) and then doesn't look at any symbols and only goes to, say, annotations or type layouts... but that just doesn't seem like a common use case, and the only downside is that it's a little slower - it's still fully functional.
PR Checklist
[x] Contributor License Agreement (CLA) has been signed. If not, go here and sign the CLA
[x] Changes have been validated
[x] Documentation updated. Please add or update any docs in the repo as necessary.