Large binary sizes - Githubissues

matajoh commented 5 months ago

Trieste executables are currently quite large, for example, on Ubuntu in Debug:

~/Trieste/build/dist$ tree -sh parsers
[4.0K]  parsers
├── [ 13M]  json_fuzzer
├── [ 13M]  json_test
├── [ 22M]  yamlc
├── [ 21M]  yaml_fuzzer
└── [ 24M]  yaml_test

~/Trieste/build/dist$ tree -sh infix
[4.0K]  infix
├── [ 14M]  infix
└── [ 13M]  infix_trieste

Even in Release, they are larger than may be warranted (in particular, for infix):

~/Trieste/build/dist$ tree -sh parsers
[4.0K]  parsers
├── [1.3M]  json_fuzzer
├── [1.4M]  json_test
├── [3.6M]  yamlc
├── [3.6M]  yaml_fuzzer
└── [3.8M]  yaml_test

~/Trieste/build/dist$ tree -sh infix
[4.0K]  infix
├── [1.5M]  infix
└── [1.4M]  infix_trieste

fhackett commented 5 months ago

I have investigated this using a mix of disassembly and binary analysis. There are two short answers, one for debug and one for release builds. For debug, only 10% of the binary is actually code; all the rest seems to be various metadata and debug info. The code that is there seems fine, though clearly a lot can be pruned - RE2 seems benefit a lot from dead code elimination in release mode, for example. For release, most of the binary is code (70% roughly), and the largest functions, at 2-6% of total program symbol bytes, are Trieste rule definitions with lots of rules. I mean the functions which when called produce PassDef, not the associated lambdas that do actual work. These seem very big for what they are.

Methodology

Tools used:

Ghidra CodeBrowser
elf-size-analyze wrapper around binutils (lists symbols with binary size info)

I tested the rego-cpp implementation and my local fork of infix (mostly the same thing, but some code is added that may affect specific sizes of things). I looked at both release and debug builds of those two programs. I have attached data dumps I got from using elf-size-analyze: elf-size-analyze-reports.zip. You can reproduce these by installing elf-size-analyze and running elf-size-analyze --rom --no-color <the binary>, which should produce similar data. To see mangled symbols which might cross-reference better with Ghidra's output given the same file, add --no-demangle.

For how I used Ghidra, I imported each executable into CodeBrowser with default settings. I used it to explore either decompilations of the binary in some cases, or the sizes of different sections and their contents in other cases, by selecting areas of the executable and looking at the number of bytes reported. When cross-referencing functions from elf-size-analyze, I used Navigation > Go To ... and pasted mangled symbol names, waited for the tool to decompile that function, and looked for anything that stood out to me.

Rego analysis

Debug build

The debug build is about 10% code according to Ghidra, with most sections being a combination of typeinfo and debug metadata. Interestingly, the debug code footprint is about 2mb smaller than the release build's. This is likely due to some sub-optimal inlining choices in the release build.

Release build

The largest function in rego-cpp is the items() rule from Trieste's YAML parser, which it imports. (first line of definition: https://github.com/microsoft/Trieste/blob/main/parsers/yaml/reader.cc#L1922) Its code when fully optimized is 174,374kb, and I couldn't analyze it directly because it exceeded the capacity of Ghidra's decompiler under default settings. The next few largest symbols are also rules: symbols() and rulebody() from Rego-cpp.

One symbol that doesn't look like a rule is called _GLOBAL__sub_I_reader.cc, which appears to have more than one definition. Guessing from the contents of the first definition, it looks like a static initializer from a WF definition in reader.cc. Based on what I found when looking at Infix (whose 3rd largest function I could decompile as it was only 36kb), I think this is another case of a somewhat big function getting inlined many times as parts of a large, compound expression, thus inflating binary size. See below for why I think that.

Infix

The Infix debug build is much the same as Rego's, but on a smaller scale. I have nothing significant to add to that. The release build, however, is more manageable to decompile, and lets me continue my investigation regarding why rule functions are being compiled to so much machine code.

The first and second-largest functions don't decompile, but the 3rd largest one does. I succeeded in decompiling expressions(), defined in read.cc, Looking at the inferred C-like code, I notice that, aside from references to constants (the token IDs), almost everything is inlined. Specifically, I see a looping construct repeating over and over, multiple times per what looks like a rule.

By elimination, the only code that isn't in a lambda (lambdas are not transitively called here, just initialized) is *, T(...), In(...), <<, and >>. Looking at each definition, I notice that FastPattern::match_seq is the only thing with a loop in it. It's invoked on every *, so it makes sense to see that code inlined a few times per rule, because that is the most common operator.

I have a theory that preventing inlining of that one constructor routine would cut binary size by 1-2 mb, at least on this version of my local compiler etc etc.

Thoughts

I have learned two things from doing this:

Debug builds, while gigantic, are not representative of release builds in terms of code size. They are not "release but bigger"; the .text section actually shrinks for a debug build.
Inlining defaults seem to work poorly when compiling Trieste's idiomatic huge nested initialization-time expressions, at least when looking at some of these inlining decisions. If it is deemed worth it, tuning the inlining would probably shrink code by a lot; these custom operators get almost completely inlined right now, despite being sometimes non-trivial constructors. That policy probably works better for std::ranges code or something, where the inlined result is a tight for-loop that runs a lot, but this code isn't even hot: it's setup code, so reducing the inlining seems reasonable.

mjp41 commented 5 months ago

Is you add SNMALLOC_SLOW_PATH to the various operator overloadings inside rewrite.h it would be interesting to see what happens to the binary size. Most of that code is effectively dynamically compiled at startup. All the fast path generation and rewrite tables are built on constructing PassDef, and not statically.

fhackett commented 5 months ago

Minor update: I disproved my theory that it was that FastPattern construction in particular.

I added the SNMALLOC_SLOW_PATH, and it didn't inline it (I see the call in the disassembly), but it also didn't shrink the binary. I will revisit this when I have more time to figure out where exactly all the code is coming from.

mjp41 commented 5 months ago

I am wondering if it is some of my gratuitous use of templates. I did a bunch of stuff to make it go faster, before I build the FastPattern code that rapidly rejects potential pattern matches. It might be that making things like

T(Token... ts)

Be some careful trick to remove the template and use a dynamic size. You would want to force the inlining of the initial template bit, to turn it into a vector, but not the vector bit.

I can try to prototype what I mean, if that would be helpful.

fhackett commented 5 months ago

You're welcome to if you have time, but for me a big nagging question is "well what is that repeated inlined code doing then, if it's not that constructor?".

I'll set aside some time soon (give or take checking out some of the big conference thing for the next 3 days), because at this point I just want to know. Maybe it will be something surprising that affects decision making... or not at all, who knows.

fhackett commented 5 months ago

Some off-channel discussion and experiments later, I've isolated what I think are key factors making the binaries big.

It seems to be the implicitly-defined lifecycle functions (copy constructors, move constructors, destructors, etc) getting inlined. Making a WF, rule, or pass takes a lot of intermediate objects. Some of these objects contain non-trivial data structures like sets, maps, and vectors. Even if there's not much runtime performance effect due to it not being hot code, inlining the various builder operations (as the compiler seems to like doing) ends up consuming a lot of space.

The tricky part is that all these implicit functions are not even in the source usually, so I went through and explicitly defaulted then marked noinline some likely suspects (as well as some operators that just call into stdlib to change a vector or map). The result is a 15% code size decrease for Infix, to the tune of 170kb for an originally 1.1mb executable.

Past this point, further noinline directives seem to have little effect (the last one I tried only cut 2kb total). A lot of the remaining code size is libraries used like re2 and CLI11.

fhackett commented 5 months ago

I tried the patched Trieste on Rego-cpp as well, and it saves 28% binary size, or barely under 2mb.

Given a cursory look, the largest symbols in the shrunk binary are mostly lambdas and such with implementation code in them. The largest function is 72kb, compared to the previous largest which was 174kb.

fhackett commented 4 months ago

To update this more accurately, I think the lifecycle code (std::shared_ptr and such) is still a significant factor in code size.

Looking at the largest rule lambdas in a recent size report rego_release_nx_report.zip, I can still see a lot of repeating inlined code, possibly Node destructors. I can see other strange patterns sometimes, but the reference count-decrease code seems to dominate, at least when scrolling around the disassembly.

It's not clear exactly how to deal with this, but if this issue gets passed onto someone else at some point, hopefully this means a bit less of it is all in my head.

microsoft / Trieste

Large binary sizes #115

Methodology

Rego analysis

Debug build

Release build

Infix

Thoughts