llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Memory Consumption Reduction for Large Array Initialization? #43744

Open efa63aab-f103-4eee-b246-ed4d9939edb3 opened 4 years ago

efa63aab-f103-4eee-b246-ed4d9939edb3 commented 4 years ago
Bugzilla Link 44399
Version trunk
OS Linux
CC @dwblaikie,@efriedma-quic,@zygoloid

Extended Description

I have a large number of files that I want to turn into data initializers for (static/global/const) arrays and then compile into my program. I generally rely on these arrays to be optimizer-transparent. Occasionally, they go up to 50~60 megabytes.

The files are often dumped by xxd. A C++ header file generated by xxd from a 40 MB input comes out to roughly 250 MB of source text.
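
For reference, `xxd -i` output has roughly this shape (contents abbreviated here; the byte values and length are placeholders for a hypothetical 20 MB file, and xxd turns dots in the filename into underscores for the identifiers):

```cpp
// Abbreviated shape of `xxd -i image.bin` output; the real header holds
// millions of these comma-separated hex bytes.
unsigned char image_bin[] = {
  0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a /* , ... */
};
unsigned int image_bin_len = 20971520; // placeholder length
```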

When I want to do work (constexpr processing, in some cases) on some of these arrays of (unsigned) char, I spend a lot of time just compiling, and memory usage shoots up. When I tracked down what I actually end up paying for, just using a single entry out of an e.g. 20 MB array cost ~2.1 GB of memory. For example, for a 20 MB file named "image.bin", I would run `xxd -i image.bin image.bin.h` and then attempt to return a single item from that array through main(), just to test the cost of parsing and holding onto that data in the compiler:

include"<image_bin.h"

int main () { return image_bin[2]; }

This program took ~2.1 GB of memory and around 39~40 seconds of compilation time, using 3 CPUs to nearly max. When I actually do real C++ with it, it gets much worse.

I am not interested in using the linker to create an object file and directly link it in, because that inhibits my ability to do constexpr/template metaprogramming and create more compact, interesting data structures from industry-standard file formats.

I would like to know if there are places in the Clang codebase I could contribute to, that would allow me to improve the throughput of such commonly deployed constructs. While I personally am the author of proposals heading to the C and C++ groups to make working with files less painful, maybe there is something I or others can do without new proposed functionality to help speed this kind of code up and reduce the memory footprint?

efa63aab-f103-4eee-b246-ed4d9939edb3 commented 4 years ago

(PPS: Nevermind, I'm just not very smart; I was reading the wrong outputs for GCC. With e.g. -O2 both LLVM and GCC are the same, and without it they both keep all the data around!)

efa63aab-f103-4eee-b246-ed4d9939edb3 commented 4 years ago

(Small side note from additional checking: every output file I tested actually keeps the entirety of the data in the final executable, and it only gets optimized out if I turn on additional optimization settings. This is the opposite of GCC, which seems to constant-fold away unused data at any optimization level.

The program used to test is similar to Comment 0, but it puts the data inside of the function locally to ensure that it is seen as "local" data and that it can and should be optimized out.)
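
A sketch of that local-data variant (byte values are placeholders, as above):

```cpp
// Same single-element access as before, but the array is function-local,
// so the compiler can see the data never escapes and should be able to
// fold the access and drop the array entirely.
int main () {
    static const unsigned char image_bin[] = {
        0x89, 0x50, 0x4e, 0x47 /* , ... megabytes more ... */
    };
    return image_bin[2];
}
```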

efa63aab-f103-4eee-b246-ed4d9939edb3 commented 4 years ago

> You could keep the source locations as a packed array, alongside the data; that's only 4 bytes per array element.

That's true. There could possibly be even better gains if only a starting location is kept, and each entry then stores an offset from it. But that's pretty complicated for a first go-around of an implementation.
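
A minimal sketch of what such a packed representation might look like (all names are hypothetical, not actual Clang API; Clang's real SourceLocation is an opaque 32-bit encoding, which is where the 4-bytes-per-element figure comes from):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical compact payload for a huge array initializer: the element
// values in one flat buffer, plus locations stored as a base location and
// per-element offsets instead of a full Expr node per element.
struct PackedInitializer {
    uint32_t BaseLoc;               // encoded location of the first element
    std::vector<uint32_t> Offsets;  // per-element offset from BaseLoc (4 B each)
    std::vector<uint8_t>  Bytes;    // the data itself (1 B per element here)
};
```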

It also means that we are forced to always pay the cost of tokenizing the source, which is also extremely unfortunate. The more I try to work on this, the more I'm realizing that binary data does not gel well with C or C++ source code, and that I should very seriously pursue other avenues for getting the data in.

> Do these other compilers (you can say who they are, really, it's fine) have better (better enough to be usable?) performance than LLVM on the large initializer style code?

Here is a test with 4 bytes to 4 GB of data, increasing by 10x every time. Timings were done several times and averaged, even the ones that took minutes. At 400 MB and 4 GB, swap was engaged and eventually out-of-memory was hit, so those sizes are not included here because there are no numbers (23.9 GB of RAM available for the compiler to consume in total, with only a meager pittance of additional non-RAM swap space).

I report only Clang and GCC, because MSVC is god-awful and trying to extract timing numbers from cl.exe and link.exe is like pulling teeth. (MSVC is also one of the compilers that has an internal max string length, which makes putting large initializers in it impossible. A bug report was already filed there and was closed as "no thank you", IIRC; I'd have to go dig through things to find it again, it's been a few years.)

Timings:

| Strategy | 4 bytes | 40 bytes | 400 bytes | 4 kilobytes |
| --- | --- | --- | --- | --- |
| xxd-generated GCC | 0.225 s | 0.215 s | 0.237 s | 0.247 s |
| xxd-generated Clang | 0.272 s | 0.275 s | 0.272 s | 0.272 s |

| Strategy | 40 kilobytes | 400 kilobytes | 4 megabytes | 40 megabytes |
| --- | --- | --- | --- | --- |
| xxd-generated GCC | 0.406 s | 2.135 s | 23.567 s | 225.290 s |
| xxd-generated Clang | 0.366 s | 1.063 s | 8.309 s | 83.250 s |

Memory:

| Strategy | 40 kilobytes | 400 kilobytes | 4 megabytes | 40 megabytes |
| --- | --- | --- | --- | --- |
| xxd-generated Clang | ~34.60 MB | ~97.98 MB | ~680 MB | ~6807 MB |

There are no more memory numbers because it's really hard to measure memory usage by staring at htop and other tools when the compile finishes in under a second, so take these numbers with a grain of salt.

There are other compilers (static analyzers) that use either EDG or a custom front-end. Those have different concerns, so they run out of memory (and crawl to a halt) much more quickly, because some of them make guarantees such as "we will reconstruct your source exactly as you wrote it", and you can imagine those do not have a great time tracking all that information for a 60 MB initializer.

The good news is, Clang is around 4x better than GCC, so at least I know I'm using the best of all the options!

I don't think I have the strength to refactor a codebase as large as Clang here to /maybe/ get potential throughput gains. Apologies for taking your time. I'll try to figure something else out. Thank you for your suggestions and help!

efriedma-quic commented 4 years ago

> I think something like the above also means you would need to eagerly error while creating the single fused AST node, since a packed "binary blob" representation does not lend itself to retaining line, column, etc. information on a per-element basis anymore.

You could keep the source locations as a packed array, alongside the data; that's only 4 bytes per array element.

> ... Yeah, this is an obscene amount of work I've made up on the fly, and I've only ever barely hacked Clang's parser before. Criminy.

Yes.

dwblaikie commented 4 years ago

> Any other ideas? Just trying to think of ways to do this before I go writing up new fanciful ways of loading up binary data.

Other ideas in terms of how to modify your code, or the compiler? For your code: breaking it up into strings that meet the requirements of all your compilers may be useful/necessary. A bit tricky if it's a singular blob of (for example) zip-compressed data, say - but if that's the case, there's not much you can do with it directly in constexpr context anyway, and you could use an external file instead (potentially embedded in the final binary, e.g. via incbin or similar, but not via any source-level feature). If there's some structure to it, figuring out how to use that structure to group it into 4k hunks might get adequate performance out of LLVM (though for a 40 MB file, that's still 10,000 chunks/variables, which might not be the most favorable thing ever).
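
A rough sketch of the chunking idea, assuming 4 KiB hunks (all names and byte values here are placeholders):

```cpp
#include <cstddef>

// Each hunk stays below the most restrictive compiler limit; an index
// table plus a small accessor stitches the blob back together.
constexpr unsigned char chunk0[4096] = { 0x89, 0x50 /* , ... */ };
constexpr unsigned char chunk1[4096] = { 0x4e, 0x47 /* , ... */ };
// ... roughly 10,000 of these for a 40 MB blob ...
constexpr const unsigned char* chunks[] = { chunk0, chunk1 /* , ... */ };

constexpr unsigned char byte_at(std::size_t i) {
    return chunks[i / 4096][i % 4096];
}
```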

Do these other compilers (you can say who they are, really, it's fine) have better (better enough to be usable?) performance than LLVM on the large initializer style code?

Side note: I'm not sure how any C++ compilation used 3 CPUs. The compiler is single threaded.

efa63aab-f103-4eee-b246-ed4d9939edb3 commented 4 years ago

I do use string literals from time to time, but certain compilers (which will go unnamed right now) have extremely frustrating internal limits on the amount of data that can be put into a string literal. The C++ standard specifies no limit on such data, but the C one says that the "minimum supported value" is 4,095, which some compilers take extremely literally and others take less literally but still enforce (at around 65,535). This makes the code unportable in interesting ways, since compilers have different interpretations of that limitation: one limits the individual string pieces in "foo" "bar" "baz" but lets the final joined string after parsing be as long as possible, while others have hard limits even after "foo" "bar" "baz" joining.
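
To make the difference concrete (byte values are placeholders): adjacent string literals are joined into a single literal during translation, so the two interpretations cap different stages.

```cpp
// After concatenation this is one array of 9 chars (8 data bytes plus the
// implicit NUL). Some compilers cap each piece below; others cap the
// joined result.
static const char data[] =
    "\x89\x50\x4e\x47"   // piece 1
    "\x0d\x0a\x1a\x0a";  // piece 2
```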

(Everything beyond this point is idea spitballing, take absolutely none of it as serious suggestions.)

It looks like the cost of parsing the tokens is inescapable, so I will just have to accept that. I think that maybe in select special cases a knowledge-enhanced "initializer" parsing of C and C++ might help. That is, given a left hand side of the forms

```
type-specifier-seq ident[(optional integer constant)] = brace-init-list
type-specifier-seq ident[(optional integer constant)] brace-init-list
```

One could use information from type-specifier-seq to make a very specific kind of initializer that more aggressively stores the data laid out in the brace-init-list, given it meets some criteria. For example, criteria for this special AST node would be if type-specifier-seq is a trivial type / "literal type", and if the right hand side's brace-init-list values are all constants that do not exceed 2 ^ (sizeof(type-specifier-seq) * CHAR_BIT) in value. This is something you would have to verify as you parse the tokens, using that special information from the left hand side.

If any of the brace-init-list elements are an expression that exceeds the size limit of the type or are a value that falls outside the potential binary representation of the type-specifier-seq, then you have to roll back all that work and just create a generic list.
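
A minimal sketch of that verify-as-you-parse step for 8-bit elements (names are hypothetical, not actual Clang API):

```cpp
#include <cstdint>
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/SmallVector.h"

// Attempt to pack one parsed element value into the compact buffer.
// Returns false when the value does not fit the element type, at which
// point the caller rolls back and builds a generic initializer list.
// (Assumes ElemBits <= 8 here, since one byte is appended per element.)
bool tryPackElement(const llvm::APInt &Val, unsigned ElemBits,
                    llvm::SmallVectorImpl<uint8_t> &Packed) {
    if (Val.getActiveBits() > ElemBits)
        return false;  // exceeds the binary representation of the type
    Packed.push_back(static_cast<uint8_t>(Val.getZExtValue()));
    return true;
}
```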

I think something like the above also means you would need to eagerly error while creating the single fused AST node, since a packed "binary blob" representation does not lend itself to retaining line, column, etc. information on a per-element basis anymore.

I think as long as ExprConstant.cpp has ways of handling what would probably get stored as ConstantArray data, that part of the modification might not be so scary? But doesn't an APValue store what is essentially a vector of APValues to represent an array? That's also wasteful, so there would need to be some modification there for ExprConstant processing to be more effective...

... Yeah, this is an obscene amount of work I've made up on the fly, and I've only ever barely hacked Clang's parser before. Criminy.

Any other ideas? Just trying to think of ways to do this before I go writing up new fanciful ways of loading up binary data.

efriedma-quic commented 4 years ago

The overhead here depends on how exactly you write the source code. clang keeps a complete AST for the entire source file in memory.

A string literal should have relatively low overhead; it's only one AST node. We make a few copies of the string data due to various API boundaries; not sure the exact number at the moment, but it should be something like 3x, not 100x like you're describing.

An initializer list is going to be orders of magnitude worse, though, in line with what you're reporting. If you write something like `char arr[] = {1, 2, 3};`, there are two AST expression nodes for each element of the array: one for the integer itself, and one to represent the implicit cast from int to char. And an expression node in clang is at least 24 bytes. (See include/clang/AST/Expr.h, include/clang/AST/Stmt.h.)
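
For scale, a back-of-the-envelope calculation from those figures: a 20 MB `char` array has 20,971,520 elements, and at 2 AST nodes × 24 bytes per element that is already roughly 1 GB of expression nodes, before counting source locations, the token stream, and the copies made along the way - in the same ballpark as the ~2.1 GB reported above.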

It might be possible to introduce some sort of optimized representation specifically for initializer lists. But it would be a big departure from existing AST handling. And it wouldn't really open up new use cases, given that string literal handling is already reasonably efficient.