llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

clang-format.exe format big file out of 2G memory even when /* clang-format off */ is first line #27467

Open llvmbot opened 8 years ago

llvmbot commented 8 years ago
Bugzilla Link 27093
Version trunk
OS Windows XP
Reporter LLVM Bugzilla Contributor
CC @mydeveloperday,@nigeltao

Extended Description

On Windows 8 64-bit with 8 GB of RAM, it fails immediately:

d:\>clang-format.exe -style=llvm -i sqlite3.c
0x00B03AF2 (0x0168EE97 0x02B1052E 0x00000000 0x0193C058)
0x00B0355D (0x839EB699 0x00000058 0x00000050 0x01930000)
0x77C7EE41 (0x00000030 0x00000002 0x01930000 0x00000010)
0x77C7E3A0 (0x01930000 0x00000002 0x01939740 0x839EB781)
0x77C674F9 (0xFFFFFFFE 0x00000050 0x00000004 0x01940588)
0x77C7EAB4 (0x00000000 0x01930260 0x00000004 0x0168E678)
0x77C7F9AE (0x0168E670 0x0193A264 0x0193C058 0x0168EE64)
0x00BDA951 (0x0168EEBC 0x0168F31C 0x0168E6F8 0x00000001)
0x00B02C69 (0x0168EEBC 0x0168F31C 0x02736020 0x003DA50E)
0x00AB247B (0x01939AB8 0x00000009 0x00BED821 0x00000000)
0x00AB1C3B (0x7F756000 0x0168FBA4 0x77C88F8B 0x7F756000)
0x777EA534 (0x7F756000 0x839EA965 0x00000000 0x00000000)
0x77C88F8B (0xFFFFFFFF 0x77C7DACF 0x00000000 0x00000000)
0x77C88F61 (0x00BDA92D 0x7F756000 0x00000000 0x00000000)

But on Windows 7 32-bit with 2 GB of RAM, the system becomes very slow, clang-format.exe uses nearly 2 GB of RAM, and then it fails:

0x0103AEA0 (0x0A4CDA70 0x00000000 0x06CA3E3C 0x00100000) <unknown module>

I added /* clang-format off */ as the first line, with no /* clang-format on */, hoping to avoid this error, but it still crashes. Please copy the data between /* clang-format off */ and /* clang-format on */ straight to the output without parsing it. Thanks.

We find that many large files crash. This is a public file: sqlite3.c, 3.85 MB (4,039,926 bytes). It is an amalgamation of many separate C source files from SQLite version 3.6.23.

mydeveloperday commented 4 years ago

@Jake

Here are a couple of additional thoughts:

1) clang-format can work on a range of lines. I wonder if the problem is that we are trying to work on the whole file in one go.

What if clang-format could be made to work iteratively over a file, breaking very large files up into sections and formatting each section in turn?

cf --offset=0 --length=1000
cf --offset=1000 --length=1000
cf --offset=2000 --length=1000
...

This is the functional equivalent of breaking the file up at the "Begin file"/"End of" comments, clang-formatting the pieces, and sticking them back together.

A quick scan through the file could easily break it into natural sections at convenient points such as a blank line or a comment.
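The "quick scan" for natural break points could look roughly like the following. This is only a sketch of the idea, not anything clang-format does today: the function name split_into_sections and the max_lines threshold are made up for illustration. It returns (offset, length) pairs in bytes, which is the unit the existing --offset and --length options use. Whether formatting each range separately would actually reduce peak memory is not established in this thread.

```python
def split_into_sections(data: bytes, max_lines: int = 1000) -> list[tuple[int, int]]:
    """Split `data` into consecutive (offset, length) byte ranges.

    A section is closed at the first blank line seen once it already
    holds at least `max_lines` lines, so sections end at a natural gap.
    (Hypothetical helper; the threshold is an arbitrary assumption.)
    """
    sections = []
    start = 0   # byte offset where the current section begins
    pos = 0     # byte offset just past the last line read
    count = 0   # lines in the current section
    for line in data.splitlines(keepends=True):
        pos += len(line)
        count += 1
        if count >= max_lines and line.strip() == b"":
            sections.append((start, pos - start))
            start = pos
            count = 0
    if pos > start:
        # Whatever is left after the last blank-line break.
        sections.append((start, pos - start))
    return sections


if __name__ == "__main__":
    sample = b"int a;\nint b;\n\nint c;\n\nint d;\n"
    for offset, length in split_into_sections(sample, max_lines=2):
        print(f"cf --offset={offset} --length={length}")
```

The ranges are contiguous and cover the whole input, so the pieces can be stitched back together in order.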

2) I REALLY like your .clang-format-ignore idea, especially if it followed the .gitignore model. I have lots of auto-generated code which I have to push through clang-format runs, because at the end of my nightly build we check that 100% of C++ files are clang-formatted. It's just not that nice to write a script that does that; there is no standard way of ignoring 10,000 files, but a local .clang-format-ignore file in the directory would be perfect. (Oh! I'd love that!)
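For reference, the kind of filtering a wrapper script has to do by hand could be sketched as below. This assumes a much simplified .gitignore-like syntax (blank lines and # comments skipped, a leading ! re-includes a file, the last matching pattern wins). The names parse_ignore and is_ignored are hypothetical, and unlike real .gitignore matching, * here also matches across "/".

```python
from fnmatch import fnmatchcase


def parse_ignore(text: str) -> list[tuple[str, bool]]:
    """Parse ignore-file text into (pattern, negated) pairs."""
    patterns = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        negated = line.startswith("!")
        patterns.append((line[1:] if negated else line, negated))
    return patterns


def is_ignored(rel_path: str, patterns: list[tuple[str, bool]]) -> bool:
    """Return True if `rel_path` (POSIX style, relative) should be skipped."""
    ignored = False
    for pattern, negated in patterns:
        if fnmatchcase(rel_path, pattern):
            ignored = not negated  # a later match overrides an earlier one
    return ignored


if __name__ == "__main__":
    rules = parse_ignore("# generated code\ngenerated/*\n*.pb.cc\n!generated/keep.cpp\n")
    for path in ["generated/big.cpp", "generated/keep.cpp", "src/main.cpp"]:
        print(path, "skip" if is_ignored(path, rules) else "format")
```

A wrapper would run clang-format only on the paths for which is_ignored returns False.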

llvmbot commented 4 years ago

Yeah, ifdefs will increase the time linearly since it has to do multiple runs. But when I looked at where all the memory was going, the vast majority seemed to be in copying around unwrappedlines. Need to do more digging, but even if c-f was optimal, I doubt we could skip enough work to make it transparent with just 'clang-format off'. We still have to tokenize and write lines or else there'd be no output.

I don't think it's possible to solve this in the general case, but I think a mechanism for clang-format to skip an entire file is in scope. As it stands, I know lots of folks have wrapper scripts for clang-format specifically to ignore given files like this. I personally would like to see .clang-format-ignore files canonicalized, particularly since my codebase has some huge autogenerated files with gobs of ifdefs. I also mentioned having a special directive people can put near the beginning of files, and we might be able to check for that before we split the file into runs.

mydeveloperday commented 4 years ago

Here is what I see...

sqlite3.c is ~229,000 lines of code

If you take that file and run clang-format on the first 30,000 lines

head -30000 sqlite3.c | clang-format > /dev/null

it rises to 1.4GB of memory and takes 2:31 to format.

Do all 30,000-line files take this long? No. So what makes sqlite3.c so special?

That isn't 100% clear yet, but what I noticed was that the time increased significantly between 16250 lines and 16300 lines:

$ time /usr/bin/head -16200 sqlite3.c | clang-format > /dev/null

real    0m3.090s
user    0m0.000s
sys     0m0.077s

$ time /usr/bin/head -16300 sqlite3.c | clang-format > /dev/null

real    0m11.057s
user    0m0.031s
sys     0m0.062s

However if I commented out the lines around there...

/*
** Figure out if we are dealing with Unix, Windows, or some other operating
** system.
**
** After the following block of preprocess macros, all of SQLITE_OS_UNIX,
** SQLITE_OS_WIN, and SQLITE_OS_OTHER will defined to either 1 or 0.  One of
** the three will be 1.  The other two will be 0.
#if defined(SQLITE_OS_OTHER)
#  if SQLITE_OS_OTHER==1
#    undef SQLITE_OS_UNIX
#    define SQLITE_OS_UNIX 0
#    undef SQLITE_OS_WIN
#    define SQLITE_OS_WIN 0
#  else
#    undef SQLITE_OS_OTHER
#  endif
#endif
#if !defined(SQLITE_OS_UNIX) && !defined(SQLITE_OS_OTHER)
#  define SQLITE_OS_OTHER 0
#  ifndef SQLITE_OS_WIN
#    if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__) || \
        defined(__MINGW32__) || defined(__BORLANDC__)
#      define SQLITE_OS_WIN 1
#      define SQLITE_OS_UNIX 0
#    else
#      define SQLITE_OS_WIN 0
#      define SQLITE_OS_UNIX 1
#    endif
#  else
#    define SQLITE_OS_UNIX 0
#  endif
#else
#  ifndef SQLITE_OS_WIN
#    define SQLITE_OS_WIN 0
#  endif
#endif
*/

The time dropped significantly:

$ time /usr/bin/head -16300 sqlite3.c | clang-format > /dev/null

real    0m5.688s
user    0m0.000s
sys     0m0.109s

This suggests that clang-format is affected by preprocessor clauses. I believe that to be true from other work I've seen, where it seems to run multiple passes, one for each clause variation. It's quite possible something like this is affecting speed and memory (but I have no evidence of this).

So with that commented out, the 30,000-line example uses 1.4GB. Given that 30,000 lines is 13% of the total, the memory needed for the whole file would be ~7.6 x 1.4GB = 10.7GB (assuming it's linear, which it might not be, for example if allocations grow by doubling).

So whilst it might be using a lot of memory, it would appear that just storing the token structures might require as much as 10GB to format such a large file.
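The back-of-the-envelope extrapolation above can be written out as follows. It uses only the figures quoted in this comment (30,000 of ~229,000 lines, 1.4GB) and assumes memory grows linearly with line count, which is an assumption and not something measured here. The function name is made up for illustration.

```python
def extrapolate_linear(sample_value: float, sample_lines: int, total_lines: int) -> float:
    """Scale a measurement taken on `sample_lines` up to `total_lines`,
    assuming the measured quantity grows linearly with line count."""
    return sample_value * total_lines / sample_lines


TOTAL_LINES = 229_000     # approximate size of sqlite3.c
SAMPLE_LINES = 30_000     # lines fed to clang-format via head
SAMPLE_MEMORY_GB = 1.4    # memory observed for the sample

fraction = SAMPLE_LINES / TOTAL_LINES
memory_gb = extrapolate_linear(SAMPLE_MEMORY_GB, SAMPLE_LINES, TOTAL_LINES)

if __name__ == "__main__":
    print(f"sample is {fraction:.0%} of the file")
    print(f"projected memory: {memory_gb:.1f} GB")
```

If memory growth is superlinear, the real figure would be higher than this projection.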

The question is:

Is it reasonable for clang-format to use 1.4GB for a 30,000-line file?

I guess the question comes down to the number of tokens. I have no idea how we can easily determine that, but I'm thinking the average number of tokens per line must be 4-7, allowing for blank lines and comments having only 1 or 2.

We could likely expect at least 1,000,000 tokens in the token stream for a 229,000-line file.

One look at FormatToken tells us it has a large number of member variables. Just eyeballing it:

some 19+ unsigned
12+ booleans
4 pointers
1 SmallVector
1 StringRef
numerous enumerations
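To put rough numbers on this, the sketch below divides the observed 1.4GB by the guessed token count for the 30,000-line sample, and compares that with a crude lower bound for one FormatToken built from the field counts listed above. The field sizes are assumptions for a typical 64-bit build (4-byte unsigned, 1-byte bool, 8-byte pointer, 16-byte StringRef); the SmallVector and the enumerations are left out because their sizes are not stated here, and 1.4GB is treated as 1.4e9 bytes.

```python
SAMPLE_LINES = 30_000
SAMPLE_MEMORY_BYTES = 1.4e9      # 1.4GB, taken as decimal gigabytes (assumption)
TOKENS_PER_LINE_RANGE = (4, 7)   # the guess from the comment above


def bytes_per_token(memory_bytes: float, lines: int, tokens_per_line: int) -> float:
    """Observed memory divided by the estimated number of tokens."""
    return memory_bytes / (lines * tokens_per_line)


def format_token_lower_bound() -> int:
    """Crude size estimate from the eyeballed field counts.

    Assumed sizes: unsigned = 4, bool = 1, pointer = 8, StringRef = 16.
    Ignores the SmallVector, the enumerations, padding and bit-fields.
    """
    return 19 * 4 + 12 * 1 + 4 * 8 + 16


if __name__ == "__main__":
    low, high = TOKENS_PER_LINE_RANGE
    print(f"tokens in sample: {SAMPLE_LINES * low:,} to {SAMPLE_LINES * high:,}")
    print(f"bytes per token:  {bytes_per_token(SAMPLE_MEMORY_BYTES, SAMPLE_LINES, high):,.0f}"
          f" to {bytes_per_token(SAMPLE_MEMORY_BYTES, SAMPLE_LINES, low):,.0f}")
    print(f"FormatToken lower bound: {format_token_lower_bound()} bytes")
```

Under these assumptions the memory per token comes out in the kilobytes, far above the struct estimate. That would fit the earlier remark in this thread that most of the memory seems to go into copying unwrapped lines around rather than into the tokens themselves, though nothing here proves it.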

I really can't say whether clang-format is using too much memory or whether at this scale the file is just too big.

I think I'd be more concerned that the 229,000-line file would take roughly 7.6 x 2:31, or about 19 minutes, to format.

SQLite3 is a unified source built from the individual files; perhaps it should be formatted in its constituent parts before they are combined.

/**** End of sqlite3rtree.h */
/**** Begin file sqlite3session.h */

I'm not 100% convinced this particular example is worth solving.

MyDeveloperDay

llvmbot commented 4 years ago

I'll take a look at it this weekend. My codebase has some >1MB headers which would benefit here too, though recent clang-format didn't crash with them. There is something similar in place for #if 0 (which may contain text the lexer doesn't like), but parsing just enough to look for a clang-format on comment will be a bit harder.

Does something similar need to be implemented for DisableFormat:true in style files?

I think native handling of clang-format-ignore files would be a good solution here too. A lot of codebases wrap clang-format to skip running clang-format on ignored files.

cf10018d-1d2e-41f9-ab09-cc4591d6d858 commented 4 years ago

Trying to format SQLite ("Source Code" - "Amalgamation" from https://www.sqlite.org/download.html) still crashes with clang-format-9

$ wc sqlite-amalgamation-3320200/sqlite3.c
 229616 1049648 8115947 sqlite-amalgamation-3320200/sqlite3.c
$ clang-format-9 < sqlite-amalgamation-3320200/sqlite3.c > /dev/null
LLVM ERROR: out of memory
Stack dump:
0.      Program arguments: clang-format-9
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x1f)[0x7ff55208135f]
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(_ZN4llvm3sys17RunSignalHandlersEv+0x50)[0x7ff55207f780]
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(+0xa38761)[0x7ff552081761]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7ff55143c890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7ff54e2f3e97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7ff54e2f5801]
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(_ZN4llvm22report_bad_alloc_errorEPKcb+0x93)[0x7ff551fe60a3]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN4llvm23SmallVectorTemplateBaseIN5clang6format13UnwrappedLineELb0EE4growEm+0xbe)[0x7ff550e007be]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format13TokenAnalyzer20consumeUnwrappedLineERKNS0_13UnwrappedLineE+0x121)[0x7ff550dffb21]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format19UnwrappedLineParser5parseEv+0x119)[0x7ff550e10089]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format13TokenAnalyzer7processEv+0xcf)[0x7ff550dfee2f]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format13guessLanguageEN4llvm9StringRefES2_+0x2d8)[0x7ff550de2728]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format8getStyleEN4llvm9StringRefES2_S2_S2_PNS1_3vfs10FileSystemE+0x59)[0x7ff550de2859]
clang-format-9[0x4064ef]
clang-format-9[0x405688]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7ff54e2d6b97]
clang-format-9[0x40509a]
Aborted (core dumped)

llvmbot commented 8 years ago

I use LLVM-3.9.0-r264047-win32.exe.