Open llvmbot opened 8 years ago
@Jake
Here are a couple of additional thoughts:
1) clang-format can work on a range of lines. I wonder if the problem is that we are trying to work on the whole file in one go.
What if clang-format could be made to work iteratively over a file, breaking very large files up into sections and formatting each section in turn?
cf --offset=0 --length=1000
cf --offset=1000 --length=1000
cf --offset=2000 --length=1000
...
This is the functional equivalent of breaking the file up at the beginfile/endfile comment, clang-formatting the pieces and sticking them back together.
A quick scan through the file could easily break it into natural sections at convenient points such as a blank line or a comment.
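The sectioning idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: split_ranges and its target parameter are hypothetical names, and the ranges are byte offsets of the kind clang-format's --offset and --length options accept. Nothing in this thread establishes that formatting range by range actually lowers memory use, since the whole file may still be tokenized.

```python
def split_ranges(data: bytes, target: int = 1000) -> list[tuple[int, int]]:
    """Return (offset, length) byte ranges that cover data without gaps.

    A range is closed just after a blank line once it holds at least
    `target` bytes, so no range ever cuts through the middle of a line.
    """
    ranges = []
    start = 0
    pos = 0
    for line in data.splitlines(keepends=True):
        pos += len(line)
        is_blank = line.strip() == b""
        if is_blank and pos - start >= target:
            ranges.append((start, pos - start))
            start = pos
    # Whatever is left after the last blank-line boundary forms the final range.
    if pos > start:
        ranges.append((start, pos - start))
    return ranges


sample = b"int a;\n\nint b;\nint c;\n\nint d;\n"
for offset, length in split_ranges(sample, target=8):
    print(f"--offset={offset} --length={length}")
```

Each printed pair could be passed to a separate clang-format run on the same file. Splitting only at blank lines is a deliberately crude choice; a real implementation would also have to avoid boundaries inside multi-line constructs.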
2) I REALLY like your .clang-format-ignore idea, especially if it followed the .gitignore model. I have lots of auto-generated code which I have to push through clang-format runs, because at the end of my nightly build we check that 100% of the C++ files are clang-formatted. It's just not that nice to write a script that does that. A standard way of ignoring 10000 files would help, and a local .clang-format-ignore file in the directory would be perfect. (Oh! I'd love that!)
Yeah, ifdefs will increase the time linearly since it has to do multiple runs. But when I looked at where all the memory was going, the vast majority seemed to be in copying around unwrappedlines. Need to do more digging, but even if c-f was optimal, I doubt we could skip enough work to make it transparent with just 'clang-format off'. We still have to tokenize and write lines or else there'd be no output.
I don't think it's possible to solve this in the general case, but I think a mechanism for clang-format to skip an entire file is in scope. As it stands, I know lots of folks have wrapper scripts for clang format particularly to ignore given files like this. I personally would like to see .clang-format-ignore files canonicalized, particularly since my codebase has some huge autogenerated files with gobs of ifdefs. I also mentioned having a special directive people can put near the beginning of files, and we might be able to check for that before we split it into runs.
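The kind of wrapper script people write today can be sketched as follows. This is a minimal illustration, assuming a hypothetical ignore-file format with one glob per line and '#' comment lines; load_patterns and is_ignored are made-up helper names, and only a small subset of .gitignore semantics is covered (no negation, no '**', no directory anchoring).

```python
import fnmatch
from pathlib import PurePosixPath


def load_patterns(text: str) -> list[str]:
    """Parse ignore-file text: one glob per line, '#' starts a comment line."""
    patterns = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Return True if the whole path or just its file name matches a pattern."""
    name = PurePosixPath(path).name
    return any(
        fnmatch.fnmatchcase(path, p) or fnmatch.fnmatchcase(name, p)
        for p in patterns
    )


patterns = load_patterns("# generated code\nsqlite3.c\ngen/*.cpp\n")
files = ["src/main.cpp", "third_party/sqlite3.c", "gen/parser.cpp"]
to_format = [f for f in files if not is_ignored(f, patterns)]
print(to_format)
```

Only the files left in to_format would then be handed to clang-format, so ignored files are never opened or tokenized at all.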
Here is what I see...
sqlite3.c is ~229,000 lines of code
If you take that file and run clang-format on the first 30,000 lines
head -30000 sqlite3.c | clang-format > /dev/null
it rises to 1.4GB of memory and takes 2:31 to format.
Do all 30,000-line files take this long? No. So what makes sqlite3.c so special?
That isn't 100% clear yet, but what I noticed was that the time increased significantly between 16250 lines and 16300 lines:
$ time /usr/bin/head -16200 sqlite3.c | clang-format > /dev/null
real 0m3.090s
user 0m0.000s
sys 0m0.077s
$ time /usr/bin/head -16300 sqlite3.c | clang-format > /dev/null
real 0m11.057s
user 0m0.031s
sys 0m0.062s
However if I commented out the lines around there...
/*
** Figure out if we are dealing with Unix, Windows, or some other operating
** system.
**
** After the following block of preprocess macros, all of SQLITE_OS_UNIX,
** SQLITE_OS_WIN, and SQLITE_OS_OTHER will defined to either 1 or 0. One of
** the three will be 1. The other two will be 0.
#if defined(SQLITE_OS_OTHER)
# if SQLITE_OS_OTHER==1
# undef SQLITE_OS_UNIX
# define SQLITE_OS_UNIX 0
# undef SQLITE_OS_WIN
# define SQLITE_OS_WIN 0
# else
# undef SQLITE_OS_OTHER
# endif
#endif
#if !defined(SQLITE_OS_UNIX) && !defined(SQLITE_OS_OTHER)
# define SQLITE_OS_OTHER 0
# ifndef SQLITE_OS_WIN
# if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__) || \
defined(__MINGW32__) || defined(__BORLANDC__)
# define SQLITE_OS_WIN 1
# define SQLITE_OS_UNIX 0
# else
# define SQLITE_OS_WIN 0
# define SQLITE_OS_UNIX 1
# endif
# else
# define SQLITE_OS_UNIX 0
# endif
#else
# ifndef SQLITE_OS_WIN
# define SQLITE_OS_WIN 0
# endif
#endif
*/
The time dropped significantly:
$ time /usr/bin/head -16300 sqlite3.c | clang-format > /dev/null
real 0m5.688s
user 0m0.000s
sys 0m0.109s
This suggests that clang-format is impacted by preprocessor clauses. Notionally I believe that to be true from other work I've seen, where it seems to perform multiple runs, one for each clause variation. It's quite possible something like this is impacting speed and memory (but I have no evidence of this).
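One way to probe the preprocessor theory is to count conditional blocks in the region where the timing jumps. The following is a minimal sketch, assuming a simple line-based scan is good enough. conditional_stats is a hypothetical helper, it does not skip directives that sit inside comments or strings, and how clang-format actually chooses its runs is not established here.

```python
import re

# Lines that open or close a preprocessor conditional, e.g. "# if X" or "#endif".
OPEN = re.compile(r"^\s*#\s*(if|ifdef|ifndef)\b")
CLOSE = re.compile(r"^\s*#\s*endif\b")


def conditional_stats(source: str) -> tuple[int, int]:
    """Return (number of #if/#ifdef/#ifndef blocks, maximum nesting depth)."""
    blocks = 0
    depth = 0
    max_depth = 0
    for line in source.splitlines():
        if OPEN.match(line):
            blocks += 1
            depth += 1
            max_depth = max(max_depth, depth)
        elif CLOSE.match(line):
            # Clamp at zero so an unbalanced prefix of a file does not go negative.
            depth = max(depth - 1, 0)
    return blocks, max_depth


snippet = """\
#if defined(SQLITE_OS_OTHER)
# if SQLITE_OS_OTHER==1
# endif
#endif
"""
print(conditional_stats(snippet))
```

Running this over successive prefixes of the file (for example the first 16200 and the first 16300 lines) would show whether the slow region is unusually dense in conditionals.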
So with that commented out, the 30,000-line example uses 1.4GB. Given that 30,000 lines is 13% of the total size, this suggests the memory needed for the whole file would be ~7.6 x 1.4GB = 10.7GB (assuming it's linear, which it may not be if allocations grow by doubling).
So whilst it might be using a lot of memory, it would appear that storing the token structures alone might require as much as 10GB to format such a large file.
The question is:
Is it reasonable for clang-format to use 1.4GB for a 30,000-line file?
I guess the question comes down to the number of tokens. I've no idea how we can easily determine that, but I'm thinking the average number of tokens per line must be 4-7, allowing for blank lines and comments having only 1 or 2.
We could likely expect 1,000,000+ tokens in the token stream for a 229,000-line file (at least).
One look at FormatToken tells us it has a large number of member variables, just eyeballing it:
some 19+ unsigned, 12+ booleans, 4 pointers, 1 SmallVector, 1 StringRef, and numerous enumerations.
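The estimates above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in this comment. The tokens-per-line range is the guess stated above rather than a measurement, linear scaling is assumed, and 1 GB is taken as 10**9 bytes.

```python
TOTAL_LINES = 229_000      # approximate size of sqlite3.c
SAMPLE_LINES = 30_000      # size of the measured prefix
SAMPLE_MEMORY_GB = 1.4     # peak memory observed for the prefix
TOKENS_PER_LINE = (4, 7)   # guessed range, not measured

# Linear extrapolation from the prefix to the whole file.
scale = TOTAL_LINES / SAMPLE_LINES
full_memory_gb = scale * SAMPLE_MEMORY_GB

# Estimated token count for the whole file at both ends of the guessed range.
low_tokens, high_tokens = (TOTAL_LINES * t for t in TOKENS_PER_LINE)

# Memory per token implied by the 30,000-line measurement.
sample_bytes = SAMPLE_MEMORY_GB * 10**9
bytes_per_token_high = sample_bytes / (SAMPLE_LINES * TOKENS_PER_LINE[0])
bytes_per_token_low = sample_bytes / (SAMPLE_LINES * TOKENS_PER_LINE[1])

print(f"scale factor:      {scale:.1f}")
print(f"full-file memory:  {full_memory_gb:.1f} GB")
print(f"estimated tokens:  {low_tokens:,} to {high_tokens:,}")
print(f"bytes per token:   {bytes_per_token_low:,.0f} to {bytes_per_token_high:,.0f}")
```

Under these assumptions the measurement works out to several kilobytes per token, which is useful to compare against the size of a single FormatToken when judging whether the token structures alone explain the memory use.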
I really can't say whether clang-format is using too much memory or whether, at this scale, the input is just too much.
I think I'd be more concerned that the 229,000-line file would take 7.6 x 2:31, or roughly 19 minutes, to format.
SQLite3 is a unified source built from the individual files; perhaps it should be formatted in its constituent parts rather than as one amalgamated file.
/**** End of sqlite3rtree.h */
/**** Begin file sqlite3session.h */
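Splitting at those markers could look roughly like this. It is a sketch, assuming the marker comments follow the style quoted above; the exact number of asterisks is an assumption, so the pattern accepts any run of them, and split_at_begin_markers is a hypothetical helper. If files are nested inside one another in the amalgamation, a boundary may fall in the middle of an enclosing file, which this sketch does not handle.

```python
import re

# Matches a whole line such as "/**** Begin file sqlite3session.h */".
BEGIN_FILE = re.compile(r"^/\*+ Begin file (\S+) \*+/$", re.MULTILINE)


def split_at_begin_markers(text: str) -> list[tuple[str, str]]:
    """Split text into (name, chunk) pairs, one per 'Begin file' marker.

    Text before the first marker is returned under the name '<preamble>'.
    Concatenating the chunks in order reproduces the input exactly.
    """
    sections = []
    name = "<preamble>"
    start = 0
    for m in BEGIN_FILE.finditer(text):
        if m.start() > start:
            sections.append((name, text[start:m.start()]))
        name = m.group(1)
        start = m.start()
    if start < len(text):
        sections.append((name, text[start:]))
    return sections


sample = (
    "/* header */\n"
    "/**** Begin file a.h ****/\n"
    "int a;\n"
    "/**** End of a.h ****/\n"
    "/**** Begin file b.c ****/\n"
    "int b;\n"
)
for section_name, chunk in split_at_begin_markers(sample):
    print(section_name, len(chunk))
```

Each chunk could then be formatted on its own and the results concatenated, which is the "format the pieces and stick them back together" approach discussed in this thread.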
I'm not 100% convinced this particular example is worth solving.
MyDeveloperDay
I'll take a look at it this weekend. My codebase has some >1MB headers which would benefit here too, though recent clang-format didn't crash with them. There is something similar in place for #if 0 (which may contain text the lexer doesn't like), but parsing just enough to look for a clang-format on comment will be a bit harder.
Does something similar need to be implemented for DisableFormat: true in style files?
I think native handling of clang-format-ignore files would be a good solution here too. A lot of codebases wrap clang-format to skip running clang-format on ignored files.
Trying to format SQLite ("Source Code" - "Amalgamation" from https://www.sqlite.org/download.html) still crashes with clang-format-9
$ wc sqlite-amalgamation-3320200/sqlite3.c
229616 1049648 8115947 sqlite-amalgamation-3320200/sqlite3.c
$ clang-format-9 < sqlite-amalgamation-3320200/sqlite3.c > /dev/null
LLVM ERROR: out of memory
Stack dump:
0. Program arguments: clang-format-9
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x1f)[0x7ff55208135f]
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(_ZN4llvm3sys17RunSignalHandlersEv+0x50)[0x7ff55207f780]
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(+0xa38761)[0x7ff552081761]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7ff55143c890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7ff54e2f3e97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7ff54e2f5801]
/usr/lib/x86_64-linux-gnu/libLLVM-9.so.1(_ZN4llvm22report_bad_alloc_errorEPKcb+0x93)[0x7ff551fe60a3]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN4llvm23SmallVectorTemplateBaseIN5clang6format13UnwrappedLineELb0EE4growEm+0xbe)[0x7ff550e007be]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format13TokenAnalyzer20consumeUnwrappedLineERKNS0_13UnwrappedLineE+0x121)[0x7ff550dffb21]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format19UnwrappedLineParser5parseEv+0x119)[0x7ff550e10089]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format13TokenAnalyzer7processEv+0xcf)[0x7ff550dfee2f]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format13guessLanguageEN4llvm9StringRefES2_+0x2d8)[0x7ff550de2728]
/usr/lib/x86_64-linux-gnu/libclang-cpp.so.9(_ZN5clang6format8getStyleEN4llvm9StringRefES2_S2_S2_PNS1_3vfs10FileSystemE+0x59)[0x7ff550de2859]
clang-format-9[0x4064ef]
clang-format-9[0x405688]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7ff54e2d6b97]
clang-format-9[0x40509a]
Aborted (core dumped)
I use LLVM-3.9.0-r264047-win32.exe.
Extended Description
When I use Windows 8 64 bit with 8G RAM, it fails immediately:
But when I use Windows 7 32 bit with 2G RAM, the system is very slow, clang-format.exe uses nearly 2G of RAM, and then it fails:
I added /* clang-format off */ as the first line, with NO /* clang-format on */, hoping to skip this error, but it crashes too. Please copy the data between /* clang-format off */ and /* clang-format on */ straight to the output and do NOT parse it, thanks.
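Until something like this exists in clang-format itself, a wrapper can approximate the request for the whole-file case. The sketch below is an illustration only: formatting_disabled_throughout is a hypothetical helper, and it handles just the situation described here, where the first non-blank line turns formatting off and nothing turns it back on.

```python
import re

# Matches "// clang-format off", "/* clang-format on */" and similar comments.
DIRECTIVE = re.compile(r"(?://|/\*)\s*clang-format\s+(on|off)\b")


def formatting_disabled_throughout(source: str) -> bool:
    """Return True if the first non-blank line turns formatting off and no
    later directive turns it back on."""
    first = next((line for line in source.splitlines() if line.strip()), "")
    m = DIRECTIVE.match(first.strip())
    if not m or m.group(1) != "off":
        return False
    # findall returns the captured "on"/"off" word for every directive found.
    return all(word != "on" for word in DIRECTIVE.findall(source))


print(formatting_disabled_throughout("/* clang-format off */\nint  x;\n"))
```

When the check returns True, the wrapper can copy the file through unchanged and never invoke clang-format on it, so the file is not tokenized at all.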
We find that many large files will crash it. This is a public file: sqlite3.c, 3.85 MB (4,039,926 bytes): This file is an amalgamation of many separate C source files from SQLite version 3.6.23.