smhc closed this issue 3 years ago.
Thanks @smhc . This is a very good report. I will look into it this week.
I'm reasonably sure this was recently introduced, i.e. it didn't occur on 3.0.1 or with the commit that switched to random IDs instead of PIDs.
That would be an interesting side effect. (The citnames process does not see the random IDs.)
I was looking into the citnames code, and it is full of repeated linear list iterations, O(n^2) at least. I will try to come up with something more reasonable. :)
To be clear, the bug didn't occur with the random ID change; i.e. it was seemingly introduced in a later commit. But it's also possible I just didn't notice it before, as it may have resolved itself when left unattended.
If it can't be sped up, would it be possible to have an '--append' mode that skips deduplication (or whatever processing it is doing)? For scenarios where you know the additional compilation doesn't recompile anything already processed.
I could reproduce this issue on a generated compilation database.
It was blowing up quadratically: doubling the input size caused 4-5 times longer processing. The problem was in the compilation database entries' merge function. I replaced the linear searches with a hash lookup (the same logic as in the 2.4 versions).
The fix is on the devel branch. It will be released as 3.0.3.
Happy to re-open this ticket if it still persists for you, @smhc.
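A minimal sketch of the kind of fix described, assuming entries are keyed by their directory, file, and command fields (the key choice and function name here are illustrations, not Bear's actual code):

```python
def merge_entries(existing, incoming):
    """Merge two lists of compilation database entries, dropping duplicates.

    A set of hashable keys gives O(1) membership tests, so the merge runs
    in O(n + m) instead of the O(n * m) cost of scanning `existing` once
    for every incoming entry.
    """
    def key(entry):
        # Identity of an entry: where it ran, which file, what command.
        return (entry["directory"], entry["file"], entry["command"])

    seen = {key(e) for e in existing}
    merged = list(existing)
    for entry in incoming:
        if key(entry) not in seen:
            seen.add(key(entry))
            merged.append(entry)
    return merged
```

With this shape, appending a batch that repeats already-known entries costs one hash lookup per entry rather than a full scan of the existing database.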
I only just got around to re-testing this. I am still seeing issues with excessive CPU/disk usage during generation of the compile commands. A very big 'compile_commands.commands.json' file is being generated: over 1.5 GB in size, even though my final compile_commands.json should only be around 15 MB.
The file contents consist of every single command run, plus the entire captured environment, e.g.:
{
    "command": {
        "arguments": [
            "grep",
            "^50058$"
        ],
        "environment": {
            "ANT_HOME": "/sxxx/apache-ant/rel",
            "APR_HOME": "/sxxx/product/apr-1.6.5",
            "APR_UTIL_HOME": "/sxxx/apr-util-1.6.1",
            etc etc etc
This results in a very large file for my build. It also brings the machine to a crawl, and I have to kill it, so I cannot capture stack traces. However, I assume it is a result of trying to process or generate this very large file.
Is it necessary to capture the full environment per command? Is it necessary to capture every command, even those not matching a compiler name?
Hey @smhc , interesting area that you are testing. :)
Bear 3.0 split the command capture and the compilation database creation into two processes. This might not be as efficient as when all of this happened in a single process, but it allows extending the functionality and covering more use cases. Single-purpose tools are better, IMO, than a Swiss Army knife. :)
I will try to optimize the processes to be less resource intensive and to deal with big projects more smoothly.
The codebase I'm working on isn't particularly large compared to something like LLVM or the Linux kernel, but I suppose my build system does a lot more than call gcc from Makefiles. However, I'm confident compiling something like LLVM would trigger the problem. While a re-write may make the code more maintainable, end-user performance can't degrade too much, otherwise there's no incentive to upgrade :)
Pre-filtering the list to known compiler names and removing environment vars would cut it down considerably. Is there a reason the env vars need to be captured? They aren't output to compile_commands.json, but maybe the values need to be stored to denormalise the args of child processes?
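As an illustration of the pre-filtering idea, assuming each captured event carries an arguments list (the compiler name set below is hypothetical, not Bear's actual recognizer):

```python
import os

# Illustrative set of compiler basenames; Bear's real recognizer is richer
# (cross-compiler prefixes, wrappers like ccache, etc.).
COMPILERS = {"cc", "gcc", "g++", "c++", "clang", "clang++"}

def is_compiler_call(arguments):
    """Return True if a captured command looks like a compiler invocation."""
    if not arguments:
        return False
    return os.path.basename(arguments[0]) in COMPILERS
```

Keeping only events that pass such a check would drop entries like the ["grep", "^50058$"] example above before they ever reach the intermediate file.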
Some parts of the environment contribute to the final compilation database. (GCC uses these environment variables: CPATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH.) And I wanted to make the intercept command useful for purposes other than generating the compilation database, so I can't predict which environment variables need to be forwarded.
But you are right, that can be one place to trim the data size. (Filter out unknown environment variables and record only those which are relevant for the compilation database creation.)
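A sketch of that filtering, using only the GCC variables mentioned in this thread as the allow-list (the function name and shape are assumptions for illustration):

```python
# Variables GCC consults for implicit include paths; anything else would be
# dropped from the captured event before it is written to disk.
RELEVANT_ENV = {"CPATH", "C_INCLUDE_PATH", "CPLUS_INCLUDE_PATH"}

def trim_environment(environment):
    """Keep only environment variables relevant to database creation."""
    return {k: v for k, v in environment.items() if k in RELEVANT_ENV}
```

Against the captured example above, this would reduce a large environment block (ANT_HOME, APR_HOME, ...) to at most three entries per command.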
> GCC is using these environment variables: CPATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH
Doesn't clangd query gcc to obtain this information without needing the compilation database? The values could change per invocation, or a tool other than clangd could be used; but for those using clangd without modifying those env vars, excluding them could save considerable processing time and file size. Would it be practical to have an option to suppress env var capturing/processing?
Yes, as I said earlier, an option to trim the captured information could work. (Although I would do it later, if other options fail to improve the performance.)
I don't fully get the story about Clangd. To my knowledge (which is not fresh), there are many uses of a compilation database, and Clangd is just one of them... The compilation database shall contain all the information needed to re-run the compilation of a given module. (To be more precise, we want an ordinary tool to build the same AST as the compiler did when it built the object file.) Therefore Bear filters out irrelevant flags and adds some implicit flags from the environment.
And I understand the frustration that a slow tool can cause. I ask for your patience while I fix these glitches.
I believe clangd queries the compiler to derive implicit compilation flags, e.g.:
g++ -Wp,-v -E -x c++ -dD -dM /dev/null
These aren't captured by bear and must be derived from the environment by the tools. I'm not sure if it also looks at environment variables or whether the command above is impacted by them.
IMHO the final compile_commands.json should be similar to what CMake outputs; I'm not sure whether that includes the impact of env variables. If CMake doesn't include them, it would be good for Bear to have a switch to exclude them as well, for both performance and compatibility.
> And I understand the frustration that a slow tool can cause. I ask your patience to fix these glitches.
No rush, I appreciate the hard work. With performance and stability the main motivating factors to upgrade, I think everyone is happy to wait for the kinks to get worked out.
Let me illustrate what Bear does with an example. Let's say the project build is done with this shell script:
#!/usr/bin/env sh
export C_INCLUDE_PATH=/usr/include/dependency:.
cd /home/user/sources
$CC -c -D_PROJECT=secret -Wall source.c -o source.o
$CC -o a.out source.o
The JSON compilation database for this project shall be:
[
    {
        "file": "/home/user/sources/source.c",
        "directory": "/home/user/sources",
        "command": "/usr/bin/cc -c -D_PROJECT=secret -Wall -I /usr/include/dependency -I . source.c -o source.o"
    }
]
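The -I flags in that entry come from the C_INCLUDE_PATH value set in the build script; a sketch of that expansion (an illustration, not Bear's actual implementation):

```python
def include_flags_from_env(environment):
    """Translate GCC's C_INCLUDE_PATH into explicit -I flags.

    Mirrors the example above, where the colon-separated value
    "/usr/include/dependency:." becomes "-I /usr/include/dependency -I .".
    """
    flags = []
    for path in environment.get("C_INCLUDE_PATH", "").split(":"):
        if path:  # skip empty segments from leading/trailing colons
            flags += ["-I", path]
    return flags
```

For the script above, include_flags_from_env({"C_INCLUDE_PATH": "/usr/include/dependency:."}) yields ["-I", "/usr/include/dependency", "-I", "."].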
Let's say the user wants to run IWYU (or any other Clang LibTooling-based program); then he or she needs to call: iwyu_tool.py -p .
(No Clangd involved.) The environment where that command is executed differs from the one where the original compilation happened.
I think the output of Bear is very similar to what CMake would create (if the project build were written in CMake rather than a shell script).
As far as I understand, you have a problem with the size of the intermediate compile_commands.commands.json file, which is produced by the intercept command. I agree, the JSON format is too verbose for it. But instead of throwing away (or filtering) the environment capture, I would first try a more compact format. (And if that doesn't help, then go for filtering out environment variables.)
Can we re-open this? Or will it be covered by https://github.com/rizsotto/Bear/issues/343 ?
I'm not sure using SQLite will fix the problem of capturing every single command unnecessarily. If it still captures everything, I believe it will still use excessive disk space.
Hey @smhc, this ticket was trying to address the "time spent in citnames". Is that still an issue? If yes, let's re-open it.
#343, which I'm working on, will solve the big memory consumption issue. Whether it will consume too much disk space, we will see when it's there. (But it seems memory consumption is a more urgent issue for users than disk usage.)
Is the memory usage high due to processing this large intermediate file?
Would both the file size and memory usage be reduced if this intermediate file contained only the information that is required? I.e. fixing this may negate the need for SQLite in the first place.
Certainly the size of this file is excessive and the processing time makes it unusable; whether that is solved by the use of SQLite I am not sure. But if all the same information is captured and stored, albeit in a different format, I suspect it will still be problematic.
I think it's worth re-opening the issue, and I can re-test after the SQLite changes are available. However, I would imagine anyone building a reasonably sized project would be impacted in the same way.
Perhaps there needs to be a generic "Regression processing medium to large sized project" issue that is left open until it's possible to build wine/llvm/kernel/whatever in a similar time/complexity as bear 2.x.
If you have ideas for how to automate such a regression test, that would be helpful. (Or better: if you can send a PR that sets up a new GitHub Action to verify the improvements, that would be a big help to me.)
Describe the bug
When using --append on a relatively large compile_commands.json (about 15 MB), citnames uses excessive CPU and spins for a long time (10+ minutes before I killed it). It also creates a large (several GB) 'compile_commands.commands.json' file that contains nothing but '[]' during this process.
A few stack traces during this process:
To Reproduce
Create a large-ish compile_commands.json, then attempt to append some more commands to it using the '--append' switch.
Expected behavior
Similar performance to Bear 2.4, i.e. virtually instant.
Environment:
Additional context
Before you send...
README.md file or man bear? Y