rizsotto / Bear

Bear is a tool that generates a compilation database for clang tooling.
GNU General Public License v3.0

Excessive time spent in citnames (cs::semantic::Tools::transform) with --append #325

Closed. smhc closed this issue 3 years ago.

smhc commented 3 years ago

Describe the bug
When using --append on a relatively large compile_commands.json (about 15 MB), citnames uses excessive CPU and spins for a long time (10+ minutes before I killed it). It also creates a large 'compile_commands.commands.json' file, several GB in size, that contains nothing but '[]' during this process.

A few stack traces taken during this process:

#0  0x000000000043d7f7 in cs::semantic::ToolGcc::recognize(std::filesystem::path const&) const ()
#1  0x00000000004263d5 in cs::semantic::Tools::select(report::Command const&) const ()
#2  0x00000000004269e4 in cs::semantic::Tools::recognize(report::Execution const&) const ()
#3  0x0000000000426e1d in _ZNSt17_Function_handlerIFN4rust6ResultISt4listISt10shared_ptrIN2cs8semantic8SemanticEESaIS7_EESt13runtime_errorEERKN6report9ExecutionEEZNKS5_5Tools9transformERKNSC_6ReportEEUlRKT_E1_E9_M_invokeERKSt9_Any_dataSF_ ()
#4  0x000000000042761d in cs::semantic::Tools::transform(report::Report const&) const ()
#5  0x000000000041ecfd in cs::Application::operator()() const ()
#6  0x0000000000411be4 in main ()

#0  0x00007f62b1ccf080 in _int_malloc () from /lib64/libc.so.6
#1  0x00007f62b1cd27dc in malloc () from /lib64/libc.so.6
#2  0x00007f62b2590ecd in operator new(unsigned long) () from /lib64/libstdc++.so.6
#3  0x000000000042be39 in std::vector<std::sub_match<char const*>, std::allocator<std::sub_match<char const*> > >::_M_default_append(unsigned long) ()
#4  0x000000000042ef43 in bool std::__detail::__regex_algo_impl<char const*, std::allocator<std::sub_match<char const*> >, char, std::regex_traits<char>, (std::__detail::_RegexExecutorPolicy)0, true>(char const*, char const*, std::match_results<char const*, std::allocator<std::sub_match<char const*> > >&, std::basic_regex<char, std::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
#5  0x000000000043d877 in cs::semantic::ToolGcc::recognize(std::filesystem::path const&) const ()
#6  0x00000000004263d5 in cs::semantic::Tools::select(report::Command const&) const ()
#7  0x00000000004269e4 in cs::semantic::Tools::recognize(report::Execution const&) const ()
#8  0x0000000000426e1d in _ZNSt17_Function_handlerIFN4rust6ResultISt4listISt10shared_ptrIN2cs8semantic8SemanticEESaIS7_EESt13runtime_errorEERKN6report9ExecutionEEZNKS5_5Tools9transformERKNSC_6ReportEEUlRKT_E1_E9_M_invokeERKSt9_Any_dataSF_ ()
#9  0x000000000042761d in cs::semantic::Tools::transform(report::Report const&) const ()
#10 0x000000000041ecfd in cs::Application::operator()() const ()
#11 0x0000000000411be4 in main ()

#0  0x00007f62b25f0d40 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
#1  0x0000000000426375 in cs::semantic::Tools::select(report::Command const&) const ()
#2  0x00000000004269e4 in cs::semantic::Tools::recognize(report::Execution const&) const ()
#3  0x0000000000426e1d in _ZNSt17_Function_handlerIFN4rust6ResultISt4listISt10shared_ptrIN2cs8semantic8SemanticEESaIS7_EESt13runtime_errorEERKN6report9ExecutionEEZNKS5_5Tools9transformERKNSC_6ReportEEUlRKT_E1_E9_M_invokeERKSt9_Any_dataSF_ ()
#4  0x000000000042761d in cs::semantic::Tools::transform(report::Report const&) const ()
#5  0x000000000041ecfd in cs::Application::operator()() const ()
#6  0x0000000000411be4 in main ()

#0  0x00000000004a7c31 in std::filesystem::path::_M_split_cmpts() ()
#1  0x000000000043ba2e in (anonymous namespace)::source_file(cs::semantic::CompilerFlag const&) ()
#2  0x000000000043c8a4 in _ZZNK2cs8semantic7ToolGcc12compilationsERKN6report7CommandEENKUlT_E_clISt4listINS0_12CompilerFlagESaISA_EEEEN4rust6ResultIS9_ISt10shared_ptrINS0_8SemanticEESaISH_EESt13runtime_errorEES6_.isra.0 ()
#3  0x000000000043d6ef in cs::semantic::ToolGcc::compilations(report::Command const&) const ()
#4  0x0000000000426c1e in cs::semantic::Tools::recognize(report::Execution const&) const ()
#5  0x0000000000426e1d in _ZNSt17_Function_handlerIFN4rust6ResultISt4listISt10shared_ptrIN2cs8semantic8SemanticEESaIS7_EESt13runtime_errorEERKN6report9ExecutionEEZNKS5_5Tools9transformERKNSC_6ReportEEUlRKT_E1_E9_M_invokeERKSt9_Any_dataSF_ ()
#6  0x000000000042761d in cs::semantic::Tools::transform(report::Report const&) const ()
#7  0x000000000041ecfd in cs::Application::operator()() const ()
#8  0x0000000000411be4 in main ()

#0  0x000000000042df91 in std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_Match_mode, long) ()
#1  0x000000000042e02f in std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_Match_mode, long) ()
#2  0x000000000042e02f in std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_Match_mode, long) ()
#3  0x000000000042e013 in std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_Match_mode, long) ()
#4  0x000000000042e1d4 in std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_Match_mode, long) ()
#5  0x000000000042e1d4 in std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<char const*, std::allocator<std::sub_match<char const*> >, std::regex_traits<char>, true>::_Match_mode, long) ()
#6  0x000000000042f2f1 in bool std::__detail::__regex_algo_impl<char const*, std::allocator<std::sub_match<char const*> >, char, std::regex_traits<char>, (std::__detail::_RegexExecutorPolicy)0, true>(char const*, char const*, std::match_results<char const*, std::allocator<std::sub_match<char const*> > >&, std::basic_regex<char, std::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
#7  0x000000000043d877 in cs::semantic::ToolGcc::recognize(std::filesystem::path const&) const ()
#8  0x00000000004263d5 in cs::semantic::Tools::select(report::Command const&) const ()
#9  0x00000000004269e4 in cs::semantic::Tools::recognize(report::Execution const&) const ()
#10 0x0000000000426e1d in _ZNSt17_Function_handlerIFN4rust6ResultISt4listISt10shared_ptrIN2cs8semantic8SemanticEESaIS7_EESt13runtime_errorEERKN6report9ExecutionEEZNKS5_5Tools9transformERKNSC_6ReportEEUlRKT_E1_E9_M_invokeERKSt9_Any_dataSF_ ()
#11 0x000000000042761d in cs::semantic::Tools::transform(report::Report const&) const ()
#12 0x000000000041ecfd in cs::Application::operator()() const ()
#13 0x0000000000411be4 in main ()

To Reproduce
Create a large-ish compile_commands.json, then attempt to append some more commands to this file using the '--append' switch.

Expected behavior
Performance similar to Bear 2.4, i.e. virtually instant.

rizsotto commented 3 years ago

Thanks @smhc. This is a very good report. I will look into it this week.

smhc commented 3 years ago

I'm reasonably sure this was recently introduced, i.e. it didn't occur on 3.0.1 or with the commit that switched to random IDs instead of PIDs.

rizsotto commented 3 years ago

That would be an interesting side effect. (The citnames process does not see the random IDs.)

I was looking into the citnames code, and it is full of repeated linear list iterations, O(n^2) at least. I will try to come up with something more reasonable. :)

smhc commented 3 years ago

To be clear, the bug didn't occur with the random ID change, i.e. it was seemingly introduced in a later commit. But it's also possible I just didn't notice it before, as it may have resolved itself when left unattended.

If it can't be sped up, would it be possible to have an '--append' mode that skips deduplication (or whatever processing it is doing)? That would cover scenarios where you know the additional compilation is not recompiling anything already processed.

rizsotto commented 3 years ago

I could reproduce this issue on a generated compilation database.

It was blowing up quadratically: doubling the input size caused 4-5 times longer processing times. It was an issue in the compilation database entries' merge function; I replaced the linear searches with a hash lookup (the same logic as in the 2.4 versions).
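
A minimal sketch of the hash-lookup idea (hypothetical, simplified types and names; not the actual Bear code):

#include <string>
#include <unordered_set>
#include <vector>

// Simplified stand-in for a compilation database entry.
struct Entry {
    std::string file;
    std::string directory;
    std::string command;
};

// Deduplication key; '\0' separators avoid accidental collisions
// across field boundaries.
std::string key_of(const Entry& entry) {
    return entry.file + '\0' + entry.directory + '\0' + entry.command;
}

// Merge `updates` into `db`, skipping duplicates. Building the hash
// set once makes this O(n + m), instead of the O(n * m) you get by
// scanning `db` linearly for every appended entry.
void merge(std::vector<Entry>& db, const std::vector<Entry>& updates) {
    std::unordered_set<std::string> seen;
    seen.reserve(db.size() + updates.size());
    for (const Entry& entry : db) {
        seen.insert(key_of(entry));
    }
    for (const Entry& entry : updates) {
        if (seen.insert(key_of(entry)).second) {
            db.push_back(entry);
        }
    }
}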

The fix is on the devel branch. Will be released as 3.0.3.

Happy to re-open this ticket if it still persists for you @smhc

smhc commented 3 years ago

Only just got around to re-testing this. I am still seeing issues with excessive CPU/disk usage during generation of the compile commands. There is a very big 'compile_commands.commands.json' file being generated. This is over 1.5 GB in size, even though my final compile_commands.json should only be around 15 MB.

The file consists of every single command run and the entire set of captured environment variables, e.g.:

        {
            "command": {
                "arguments": [
                    "grep",
                    "^50058$"
                ],
                "environment": {
                    "ANT_HOME": "/sxxx/apache-ant/rel",
                    "APR_HOME": "/sxxx/product/apr-1.6.5",
                    "APR_UTIL_HOME": "/sxxx/apr-util-1.6.1",
                     etc etc etc

This results in a very large file for my build. It also brings the machine to a crawl, and I have to kill it, so I cannot capture stack traces. However, I assume it is a result of trying to process or generate this very large file.

Is it necessary to capture the full environment per command? Is it necessary to capture every command, even those not matching a compiler name?

rizsotto commented 3 years ago

Hey @smhc, interesting area that you are testing. :)

Bear 3.0 split the command capture and the compilation database creation into two processes. This might not be as efficient as when all of this happened in a single process, but it allows extending the functionality and covering more use cases. Single-purpose tools are better, IMO, than a Swiss Army knife. :)

I will try to optimize the processes to be less resource intensive and to deal with big projects more smoothly.

smhc commented 3 years ago

The codebase I'm working on isn't particularly large compared to something like LLVM or the Linux kernel, but I suppose my build system does a lot more than call gcc from Makefiles. However, I'm confident compiling something like LLVM would trigger the problem. While a rewrite may make the code more maintainable, end-user performance can't degrade too much, otherwise there's no incentive to upgrade :)

Pre-filtering the list to compiler names and removing environment vars would cut it down considerably. Is there a reason the env vars need to be captured? They aren't output to compile_commands.json, but maybe the values need to be stored to denormalise the args of child processes?

rizsotto commented 3 years ago

Some parts of the environment contribute to the final compilation database. (GCC uses these environment variables: CPATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH.) And I wanted to make the intercept command useful for other purposes than generating the compilation database, so I can't predict which environment variables need to be forwarded.

But you are right, that can be one place where we can trim the data size: filter out unknown environment variables and record only those which are relevant for the compilation database creation.
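
A minimal sketch of that filtering idea (hypothetical names; the list below contains only the GCC variables mentioned above, plus PATH as an assumption for resolving the compiler):

#include <array>
#include <map>
#include <string>

// Variables assumed relevant for compilation database creation.
constexpr std::array<const char*, 4> RELEVANT = {
    "CPATH", "C_INCLUDE_PATH", "CPLUS_INCLUDE_PATH", "PATH",
};

// Keep only the relevant subset of a captured environment.
std::map<std::string, std::string>
filter_environment(const std::map<std::string, std::string>& captured) {
    std::map<std::string, std::string> result;
    for (const auto& [name, value] : captured) {
        for (const char* relevant : RELEVANT) {
            if (name == relevant) {
                result.emplace(name, value);
                break;
            }
        }
    }
    return result;
}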

smhc commented 3 years ago

> GCC uses these environment variables: CPATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH

Doesn't clangd query gcc to obtain this information without needing the compilation database? It's possible it could change per invocation, or that a tool other than clangd could be used, but for those using clangd and not modifying those env vars, excluding them could save considerable processing and file size. Would it be practical to have an option to suppress env var capturing/processing?

rizsotto commented 3 years ago

Yes, as I said earlier, it could be an option to trim the captured information. (Although I would do that later, if other options fail to improve the performance.)

I don't fully get the story about Clangd. To my knowledge (which is not fresh) there are many users of a compilation database, and Clangd is just one of them... The compilation database shall contain all the information needed to re-run the compilation of a given module. (To be more precise, we want an ordinary tool to build the same AST as the compiler did when it built the object file.) Therefore Bear filters out irrelevant flags and adds some implicit flags from the environment.
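
A minimal sketch of that last step for one of the variables (hypothetical names, not the actual Bear code; it assumes the colon-separated format the GCC manual documents, where an empty element means the current directory):

#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Expand C_INCLUDE_PATH into explicit -I flags, so the database
// entry no longer depends on the environment it was captured in.
std::vector<std::string> implicit_include_flags() {
    std::vector<std::string> flags;
    if (const char* raw = std::getenv("C_INCLUDE_PATH")) {
        std::stringstream stream(raw);
        std::string dir;
        while (std::getline(stream, dir, ':')) {
            flags.push_back("-I");
            flags.push_back(dir.empty() ? "." : dir);
        }
    }
    return flags;
}

Given C_INCLUDE_PATH=/usr/include/dependency:. (as in the example further down), this returns the flags -I /usr/include/dependency -I . that end up in the database entry.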

And I understand the frustration that a slow tool can cause. I ask for your patience while I fix these glitches.

smhc commented 3 years ago

I believe clangd queries the compiler to derive compilation flags that are implicit, e.g.:

g++ -Wp,-v -E -xc -dD -x c++ -dM /dev/null

These aren't captured by Bear and must be derived from the environment by the tools. I'm not sure if clangd also looks at environment variables, or whether the command above is affected by them.

IMHO the final compile_commands.json should be similar to what is output by CMake; I'm not sure if that includes env variable impacts or not. If CMake doesn't, it would be good for Bear to have a switch to exclude them as well, for both performance and compatibility.

> And I understand the frustration that a slow tool can cause. I ask for your patience while I fix these glitches.

No rush, I appreciate the hard work. With performance and stability being the main motivating factors to upgrade, I think everyone is happy to wait for the kinks to be worked out.

rizsotto commented 3 years ago

Let me illustrate what Bear does with an example. Let's say the project build is done with this shell script:

#!/usr/bin/env sh

export C_INCLUDE_PATH=/usr/include/dependency:.
cd /home/user/sources
$CC -c -D_PROJECT=secret -Wall source.c -o source.o
$CC -o a.out source.o

The JSON compilation database for this project shall be:

[
  {
    "file": "/home/user/sources/source.c",
    "directory": "/home/user/source",
    "command": "/usr/bin/cc -c -D_PROJECT=secret -Wall -I /usr/include/dependency -I . source.c -o source.o"
  }
]

Let's say the user wants to run IWYU (or any other Clang LibTooling based program); then they need to call this command: iwyu_tool.py -p . (No Clangd involved.) The environment where this command is executed is different from the one where the original compilation happened.

I think the output of Bear is very similar to what CMake would create (if the project build was written in CMake and not a shell script).

As far as I understand, your problem is with the size of the intermediate compile_commands.commands.json file, which is produced by the intercept command. I agree, the JSON is too verbose for it. But instead of throwing away (or filtering) the environment capture, I would first try a more compact format. (And if that doesn't help, then go for filtering out environment variables.)

smhc commented 3 years ago

Can we re-open this? Or will it be covered by https://github.com/rizsotto/Bear/issues/343?

I'm not sure using SQLite will fix the problem of capturing every single command unnecessarily. I believe it will still use excessive disk space if it does.

rizsotto commented 3 years ago

Hey @smhc, this ticket was trying to address the "time spent in citnames". Is that still an issue? If yes, let's re-open it.

The #343 I'm working on will solve the big memory consumption issue. Whether it will consume too much disk space or not, we will see when it's there. (But it seems memory consumption is a more urgent issue for users than disk usage.)

smhc commented 3 years ago

Is the memory usage high due to processing this large intermediate file?

Would both the file size and memory usage be reduced if this intermediate file contained only the information that is required? I.e. fixing this may negate the need to use SQLite in the first place.

Certainly the size of this file is excessive and the processing time makes it unusable; whether that is solved by the use of SQLite I am not sure. But if all the same information is captured and stored, albeit in a different format, I suspect it will still be problematic.

I think it's worth re-opening the issue, and I can re-test after the SQLite changes are available. However, I would imagine anyone building a reasonably sized project would be impacted in the same way.

Perhaps there needs to be a generic "Regression processing medium to large sized projects" issue that is left open until it's possible to build wine/llvm/kernel/whatever in a time similar to Bear 2.x.

rizsotto commented 3 years ago

#350 is the ticket to track the performance degradations. Please follow up on that ticket. (I would not re-open this one, since the time spent in citnames is no longer an issue.)

If you have ideas on how to automate such a regression test, that would be helpful. (Or even better, if you can send a PR that sets up a new GitHub Action to verify the improvements, that would be a big help to me.)