rizsotto / Bear

Bear is a tool that generates a compilation database for clang tooling.
GNU General Public License v3.0
4.76k stars 312 forks source link

Bear is stuck #443

Open marxin opened 2 years ago

marxin commented 2 years ago

I can't build GCC with bear, where it's stack at the end of compilation:

bear -- make -j16
...
libtool: link: ( cd ".libs" && rm -f "libgfortran.la" && ln -s "../libgfortran.la" "libgfortran.la" )
make[3]: Leaving directory '/dev/shm/objdir/x86_64-pc-linux-gnu/libgfortran'
make[2]: Leaving directory '/dev/shm/objdir/x86_64-pc-linux-gnu/libgfortran'
make[1]: Leaving directory '/dev/shm/objdir'

There's process tree: Screenshot from 2022-01-31 12-40-28

Where I'm using the latest release:

$ bear --version
bear 3.0.18
rizsotto commented 2 years ago

Hey Martin, I need more context in order to help. Could you fill out the issue template? Also, run the command with verbose flag and attach the output? Thanks!

marxin commented 2 years ago

All right, so info from the issue template would be:

uname -a
Linux marxinbox.suse.cz 5.16.2-1-default #1 SMP PREEMPT Mon Jan 24 18:27:48 UTC 2022 (0d710a8) x86_64 x86_64 x86_64 GNU/Linux

Bear is from openSUSE TW distribution, normally installed package.

Using --verbose leads to a different error:

$ bear --verbose -- make
...
g++: fatal error: cannot execute 'cc1plus': execvp: No such file or directory
...

bear.log.txt

rizsotto commented 2 years ago

Interesting log!!!

I've seen the message g++: fatal error: cannot execute 'cc1plus': execvp: No such file or directory when the PATH environment is empty. And from the logs I see it is empty from the very first commands after Bear executes make.

Is that possible that the Makefile sets the environment empty?

Is this a GCC build? This was problematic on Kali linux too. GCC is executes cc1 or cc1plus, which is not in the PATH. (But it knows the location for it.) What I would expect that the GCC driver program (gcc, cc, g++, c++, etc.) executes the cc1plus with full path. But what it does is just call execvp only with the name of the program, which suppose to search in the PATH.

I don't really know how to fix empty PATH execution for GCC. I know that it works without Bear. (/usr/bin/env - /usr/bin/gcc -c /dev/null works just fine.)

marxin commented 2 years ago

So what's weird is that w/o the --verbose argument it works (until the end where it's stuck).

Plus I think GCC drive uses execve with a full path if I see correctly:

strace -f -s 512 g++ -fcf-protection -fno-PIE -c  -DIN_GCC_FRONTEND -DIN_GCC_FRONTEND -g       -DIN_GCC -fPIC    -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -fno-common  -DHAVE_CONFIG_H -I. -Ic-family -I/home/marxin/Programming/gcc/gcc -I/home/marxin/Programming/gcc/gcc/c-family -I/home/marxin/Programming/gcc/gcc/../include -I/home/marxin/Programming/gcc/gcc/../libcpp/include -I/home/marxin/Programming/gcc/gcc/../libcody  -I/home/marxin/Programming/gcc/gcc/../libdecnumber -I/home/marxin/Programming/gcc/gcc/../libdecnumber/bid -I../libdecnumber -I/home/marxin/Programming/gcc/gcc/../libbacktrace   -o c-family/c-common.o -MT c-family/c-common.o -MMD -MP -MF c-family/.deps/c-common.TPo /home/marxin/Programming/gcc/gcc/c-family/c-common.cc 2>&1 | grep execv
execve("/usr/bin/g++", ["g++", "-fcf-protection", "-fno-PIE", "-c", "-DIN_GCC_FRONTEND", "-DIN_GCC_FRONTEND", "-g", "-DIN_GCC", "-fPIC", "-fno-exceptions", "-fno-rtti", "-fasynchronous-unwind-tables", "-W", "-Wall", "-Wno-narrowing", "-Wwrite-strings", "-Wcast-qual", "-Wmissing-format-attribute", "-Woverloaded-virtual", "-pedantic", "-Wno-long-long", "-Wno-variadic-macros", "-Wno-overlength-strings", "-fno-common", "-DHAVE_CONFIG_H", "-I.", "-Ic-family", "-I/home/marxin/Programming/gcc/gcc", "-I/home/marxin/Programming/gcc/gcc/c-family", "-I/home/marxin/Programming/gcc/gcc/../include", "-I/home/marxin/Programming/gcc/gcc/../libcpp/include", "-I/home/marxin/Programming/gcc/gcc/../libcody", "-I/home/marxin/Programming/gcc/gcc/../libdecnumber", "-I/home/marxin/Programming/gcc/gcc/../libdecnumber/bid", "-I../libdecnumber", "-I/home/marxin/Programming/gcc/gcc/../libbacktrace", "-o", "c-family/c-common.o", "-MT", "c-family/c-common.o", "-MMD", "-MP", "-MF", "c-family/.deps/c-common.TPo", "/home/marxin/Programming/gcc/gcc/c-family/c-common.cc"], 0x7fffffffdbf8 /* 81 vars */) = 0
[pid 23829] execve("/usr/lib64/gcc/x86_64-suse-linux/11/cc1plus", ["/usr/lib64/gcc/x86_64-suse-linux/11/cc1plus", "-quiet", "-I", ".", "-I", "c-family", "-I", "/home/marxin/Programming/gcc/gcc", "-I", "/home/marxin/Programming/gcc/gcc/c-family", "-I", "/home/marxin/Programming/gcc/gcc/../include", "-I", "/home/marxin/Programming/gcc/gcc/../libcpp/include", "-I", "/home/marxin/Programming/gcc/gcc/../libcody", "-I", "/home/marxin/Programming/gcc/gcc/../libdecnumber", "-I", "/home/marxin/Programming/gcc/gcc/../libdecnumber/bid", "-I", "../libdecnumber", "-I", "/home/marxin/Programming/gcc/gcc/../libbacktrace", "-MMD", "c-family/c-common.d", "-MF", "c-family/.deps/c-common.TPo", "-MP", "-MT", "c-family/c-common.o", "-D_GNU_SOURCE", "-D", "IN_GCC_FRONTEND", "-D", "IN_GCC_FRONTEND", "-D", "IN_GCC", "-D", "HAVE_CONFIG_H", "/home/marxin/Programming/gcc/gcc/c-family/c-common.cc", "-quiet", "-dumpdir", "c-family/", "-dumpbase", "c-common.cc", "-dumpbase-ext", ".cc", "-mtune=generic", "-march=x86-64", "-g", "-Wextra", "-Wall", "-Wno-narrowing", "-Wwrite-strings", "-Wcast-qual", "-Wsuggest-attribute=format", "-Woverloaded-virtual", "-Wpedantic", "-Wno-long-long", "-Wno-variadic-macros", "-Wno-overlength-strings", "-fcf-protection=full", "-fPIC", "-fno-exceptions", "-fno-rtti", "-fasynchronous-unwind-tables", "-fno-common", "-o", "/tmp/ccUlqdms.s"], 0x501ec0 /* 85 vars */ <unfinished ...>
[pid 23829] <... execve resumed>)       = 0
[pid 23829] write(3, "builtin_dgettext\"\n.LC1001:\n\t.string\t\"__builtin_dwarf_cfa\"\n.LC1002:\n\t.string\t\"__builtin_dwarf_sp_column\"\n.LC1003:\n\t.string\t\"__builtin_eh_return\"\n\t.align 8\n.LC1004:\n\t.string\t\"__builtin_eh_return_data_regno\"\n.LC1005:\n\t.string\t\"__builtin_execl\"\n.LC1006:\n\t.string\t\"__builtin_execlp\"\n.LC1007:\n\t.string\t\"__builtin_execle\"\n.LC1008:\n\t.string\t\"__builtin_execv\"\n.LC1009:\n\t.string\t\"__builtin_execvp\"\n.LC1010:\n\t.string\t\"__builtin_execve\"\n.LC1011:\n\t.string\t\"__builtin_exit\"\n.LC1012:\n\t.string\t\"__builtin_expect\"\n\t.align 8\n.LC10"..., 4096) = 4096
[pid 23830] execve("/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/as", ["/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/as", "-I", ".", "-I", "c-family", "-I", "/home/marxin/Programming/gcc/gcc", "-I", "/home/marxin/Programming/gcc/gcc/c-family", "-I", "/home/marxin/Programming/gcc/gcc/../include", "-I", "/home/marxin/Programming/gcc/gcc/../libcpp/include", "-I", "/home/marxin/Programming/gcc/gcc/../libcody", "-I", "/home/marxin/Programming/gcc/gcc/../libdecnumber", "-I", "/home/marxin/Programming/gcc/gcc/../libdecnumber/bid", "-I", "../libdecnumber", "-I", "/home/marxin/Programming/gcc/gcc/../libbacktrace", "--gdwarf-5", "--64", "-o", "c-family/c-common.o", "/tmp/ccUlqdms.s"], 0x501ec0 /* 85 vars */ <unfinished ...>
[pid 23830] <... execve resumed>)       = 0
marxin commented 2 years ago

And yes, cc1plus is really not on PATH:

$ which cc1plus
which: no cc1plus in (/home/marxin/bin/valgrind/bin:/home/marxin/.local/bin:/home/marxin/bin:/usr/local/bin:/usr/bin:/bin:/home/marxin/Programming/gcc-util/boilerplate:/home/marxin/Programming/gcc-util/dumps:/home/marxin/Programming/script-misc)
rizsotto commented 2 years ago

Nice catch, Bear reports execvp for that execution. Maybe that's going to be the problem. Will look at it on the weekend.

rizsotto commented 2 years ago

I'm trying to reproduce it with the latest master, but the bug does not show up... The verbose log shows that it's using execvp and the cc1 is with full path.

This is what I'm running on Fedora or Arch:

$ /usr/bin/env - ../Bear.install/bin/bear -- /usr/bin/env - /usr/bin/gcc -c /tmp/empty.c
$ cat compile_commands.json
[
  {
    "arguments": [
      "/usr/bin/gcc",
      "-c",
      "/tmp/empty.c"
    ],
    "directory": "/tmp",
    "file": "/tmp/empty.c"
  }
]
dkm commented 2 years ago

Not sure this is related, but I've never been able to do a parallel build of gcc with bear (same symptom as @marxin, build is stuck at the end). I usually do a sequential build when I'm not doing any dev. IIRC, @philberty had the same issue). Happy to help if needed.

rizsotto commented 2 years ago

@dkm there are two issues you mention:

rizsotto commented 2 years ago

I've seen intercept to be slow in the past, because it was not able to write the entries fast enough. (The root cause for it was that intercept was using SQLight to store events, but that was removed in recent versions.) If you running the compilation on a remote drive, or a drive which is slow, that can cause intercept to look like stuck on the job.

Things to try our:

theIDinside commented 2 years ago

I have a similar problem; my problem arises whenever I use more than 6 jobs for make.

hwti commented 1 year ago

I get the same issue at various gcc build steps when used with crosstool-ng, even with only one job. Maybe there is an issue with command redirection, since killing a cut - process terminates everything (obviously with a failure).

During the build, even the main intercept process uses quite a lot of CPU, it creates a giant >1GB file (logging all commands, including mv for example) which later gets reduced to 3MB.

Bear 2.4.4 works, and is a lot faster.

dkm commented 1 year ago

FWIW, I've tried again this morning using a freshly built bear: same behavior as described above (also tested with latest debian's package 3.0.20-1+b3). But I'm not sure it really picks everything from my local install as I can see :

└─ bear -- make -j8 all
   └─ intercept --library /usr/$LIB/bear/libexec.so --wrapper /usr/lib/x86_64-linux-gnu/bear/wrap
      ├─ intercept --library /usr/$LIB/bear/libexec.so --wrapper /usr/lib/x86_64-linux-gnu/bear/w
      ├─ intercept --library /usr/$LIB/bear/libexec.so --wrapper /usr/lib/x86_64-linux-gnu/bear/w

The bear command is the correct one. Removing the debian package seems to correct this, so maybe there's something to be fixed in the search routine?

The process never seems to finish, I don't see any CPU activity, still have plenty of free RAM... So not sure what's wrong.:

    1[||                                                                         1.3%] Tasks: 166, 1077 thr, 232 kthr; 1 running
    2[|                                                                          0.6%] Load average: 1.24 4.76 4.86 
    3[|                                                                          0.6%] Uptime: 17:45:36
    4[|||                                                                        2.6%]
    5[|                                                                          0.6%]
    6[||                                                                         1.3%]
    7[|||                                                                        1.9%]
    8[|||                                                                        2.5%]
    9[|||                                                                        1.9%]
   10[|||                                                                        1.9%]
   11[|                                                                          0.6%]
   12[|                                                                          0.6%]
  Mem[||||||||||||||||||||||||||||||||||||                                5.12G/47.0G]
  Swp[                                                                       0K/48.8G]

    PID△USER       PRI  NI  VIRT   RES   SHR S  CPU% MEM%   TIME+  Command                                                                                                   
3212785 dkm         20   0  6284  3944  3600 S   0.0  0.0  0:00.00 │  │  └─ bear -- make -j8 all                                                                             
3212786 dkm         20   0 3771M 43836 15588 S   0.0  0.1  1:11.61 │  │     └─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local/st
3212787 dkm         20   0 3771M 43836 15588 S   0.0  0.1  0:00.03 │  │        ├─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local
3212788 dkm         20   0 3771M 43836 15588 S   0.0  0.1  0:00.00 │  │        ├─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local
3212789 dkm         20   0 3771M 43836 15588 S   0.0  0.1  0:01.00 │  │        ├─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local
3212790 dkm         20   0 3771M 43836 15588 S   0.0  0.1  0:00.00 │  │        ├─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local
3212791 dkm         20   0 3771M 43836 15588 S   0.0  0.1  0:00.00 │  │        ├─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local
3212792 dkm         20   0 3771M 43836 15588 S   0.0  0.1  0:00.00 │  │        ├─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local
3212793 dkm         20   0 3771M 43836 15588 S   0.0  0.1  0:00.00 │  │        ├─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local
3212794 dkm         20   0 3771M 43836 15588 S   0.0  0.1  0:00.00 │  │        ├─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local
3212795 dkm         20   0 3771M 43836 15588 S   0.0  0.1  0:00.00 │  │        ├─ intercept --library /home/dkm/local/stow/bear/lib/bear/libexec.so --wrapper /home/dkm/local
....
snprajwal commented 1 year ago

+1, I'm facing the same issue with gccrs. If I use bear -- make -j8, the compilation succeeds, but the process never exits and like @dkm said, no CPU or RAM consumed either. It just sorta indefinitely hangs. I also noticed that the compile_commands.events.json was around 253M for me. Upon CTRL-C-ing the process, the compile_commands.json contained 14 entries.

rizsotto commented 1 year ago

Thanks guys for this report. Is there a way that you could identify the command that hung the build? And create a minimal example which could help me to reproduce this error? That would be a great help to fix this bug.

dkm commented 1 year ago

Sure, I'll see what I can find, thanks @rizsotto

dkm commented 1 year ago

Is there a way to see what is being executed? I have a stuck bear but don't really know how to dig it :)

snprajwal commented 1 year ago

Is there a way to see what is being executed? I have a stuck bear but don't really know how to dig it :)

Maybe try bear -vvvv -- make?

dkm commented 1 year ago

I've got a stuck bear, with the last lines on the term being:

[16:48:00.206738, cs, 2849623] [pid: 2849622] recognition failed: No tools recognize this execution.                                                                                                                
[16:48:00.207043, cs, 2849623] compilation entries created. [size: 0]                                                                                                                                               
[16:48:00.207048, cs, 2849623] compilation entries to output. [size: 0]                                                                                                                                             
[16:48:00.207176, cs, 2849623] compilation entries written. [size: 0]                                                                                                                                               
[16:48:00.207186, cs, 2849623] succeeded with: 0                                                                                                                                                                    
[16:48:00.208142, br, 2694891] Process wait request: done. [pid: 2849623]                                                                                                                                           
[16:48:00.208207, br, 2694891] Running citnames finished. [Exited with 0]                                                                                                                                           
[16:48:00.224768, br, 2694891] succeeded with: 0 

The log is rather big with all my env, so not very comfortable putting it here. I can give it to someone for debuging :)

The last command seems to be:


[16:48:00.206412, cs, 2849623] [pid: 2849622] execution: {"executable":"/bin/bash","arguments":["/bin/bash","-c","test -f config.h || make \"AR_FLAGS=rc\" \"CC_FOR_BUILD=gcc\" \"CFLAGS=-g -O2  -m32\" \"CXXFLAGS=-
g -O2 -D_GNU_SOURCE  -m32\" \"CFLAGS_FOR_BUILD=-g -O2\" \"CFLAGS_FOR_TARGET=-g -O2\" \"INSTALL=/usr/bin/install -c\" \"INSTALL_DATA=/usr/bin/install -c -m 644\" \"INSTALL_PROGRAM=/usr/bin/install -c\" \"INSTALL_S
CRIPT=/usr/bin/install -c\" \"JC1FLAGS=\" \"LDFLAGS=-m32\" \"LIBCFLAGS=-g -O2  -m32\" \"LIBCFLAGS_FOR_TARGET=-g -O2\" \"MAKE=make\" \"MAKEINFO=makeinfo --split-size=5000000   \" \"PICFLAG=\"
...
rizsotto commented 1 year ago

Thanks @dkm for this update.

Will test if that command alone can cause the build stuck.

But what I find strange is the output you've pasted here reports the whole build process was finished. The intercept and citnames processes are finished. And even bear finished, which was calling these processes. (The succeeded with... lines are literally the last lines in the main function of these tools.)

dkm commented 1 year ago

Oh, maybe some thread was started with incorrect parameter and is preventing the process to finish until it is joined?

mmalcomson commented 1 month ago

Hello, I've been having the same problem and am pretty sure I know what's happening on my machine. Whether it's the same thing as what's happening on others' machine I'm not sure.

For me the root cause of the problem appears to be fixincl starting a "server shell" which it later sends commands. This "server shell" is supposed to be sent a SIGKILL on fixincl exit, but since the PID of the child that fixincl spawns is a wrapper process when run under bear the signal does not get to the underlying bash process and this "shell server" simply stays around.

I have a reproducer for the creation of a process that doesn't get killed because of the bear wrapper -- but not fot the hang. In gcc the hang appears to be because the file descriptor for make jobserver information is open in the fixincl shell server. This means that the adrift shell server blocks make from finishing by making the make blocking read on it's jobserver FIFO never return. This interaction with GNU make's jobserver is something that I've not managed to reproduce -- I suspect it's something to do with recursive make invocations but am not 100% confident on that.

Steps to confirm: 1) Reproduce the problem with bear --verbose. 2) Observe that make is waiting on a read from a pipe.

   vshcmd: > gdb $(which make)
   Reading symbols from /usr/bin/make...

   This GDB supports auto-downloading debuginfo from the following URLs:
     <https://debuginfod.ubuntu.com>
   Enable debuginfod for this session? (y or [n]) 
   vshcmd: > y
   Debuginfod has been enabled.
   To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
   Downloading separate debug info for /usr/bin/make...
   Reading symbols from /home/mmalcomson/.cache/debuginfod_client/dd791a7b726a553a077eb0dc5944da16b79ab723/debuginfo...
   (gdb) 
   vshcmd: > attach 124386
   Attaching to program: /usr/bin/make, process 124386
   Reading symbols from /usr/lib/x86_64-linux-gnu/bear/libexec.so...
   Reading symbols from /home/mmalcomson/.cache/debuginfod_client/561d7a796e33a95b19405a2a6be840115969ae71/debuginfo...
   Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...
   Reading symbols from /usr/lib/debug/.build-id/08/134323d00289185684a4cd177d202f39c2a5f3.debug...
   Reading symbols from /lib/x86_64-linux-gnu/libstdc++.so.6...
   Reading symbols from /home/mmalcomson/.cache/debuginfod_client/40b9b0d17fdeebfb57331304da2b7f85e1396ef2/debuginfo...
   Reading symbols from /lib64/ld-linux-x86-64.so.2...
   Reading symbols from /usr/lib/debug/.build-id/3e/81740f816ee1521a87e439c16ebfde46f147a5.debug...
   Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...
   Reading symbols from /usr/lib/debug/.build-id/98/2083caac6565fb72054d3529378d51bd903e8e.debug...
   Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...
   Reading symbols from /home/mmalcomson/.cache/debuginfod_client/92123f0e6223c77754bac47062c0b9713ed363df/debuginfo...
   Downloading separate debug info for system-supplied DSO at 0x7ffccdaee000...
   [Thread debugging using libthread_db enabled]
   Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
   Download failed: Invalid argument.  Continuing without source file ./io/../sysdeps/unix/sysv/linux/read.c.
   0x00007cd50e51ba61 in __GI___libc_read (fd=4, buf=buf@entry=0x7ffccdac00c7, nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:26
   warning: 26  ../sysdeps/unix/sysv/linux/read.c: No such file or directory
   (gdb) 
   vshcmd: > bt
   #0  0x00007cd50e51ba61 in __GI___libc_read (fd=4, buf=buf@entry=0x7ffccdac00c7, nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:26
   #1  0x000062b982383549 in read (__nbytes=1, __buf=0x7ffccdac00c7, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:28
   #2  jobserver_acquire_all () at ../../src/posixos.c:212
   #3  0x000062b982370783 in clean_jobserver (status=status@entry=0) at ../../src/main.c:3436
   #4  0x000062b98237085e in die (status=status@entry=0) at ../../src/main.c:3480
   #5  0x000062b98235d6a7 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../../src/main.c:2613
   (gdb) 
   vshcmd: > frame 2
   #2  jobserver_acquire_all () at ../../src/posixos.c:212
   warning: Source file is more recent than executable.
   212        EINTRLOOP (r, read (job_fds[0], &intake, 1));
   (gdb) 
   vshcmd: > print job_fds
   vshcmd: > print job_rfd
   $1 = {4, -1}
   (gdb) $2 = -1
   (gdb) 
   vshcmd: > # fd 4 is what we're waiting on.
   vshcmd: > # we are waiting from 'jobserver_acquire_all'
   vshcmd: > !ls -l /proc/124386/fd
   total 0
   lrwx------ 1 mmalcomson domain-users 64 Jul 29 14:50 0 -> /dev/pts/2
   lrwx------ 1 mmalcomson domain-users 64 Jul 29 14:49 1 -> /dev/pts/2
   lrwx------ 1 mmalcomson domain-users 64 Jul 29 14:49 2 -> /dev/pts/2
   l-wx------ 1 mmalcomson domain-users 64 Jul 29 14:50 3 -> /home/mmalcomson/Documents/gnu-work/build-aarch64-none-linux-gnu/compile_commands.events.json
   lr-x------ 1 mmalcomson domain-users 64 Jul 29 14:50 4 -> 'pipe:[206643]'
   lr-x------ 1 mmalcomson domain-users 64 Jul 29 14:50 6 -> /home/mmalcomson/Documents/gnu-work/src/gcc
   (gdb) 

3) Observe that this FIFO is open in a bash process. N.b. this bash process has been reparented to systemd so is not obviously associated with make from the process tree.

   vshcmd: > !lsof | grep 206643
   lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
         Output information may be incomplete.
   make       124386                             mmalcomson    4r     FIFO               0,14        0t0     206643 pipe
   bash       334548                             mmalcomson    4r     FIFO               0,14        0t0     206643 pipe
   bash       334548                             mmalcomson    5w     FIFO               0,14        0t0     206643 pipe
   (gdb) 

4) Observe that this bash process's PID was started from a wrapper whose PPID was that of the fixincl process (and never got closed).

   [14:52:06.217067, wr, 334377, ppid: 334374] Process spawned. [pid: 334460, command: ["/home/mmalcomson/Documents/gnu-work/build-aarch64-none-linux-gnu/obj/gcc3/build-x86_64-pc-linux-gnu/fixincludes/fixincl"]]
   ...
   [14:52:06.433900, el, 334521] lib.cc; execvp file: /bin/bash
   [14:52:06.441407, el, 334521] lib.cc; on_load
   [14:52:06.441569, wr, 334521, ppid: 334460] wrapper: 3.1.3
   [14:52:06.441583, wr, 334521, ppid: 334460] arguments: ["/usr/lib/x86_64-linux-gnu/bear/wrapper", "--destination", "dns:///localhost:34517", "--verbose", "--execute", "/bin/bash", "--", "/bin/bash"]
   [14:52:06.441807, wr, 334521, ppid: 334460] environment: [... removed ...]
   [14:52:06.441982, wr, 334521, ppid: 334460] arguments parsed: {program: /usr/lib/x86_64-linux-gnu/bear/wrapper, arguments: [{--: [/bin/bash]}, {--destination: [dns:///localhost:34517]}, {--execute: [/bin/bash]}, {--verbose: []}]}
   [14:52:06.443640, wr, 334521, ppid: 334460] gRPC call requested: supervise::Supervisor::Resolve
   [14:52:06.445223, br, 124324] trying to resolve for library: /bin/bash
   [14:52:06.446048, wr, 334521, ppid: 334460] gRPC call [Resolve] finished: true
   [14:52:06.446685, wr, 334521, ppid: 334460] Process spawned. [pid: 334548, command: ["/bin/bash"]]
   [14:52:06.446811, wr, 334521, ppid: 334460] gRPC call requested: supervise::Interceptor::Register
   [14:52:06.447944, wr, 334521, ppid: 334460] gRPC call [Register] finished: true
   [14:52:06.448027, wr, 334521, ppid: 334460] Process wait requested. [pid: 334548]
   [14:52:06.448341, el, 334548] lib.cc; on_load

5) Also observe that the bash process started with no arguments (which would be this "server shell").

One could also confirm by adding a print saying "Killing PID %d" in the close_server function of fixincludes/server (or I guess by using auditctl to audit any kill syscalls). One could then clearly see that the process fixincl kills is not the server shell that is stuck around but the wrapper that got in between fixincl and it's server shell.

Sending a KILL to process: 334521

In my case killing this bash process cleared everything up and appeared to produce a compile_commands.json that had the full compilation.


Reproducer for orphaned bash process due to bear wrapper:

With the files:

vshcmd: > cat Makefile
%.o: %.c
    gcc -c $^ -o $@

fake-fixincl: fake-fixincl.o
    gcc $^ -o fake-fixincl

main: main.o
    gcc $^ -o main

.PHONY: possible_problem
possible_problem: fake-fixincl
    ./fake-fixincl

.PHONY: all
all: main possible_problem

clean:
    rm -rf *.o main fake-fixincl compile_commands.json
bear-bug-playzone [15:22:44] $ 
vshcmd: > cat fake-fixincl.c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>

#  define ATTRIBUTE_UNUSED __attribute__ ((__unused__))
typedef enum t_bool
{
  BOOL_FALSE, BOOL_TRUE
} t_bool;

typedef struct
{
  FILE *pf_read;        /* parent read fp  */
  FILE *pf_write;       /* parent write fp */
} t_pf_pair;

typedef struct
{
  int read_fd;
  int write_fd;
} t_fd_pair;

#define NOPROCESS   ((pid_t) -1)
#define NULLPROCESS ((pid_t)0)
#define MAXPATHLEN     4096
#define EXIT_PANIC  99

static volatile enum t_bool read_pipe_timeout;
static pid_t server_master_pid = NOPROCESS;
static t_pf_pair server_pair =
{ (FILE *) NULL, (FILE *) NULL };
static pid_t server_id = NULLPROCESS;

static const char* def_args[] =
{ (char *) NULL, (char *) NULL };

/*
 *  chain_open
 *
 *  Given an FD for an inferior process to use as stdin,
 *  start that process and return a NEW FD that that process
 *  will use for its stdout.  Requires the argument vector
 *  for the new process and, optionally, a pointer to a place
 *  to store the child's process id.
 */
int
chain_open (int stdin_fd, const char** pp_args, pid_t* p_child)
{
  t_fd_pair stdout_pair;
  pid_t ch_id;
  const char *pz_cmd;

  stdout_pair.read_fd = stdout_pair.write_fd = -1;

  /*
   *  Create a pipe it will be the child process' stdout,
   *  and the parent will read from it.
   */
  if (pipe ((int *) &stdout_pair) < 0)
    {
      if (p_child != (pid_t *) NULL)
        *p_child = NOPROCESS;
      return -1;
    }

  /*
   *  If we did not get an arg list, use the default
   */
  if (pp_args == (const char **) NULL)
    pp_args = def_args;

  /*
   *  If the arg list does not have a program,
   *  assume the "SHELL" from the environment, or, failing
   *  that, then sh.  Set argv[0] to whatever we decided on.
   */
  if (pz_cmd = *pp_args,
      (pz_cmd == (char *) NULL) || (*pz_cmd == '\0'))
    {

      pz_cmd = getenv ("SHELL");
      if (pz_cmd == (char *) NULL)
        pz_cmd = "sh";
    }

#ifdef DEBUG_PRINT
  printf ("START:  %s\n", pz_cmd);
  {
    int idx = 0;

    while (pp_args[++idx] != (char *) NULL)
      printf ("  ARG %2d:  %s\n", idx, pp_args[idx]);
  }
#endif

  /*
   *  Call fork() and see which process we become
   */
  ch_id = fork ();
  switch (ch_id)
    {
    case NOPROCESS:             /* parent - error in call */
      close (stdout_pair.read_fd);
      close (stdout_pair.write_fd);
      if (p_child != (pid_t *) NULL)
        *p_child = NOPROCESS;
      return -1;

    default:                    /* parent - return opposite FD's */
      if (p_child != (pid_t *) NULL)
        *p_child = ch_id;
#ifdef DEBUG_PRINT
      printf ("for pid %d:  stdin from %d, stdout to %d\n"
              "for parent:  read from %d\n",
              ch_id, stdin_fd, stdout_pair.write_fd, stdout_pair.read_fd);
#endif
      close (stdin_fd);
      close (stdout_pair.write_fd);
      return stdout_pair.read_fd;

    case NULLPROCESS:           /* child - continue processing */
      break;
    }

  /*
   *  Close the pipe end handed back to the parent process
   */
  close (stdout_pair.read_fd);

  /*
   *  Close our current stdin and stdout
   */
  close (STDIN_FILENO);
  close (STDOUT_FILENO);

  /*
   *  Make the fd passed in the stdin, and the write end of
   *  the new pipe become the stdout.
   */
  dup2 (stdout_pair.write_fd, STDOUT_FILENO);
  dup2 (stdin_fd, STDIN_FILENO);

  if (*pp_args == (char *) NULL)
    *pp_args = pz_cmd;

  execvp (pz_cmd, (char**)pp_args);
  fprintf (stderr, "Error %d:  Could not execvp( '%s', ... ):  %s\n",
           errno, pz_cmd, strerror (errno));
  exit (EXIT_PANIC);
}

/*
 *  proc2_open
 *
 *  Given a pointer to an argument vector, start a process and
 *  place its stdin and stdout file descriptors into an fd pair
 *  structure.  The "write_fd" connects to the inferior process
 *  stdin, and the "read_fd" connects to its stdout.  The calling
 *  process should write to "write_fd" and read from "read_fd".
 *  The return value is the process id of the created process.
 */
pid_t
proc2_open (t_fd_pair* p_pair, const char** pp_args)
{
  pid_t ch_id;

  /*  Create a bi-directional pipe.  Writes on 0 arrive on 1 and vice
     versa, so the parent and child processes will read and write to
     opposite FD's.  */
  if (pipe ((int *) p_pair) < 0)
    return NOPROCESS;

  p_pair->read_fd = chain_open (p_pair->read_fd, pp_args, &ch_id);
  if (ch_id == NOPROCESS)
    close (p_pair->write_fd);

  return ch_id;
}

/*
 *  proc2_fopen
 *
 *  Identical to "proc2_open()", except that the "fd"'s are
 *  "fdopen(3)"-ed into file pointers instead.
 */
pid_t
proc2_fopen (t_pf_pair* pf_pair, const char** pp_args)
{
  t_fd_pair fd_pair;
  pid_t ch_id = proc2_open (&fd_pair, pp_args);

  if (ch_id == NOPROCESS)
    return ch_id;

  pf_pair->pf_read = fdopen (fd_pair.read_fd, "r");
  pf_pair->pf_write = fdopen (fd_pair.write_fd, "w");
  return ch_id;
}
void
close_server (void)
{
  if (  (server_id != NULLPROCESS)
     && (server_master_pid == getpid ()))
    {
      fprintf(stderr, "Sending a KILL to process: %d\n",
          server_id);
      kill ((pid_t) server_id, SIGKILL);
      server_id = NULLPROCESS;
      server_master_pid = NOPROCESS;
      fclose (server_pair.pf_read);
      fclose (server_pair.pf_write);
      server_pair.pf_read = server_pair.pf_write = (FILE *) NULL;
    }
}

static void
sig_handler (int signo ATTRIBUTE_UNUSED)
{
#ifdef DEBUG
  /* FIXME: this is illegal to do in a signal handler.  */
  fprintf (stderr,
          "fixincl ERROR:  sig_handler: killed pid %ld due to %s\n",
          (long) server_id, signo == SIGPIPE ? "SIGPIPE" : "SIGALRM");
#endif
  close_server ();
  read_pipe_timeout = BOOL_TRUE;
}

static void
server_setup (void)
{
  static int atexit_done = 0;
  char buff [MAXPATHLEN + 1];

  if (atexit_done++ == 0)
    atexit (close_server);
  else
    fputs ("NOTE: server restarted\n", stderr);

  server_master_pid = getpid ();

  signal (SIGPIPE, sig_handler);
  signal (SIGALRM, sig_handler);

  fputs ("trap : 1\n", server_pair.pf_write);
  fflush (server_pair.pf_write);
  if (getcwd (buff, MAXPATHLEN + 1) == NULL)
    buff[0] = 0;
}

int main()
{
  def_args[0] = "/bin/bash";
  server_id = proc2_fopen (&server_pair, def_args);
  if (server_id > 0)
    server_setup();
  printf("Have set things up ...\n");
  sleep(1);
  return 0;
}

bear-bug-playzone [15:22:56] $ 
vshcmd: > cat main.c
#include <stdio.h>
int main()
{
  printf("Hello World!\n");
  return 0;
}
bear-bug-playzone [15:23:03] $ 

Run the below commands ...

vshcmd: > pgrep -a bash
10048 bash
15142 bash
bear-bug-playzone [15:20:46] $ 
vshcmd: > make clean
rm -rf *.o main fake-fixincl compile_commands.json
bear-bug-playzone [15:20:49] $ 
vshcmd: > make -j 22 all
gcc -c main.c -o main.o
gcc -c fake-fixincl.c -o fake-fixincl.o
gcc main.o -o main
gcc fake-fixincl.o -o fake-fixincl
./fake-fixincl
Have set things up ...
Sending a KILL to process: 1243075
bear-bug-playzone [15:21:01] $ 
vshcmd: > pgrep -a bash
10048 bash
15142 bash
bear-bug-playzone [15:21:51] $ 
vshcmd: > make clean
rm -rf *.o main fake-fixincl compile_commands.json
bear-bug-playzone [15:21:57] $ 
vshcmd: > bear -- make -j 22 all
gcc -c main.c -o main.o
gcc -c fake-fixincl.c -o fake-fixincl.o
gcc main.o -o main
gcc fake-fixincl.o -o fake-fixincl
./fake-fixincl
Have set things up ...
Sending a KILL to process: 1244099
bear-bug-playzone [15:22:00] $ 
vshcmd: > pgrep -a bash
10048 bash
15142 bash
1244126 /bin/bash
bear-bug-playzone [15:22:06] $ 
mmalcomson commented 3 weeks ago

@rizsotto Just wondering if there's any chance of getting some sort of "ignore binaries matching this name" functionality into the intercept functionality?

In this particular case I know that fixincl doesn't spawn any compiler processes that we care about intercepting, so I figure that it might be nice to specify on the command line (or in a config file) something telling bear to avoid putting LD_PRELOAD into execution environment of fixincl.

dkm commented 3 weeks ago

Hello, I've been having the same problem and am pretty sure I know what's happening on my machine. Whether it's the same thing as what's happening on others' machine I'm not sure.

Impressive, thanks for the investigation! I'm still having the issue, so I'll try to kill the process that you've found to be the culprit (until a fix is found...)