greenbone / openvas-scanner

This repository contains the scanner component for Greenbone Community Edition.
https://greenbone.github.io/docs/
GNU General Public License v2.0

openvas stuck with 100% CPU in destroy_scan_globals #1271

Open dhedberg opened 1 year ago

dhedberg commented 1 year ago

We're on the latest greenbone/ospd-openvas:stable, dated 2022-12-12T14:39:05.303393489Z, and we're seeing an issue where the openvas processes hang around after the tasks have been marked as done, taking up 100% of a CPU core.

The container is running in kubernetes with a deployment converted (with some small adaptations) from the community container docker-compose file.

Example:

root        1860 60.9  0.1 11574221292 29444 ?   tNL  Dec18 510:51 openvas --scan-start f59ff7de-7f35-4435-affe-bc7ed162b5bd

The task is already marked as finished. Attaching with gdb gives:

(gdb) bt
#0  0x00007f39f099ef1e in resize_set (reallocated_buckets_bitmap=0x7472682d5010, old_size=2660584736, hash_table=0x563b9e954760) at ../../../glib/ghash.c:834
#1  g_hash_table_resize (hash_table=0x563b9e954760) at ../../../glib/ghash.c:888
#2  0x00007f39f09a0d3e in g_hash_table_destroy (hash_table=0x563b9e954760) at ../../../glib/ghash.c:1488
#3  0x00007f39f0aa18c0 in destroy_scan_globals () from /usr/local/lib/libopenvas_misc.so.22
#4  0x0000563b9d621107 in openvas ()
#5  0x00007f39f0591d0a in __libc_start_main (main=0x563b9d61cd50 <main>, argc=3, argv=0x7ffe314163d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe314163c8)
    at ../csu/libc-start.c:308
#6  0x0000563b9d61cd8a in _start ()

It seems to be stuck in:

=> 0x00007f39f099ef00 <+960>:   add    $0x1,%ecx
   0x00007f39f099ef03 <+963>:   add    %ecx,%edx
   0x00007f39f099ef05 <+965>:   and    %r12d,%edx
   0x00007f39f099ef08 <+968>:   mov    %edx,%eax
   0x00007f39f099ef0a <+970>:   mov    %edx,%r13d
   0x00007f39f099ef0d <+973>:   shr    $0x5,%eax
   0x00007f39f099ef10 <+976>:   and    $0x1f,%r13d
   0x00007f39f099ef14 <+980>:   lea    (%rdi,%rax,4),%r11
   0x00007f39f099ef18 <+984>:   mov    (%r11),%eax
   0x00007f39f099ef1b <+987>:   bt     %edx,%eax
   0x00007f39f099ef1e <+990>:   jb     0x7f39f099ef00 <g_hash_table_resize+960>

... which I think corresponds to

          while (get_status_bit (reallocated_buckets_bitmap, hash_val)) \
            {                                                           \
              step++;                                                   \
              hash_val += step;                                         \
              hash_val &= hash_table->mask;                             \
            }                                                           \

in glib/ghash.c

That's as far as I have dug; I'm guessing it might be some sort of memory corruption or concurrency issue?
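
For illustration only, here is a minimal, self-contained sketch of how that probe loop can spin forever once the reallocated-buckets bitmap or the table metadata is corrupted so that every probed bucket reads as occupied. This is not the actual glib code: get_status_bit is a simplified stand-in, and the table size and the "all bits set" corruption are made up for the demo. The suspiciously large old_size=2660584736 in frame #0 would fit that picture. A guard counter is added so the demo itself terminates:

/* Illustrative only -- a simplified stand-in for the glib internals, not the
 * real code.  If every probed bucket reads as "already reallocated", the
 * probe below never finds a free slot and never exits. */
#include <stdint.h>
#include <stdio.h>

/* Bit layout matching what the disassembly above suggests: 32-bit words,
 * index / 32 selects the word, index % 32 selects the bit. */
static int
get_status_bit (const uint32_t *bitmap, uint32_t index)
{
  return (bitmap[index / 32] >> (index % 32)) & 1;
}

int
main (void)
{
  enum { SIZE = 64 };            /* hypothetical table size (power of two) */
  uint32_t bitmap[SIZE / 32];
  uint32_t mask = SIZE - 1;
  uint32_t hash_val = 5, step = 0;
  unsigned long guard = 0;

  /* Simulate corruption: every status bit is set. */
  for (unsigned i = 0; i < SIZE / 32; i++)
    bitmap[i] = 0xffffffffu;

  /* Structurally the same loop as the macro quoted above.  With the
   * corrupted bitmap it would spin forever; the guard stops the demo. */
  while (get_status_bit (bitmap, hash_val))
    {
      step++;
      hash_val += step;
      hash_val &= mask;
      if (++guard > 1000000)
        {
          printf ("probe loop still spinning after %lu steps\n", guard);
          break;
        }
    }
  return 0;
}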

ArnoStiefvater commented 1 year ago

Hey @dhedberg ,

Thanks for the bug report and sorry for the late reply.

We tried to reproduce the issue, but without success. Our guess is that this is possibly a bug in glib.

Does this issue still happen or have you found a solution since then? Do you have some minimal reproducer, so we can replicate the issue?

dhedberg commented 1 year ago

Thanks for looking into this.

As far as I can tell from our logs and CPU usage monitoring, the issue still persists. We're working around it for now with a cron job that regularly kills any lingering openvas --scan-start processes matching tasks marked as Done in the API. We also automatically update to the latest stable images (at least) once a week.
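
For what it's worth, a rough, hypothetical sketch of that kind of cleanup could look like the following. This is not our actual cron job: the finished-scans file path and the use of SIGKILL are placeholders, and producing the list of Done scan IDs from the API is out of scope here.

/* Hypothetical sketch only: walk /proc, find lingering
 * "openvas --scan-start <id>" processes whose scan id appears in a list of
 * already-finished scans, and kill them.  The list file path is an assumed
 * placeholder; in practice the Done task ids come from the API. */
#include <ctype.h>
#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

static int
id_is_finished (const char *scan_id, const char *done_file)
{
  char line[128];
  FILE *fp = fopen (done_file, "r");
  if (!fp)
    return 0;
  while (fgets (line, sizeof line, fp))
    {
      line[strcspn (line, "\r\n")] = '\0';
      if (strcmp (line, scan_id) == 0)
        {
          fclose (fp);
          return 1;
        }
    }
  fclose (fp);
  return 0;
}

int
main (void)
{
  const char *done_file = "/var/lib/openvas/finished-scans.txt"; /* assumed */
  DIR *proc = opendir ("/proc");
  struct dirent *entry;

  if (!proc)
    return 1;
  while ((entry = readdir (proc)))
    {
      char path[300], cmdline[4096];
      FILE *fp;
      size_t n;

      if (!isdigit ((unsigned char) entry->d_name[0]))
        continue;
      snprintf (path, sizeof path, "/proc/%s/cmdline", entry->d_name);
      fp = fopen (path, "r");
      if (!fp)
        continue;
      n = fread (cmdline, 1, sizeof cmdline - 1, fp);
      fclose (fp);
      if (n == 0)
        continue;
      cmdline[n] = '\0';

      /* /proc/<pid>/cmdline is NUL-separated: argv[0]\0argv[1]\0argv[2]... */
      char *argv0 = cmdline;
      char *argv1 = argv0 + strlen (argv0) + 1;
      if (argv1 >= cmdline + n)
        continue;
      char *argv2 = argv1 + strlen (argv1) + 1;
      if (argv2 >= cmdline + n)
        continue;

      if (strstr (argv0, "openvas") && strcmp (argv1, "--scan-start") == 0
          && id_is_finished (argv2, done_file))
        {
          printf ("killing lingering scan %s (pid %s)\n", argv2, entry->d_name);
          kill ((pid_t) atoi (entry->d_name), SIGKILL);
        }
    }
  closedir (proc);
  return 0;
}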

I don't have a minimal reproducer at hand. We don't do much customization: we create a custom port list from the IANA one with one port removed, a custom scan config from "Full and fast" with around 10 OIDs disabled, and have a few tasks that scan a bunch of IP ranges with SSH auth enabled. Usually there's only one task running at a time.

vedrixNL commented 1 year ago

Hi all, I seem to have the same issue.

OSPD[7] 2023-09-11 11:04:31,470: INFO: (ospd.command.command) Scan 3b93c961-e15f-460e-859a-15fa12c26368 added to the queue in position 2.
OSPD[7] 2023-09-11 11:04:40,670: INFO: (ospd.ospd) Currently 1 queued scans.
OSPD[7] 2023-09-11 11:04:41,325: INFO: (ospd.ospd) Starting scan 3b93c961-e15f-460e-859a-15fa12c26368.
OSPD[7] 2023-09-11 11:47:09,521: INFO: (ospd.ospd) 3b93c961-e15f-460e-859a-15fa12c26368: Host scan finished.
OSPD[7] 2023-09-11 11:47:09,528: INFO: (ospd.ospd) 3b93c961-e15f-460e-859a-15fa12c26368: Scan finished.

root@ospd-openvas:/ospd-openvas# ps fax
  PID TTY      STAT   TIME COMMAND
33270 pts/0    Ss     0:00 bash
33286 pts/0    R+     0:00  \_ ps fax
    1 ?        Ss     0:00 /sbin/docker-init -- /usr/bin/tini -- /usr/local/bin/entrypoint ospd-openvas -f --config /etc/gvm/ospd-openvas.conf --mqtt-broker
    6 ?        S      0:00 /usr/bin/tini -- /usr/local/bin/entrypoint ospd-openvas -f --config /etc/gvm/ospd-openvas.conf --mqtt-broker-address mqtt-broker
    7 ?        Sl     3:19  \_ /usr/bin/python3 /usr/local/bin/ospd-openvas -f --config /etc/gvm/ospd-openvas.conf --mqtt-broker-address mqtt-broker --notus
   20 ?        Sl     0:09  \_ /usr/bin/python3 /usr/local/bin/ospd-openvas -f --config /etc/gvm/ospd-openvas.conf --mqtt-broker-address mqtt-broker --n
 1133 ?        RNL    3:50 openvas --scan-start 3b93c961-e15f-460e-859a-15fa12c26368

The process seems to stay in memory and is never cleaned up. Restarting the Docker container helps, but somewhere along the line ospd-openvas is not releasing it.

R = running or runnable (on run queue)
N = low-priority (nice to other users)
L = has pages locked into memory (for real-time and custom IO)

The L flag in particular stands out here. Is this a bug in ospd-openvas? I never experienced this issue on Debian 11.5 with the same specs as below; it only started with Debian 12.

Debian 12 (bookworm)
8 vCPU, 16 GB memory
Used the Docker solution from OpenVAS
Latest images pulled and installed

Extra information:

Tasks: 169 total, 4 running, 165 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 0.2 sy, 70.7 ni, 23.3 id, 0.0 wa, 0.0 hi, 2.5 si, 2.6 st
MiB Mem : 15823.6 total, 249.8 free, 2185.4 used, 13855.3 buff/cache
MiB Swap: 8192.0 total, 8075.0 free, 117.0 used. 13638.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1153249 docker    30  10   10.8t  16228   7556 R 100.0   0.1      9,57 openvas --scan-start 57278eb0-2a35-456b-8661-ec049570ea49
1153257 docker    30  10   10.8t  16240   7568 R  99.7   0.1      9,53 openvas --scan-start c0ce0ed3-6856-48d1-9321-86d0593afcd1
1153253 docker    30  10   10.8t  16220   7552 R  99.3   0.1      9,55 openvas --scan-start 3b420acc-947c-4fd3-a4f8-2e0b94465c9c

The scans keep running indefinitely. Running strace on the PID shows no output (which would be consistent with the process spinning in user space rather than waiting on a syscall), and the process is not a zombie. Somewhere the process is simply never getting killed.

vedrixNL commented 1 year ago

I've also tried with the default docker-compose file on a Debian 12 server. The VM is running on Amazon, by the way, but that doesn't make a difference: I previously ran Debian 11.5 on Amazon (same architecture) with the same docker-compose file, and that one worked just fine.

jonathon2nd commented 2 months ago

I am seeing the same thing with our deployment using our helm chart: https://github.com/fpm-git/Greenbone-Community-Edition-Helm