immauss / openvas

Containers for running the Greenbone Vulnerability Manager. Run as a single container with all services, or as separate single-application containers via docker-compose.
GNU Affero General Public License v3.0

Scans interrupted with error "Cannot connect to Scanner" #269

Closed: dustinbird closed this issue 6 months ago

dustinbird commented 6 months ago

I have a new issue on one of our scanner boxes, which had been working without errors until now (while I was away). All the current scheduled scans (and manually triggered ones) are failing with a status of "Interrupted". When I look at the task, the only error message I see is "Cannot connect to scanner", with no detailed log.


As mentioned, this was working until Friday, when I went on annual leave, but scans seem to have started failing from Saturday onwards. Last Thursday I updated the Docker image to get the latest version; the same update was run on another box the following day, but that box is unaffected. I have today updated the Docker image again, run all available server patches, and rebooted the server (OpenSUSE Leap 15.5).


When I look at the container log I see the following error:

osp_scanner_feed_version: failed to connect to /var/run/ospd/ospd-openvas.sock

This file does not exist on this server, but neither does it on another scanning server set up the same way, running the same OpenSUSE build and GVM versions with no errors.

What can I do to correct this?

immauss commented 6 months ago

@dustinbird

That file should exist in the container. Try:

docker exec <openvas container name> ls -l /var/run/ospd/ 

Or ... use interactive exec to enter the container:

docker exec -it <openvas container name> bash 

Then navigate to /var/run/ospd as if you were on another Linux server.

BUT ... from what you describe, it sounds like ospd-openvas has died.

Could you attach the full contents of the log from the container ...

docker logs <container name>  > container.log 

Then attach the "container.log" file to this issue.

Also, I'm guessing that if you check "docker ps", the container does not show as healthy?
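For what it's worth, the health status can also be read directly; these are standard Docker CLI options, not specific to this image:

docker ps --filter name=openvas --format "{{.Names}}: {{.Status}}"
docker inspect --format "{{.State.Health.Status}}" <openvas container name>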

-Scott

dustinbird commented 6 months ago

Thank you for coming back to me on this.

Yes, I can now see the file when I navigate that way, so it is in the Docker container.


When I look at docker ps, I can see it has a status of unhealthy. I think I had got into the habit of seeing it running and being able to access the web interface, so I ignored that important piece of information.

The log you requested is below; I hope this is what you need:

cp: cannot stat '/usr/local/var/lib/': No such file or directory
chown: invalid user: 'gvm:gvm'
cp: cannot stat '/var/lib/gvm/': No such file or directory
cp: cannot stat '/var/lib/notus/': No such file or directory
cp: cannot stat '/var/lib/openvas/': No such file or directory
cp: cannot stat '/etc/gvm/': No such file or directory
cp: cannot stat '/usr/local/etc/openvas/': No such file or directory
gpg: keybox '/etc/openvas-gnupg/pubring.kbx' created
gpg: /etc/openvas-gnupg/trustdb.gpg: trustdb created
gpg: key 9823FAA60ED1E580: public key "Greenbone Community Feed integrity key" imported
gpg: Total number processed: 1
gpg:               imported: 1
gpg: inserting ownertrust of 6
Traceback (most recent call last):
  File "/usr/local/bin/ospd-openvas", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ospd_openvas/daemon.py", line 1264, in main
    daemon_main('OSPD - openvas', OSPDopenvas, NotusParser())
  File "/usr/local/lib/python3.11/dist-packages/ospd/main.py", line 152, in main
    daemon.init(server)
  File "/usr/local/lib/python3.11/dist-packages/ospd_openvas/daemon.py", line 528, in init
    self.update_vts()
  File "/usr/local/lib/python3.11/dist-packages/ospd_openvas/daemon.py", line 652, in update_vts
    self.notus.reload_cache()
  File "/usr/local/lib/python3.11/dist-packages/ospd_openvas/notus.py", line 138, in reload_cache
    data = json.loads(f.read_bytes())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 7 column 1 (char 203)
Exception ignored in atexit callback: <function exit_cleanup at 0x7f6c7d971440>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/ospd/main.py", line 74, in exit_cleanup
    sys.exit()
SystemExit:
free(): double free detected in tcache 2
malloc(): smallbin double linked list corrupted
corrupted double-linked list
malloc(): smallbin double linked list corrupted
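Reading the traceback: ospd-openvas dies while rebuilding its cache because one of the notus JSON files is not valid JSON. A minimal sketch for locating the bad file, assuming the notus data lives under /var/lib/notus inside the container (that path appears in the log above; the exact layout may differ):

docker exec <openvas container name> bash -c '
  # try to parse every notus JSON file and report any that fail
  for f in $(find /var/lib/notus -name "*.json"); do
    python3 -c "import json,sys; json.load(open(sys.argv[1]))" "$f" 2>/dev/null || echo "invalid JSON: $f"
  done'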

immauss commented 6 months ago

Can you please share how you are starting the container? Full command or the docker-compose.yml

Also, make sure there are no issues with the underlying storage. At first glance, that's all I can think it could be, though that usually shows up as postgresql failing because it can't write to the DB. For me, that has happened when the disc gets full, but I have seen it caused by other issues as well.

Just noticed ... all the "cannot stat" errors refer to the persistent storage. Are you using a local directory or a Docker volume for your persistent storage? Make sure, wherever it is, that there are no issues with the underlying discs.
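For example, to check free space and to locate a named volume's backing directory on disk (both standard commands; the volume name openvas matches the run command later in the thread):

df -h /var/lib/docker
docker volume inspect --format "{{.Mountpoint}}" openvas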

-Scott

dustinbird commented 6 months ago

The command the container is started with is as follows:

docker run --detach --publish 8443:9392 -e PASSWORD="REDACTED" --volume openvas:/data -e HTTPS=true --name openvas immauss/openvas:latest

The system has the following space, so this does not appear to be an issue, and the device appears to be healthy. I could try restoring an old volume backup to see whether this points to a disk / DB issue.

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M  8.0K  4.0M   1% /dev
tmpfs           3.8G  4.0K  3.8G   1% /dev/shm
tmpfs           1.6G  9.0M  1.6G   1% /run
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/vda2        48G  6.4G   42G  14% /
/dev/vdb1        20G   14G  6.8G  67% /var/lib/docker
overlay          20G   14G  6.8G  67% /var/lib/docker/overlay2/0c5e2c0ca570a75ec2fb7b7ea838a2a19dcff926f7fc139bad45d18eb6928b90/merged
tmpfs           3.8G     0  3.8G   0% /var/lib/docker/containers/93415de6a729efae58bf0ed3181fd23914d71f8243416500d5f95b678f10be2c/mounts/secrets
tmpfs           775M   40K  775M   1% /run/user/0

dustinbird commented 6 months ago

Is there an easy way to do a health check on the volume or, more likely, the DB?

immauss commented 6 months ago

It's not the DB. You aren't getting errors from postgresql.

You could fsck the disc that holds /var/lib/docker
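Roughly like this, run with the filesystem unmounted; the device name is taken from the df output earlier in the thread:

systemctl stop docker      # stop Docker so nothing is writing to the disc
umount /var/lib/docker
fsck -f /dev/vdb1          # /dev/vdb1 per the df output above; -f forces a full check on ext filesystems
mount /var/lib/docker      # assumes an fstab entry for this mount point
systemctl start docker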

-Scott

dustinbird commented 6 months ago

We have checked the disk and there were no errors shown and no space issues. I have rolled the mounted drive back to an old snapshot and it is now working as it should. It is frustrating not to see an obvious cause, but I am relieved it is working.

Thank you for the support as always.

dustinbird commented 6 months ago

I spoke too soon. The old working snapshot has once again given me the same issues.

dustinbird commented 6 months ago

I think we will build a new box from the ground up. I can see how to export tasks, hosts, etc. to XML, but not how to import them. Is there a simple article about transferring the data from one server to another?

immauss commented 6 months ago

Before you do that ....

Start by creating a new container (new name and new volume name):

docker run --detach --publish 8443:9392 -e PASSWORD="REDACTED" --volume openvas-new:/data -e HTTPS=true --name openvas-new immauss/openvas:latest

Then take a look at the docs

and follow the steps for database backup and restore.

Restore from the old container to the new one.

Much easier than exporting everything to XML and then importing it. This will also bring over all of your results from previous scans.
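For illustration only, a generic sketch of that migration using standard PostgreSQL tools inside the containers; the project's documented procedure may differ, so follow the docs above. Container names match the run commands in this thread:

docker exec openvas su -c "pg_dumpall" postgres > gvmd-backup.sql    # dump everything from the old container
docker exec -i openvas-new su -c "psql" postgres < gvmd-backup.sql   # load it into the new one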

-Scott

dustinbird commented 6 months ago

Thank you.

grandaor commented 6 months ago

Exactly the same issue here, and I just did a restart of my compose :( Same errors with Python 3.11 in the container.

grandaor commented 6 months ago

Same on a new box with this simple compose:

version: '3.8'

services:

  openvas:
    image: immauss/openvas
    container_name: openvas
    restart: unless-stopped
    environment:
      PASSWORD: REDACTED
      HTTPS: false
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./data:/data
    ports:
      - 8080:9392

New Docker host, new bind volume.

Tested with a new Docker volume (instead of a bind volume): same :'(

grandaor commented 6 months ago

You're not alone @dustinbird :'(

immauss commented 6 months ago

@grandaor Can you give me the full scenario? How long has it been running?

When was the last time you updated the container?

Can you try the solution I suggested to @dustinbird?

Thanks, -Scott

immauss commented 6 months ago

Could you also send the results of the following command?

docker exec -it openvas bash -c "ls -l /data/var* /data/local*"   

Thanks, -Scott

grandaor commented 6 months ago

I hit this issue today when I restarted my compose in prod (including traefik, etc.). This container was updated and worked from 30/04/2024 21:39 until today. So I tried on a new box with the simple compose above and a fresh deploy (without data), and got exactly the same errors...

I noticed one thing:

root@4311fa403057:/var/run/ospd# ls -l
total 0
lrwxrwxrwx 1 root root 31 May  3 21:44 ospd.sock -> /var/run/ospd/ospd-openvas.sock

ospd.sock seems to be a symbolic link to a socket file that doesn't exist in the same directory.

openvas | md manage:WARNING:2024-05-03 22h05.47 CEST:1308: osp_scanner_feed_version: failed to connect to /var/run/ospd/ospd-openvas.sock
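That matches the crash in the traceback earlier: the ospd.sock link itself is fine, but ospd-openvas died before creating the socket it points to. A quick way to confirm, using standard tools (the container name openvas matches the compose file above):

docker exec openvas pgrep -a ospd-openvas || echo "ospd-openvas is not running"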

grandaor commented 6 months ago

bob@marley:~/docker/openvas$ docker exec -it openvas bash -c "ls -l /data/var* /data/local*"
/data/local-etc:
total 8
drwxr-xr-x 2 root root 4096 May  3 21:42 gvm
drwxr-xr-x 3 root root 4096 May  3 21:44 openvas

/data/local-share:
total 36
drwxr-xr-x 2 root root  4096 Apr 24 21:15 ca-certificates
drwxr-xr-x 3 root root  4096 Apr 24 22:04 doc
drwxrwsr-x 2 root staff 4096 Apr 24 21:15 fonts
drwxr-xr-x 6 gvm  gvm   4096 Apr 24 23:00 gvm
drwxr-xr-x 4 root root  4096 Apr 24 22:04 man
drwxr-xr-x 2 root root  4096 Apr 24 22:05 openvas
drwxr-xr-x 7 root root  4096 Apr 24 21:15 sgml
drwxr-xr-x 2 root root  4096 Apr 24 21:15 texmf
drwxr-xr-x 6 root root  4096 Apr 24 22:03 xml

/data/var-lib:
total 24
-rw-r--r-- 1 gvm gvm   11 May  3 21:43 FeedDate
drwxr-xr-x 8 gvm gvm 4096 Aug 13  2023 gvm
drwxr-xr-x 2 gvm gvm 4096 Aug 13  2023 mosquitto
drwxrwxr-x 4 gvm gvm 4096 May  3 21:43 notus
drwxr-xr-x 3 gvm gvm 4096 Aug 13  2023 openvas
-rw-r--r-- 1 gvm gvm   29 Apr 25 04:23 update.ts

/data/var-log:
total 32
-rw-r--r-- 1 postgres  root     19193 May  3 21:43 db-restore.log
drwxr-xr-x 2 gvm       gvm       4096 May  3 21:44 gvm
drwxr----- 2 mosquitto root      4096 May  3 21:42 mosquitto
drwxr-xr-x 2 postgres  postgres  4096 May  3 21:42 postgresql

grandaor commented 6 months ago

Note: I destroyed the container each time, and the issue still appears even with @dustinbird's solution. I think his container was not destroyed when he restored his snapshot.

immauss commented 6 months ago

You destroyed the container, but not your "./data"
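With the compose file above, a fully clean reset would look something like this (destructive: it deletes all persisted scan data):

docker compose down
rm -rf ./data        # the bind-mounted persistent storage from the compose file above
docker compose up -d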

I think I know what might be the problem, but need to test a few things ....

And the soft link is correct for the socket.

grandaor commented 6 months ago

Deleting my data was my first test :'( And I tested on a new box (totally fresh): same issue.

grandaor commented 6 months ago

What's your idea, Scott? :angel:

immauss commented 6 months ago

If you get the same issue with a clean /data ... then my first thought is not correct.

Strange ... I've tested on multiple machines, and have indications that more than 200 others are using that version as well ....

but ... as I'm writing this, I just got the same error.

Now that I have reproduced it, I should be able to work it out.

Sorry ...

You might need to roll back to the previous version while I work this out. It's pretty late here and I have an early day tomorrow. So if I don't work this out really soon, it might be a day or two before I can resolve it.

-Scott

immauss commented 6 months ago

NVM ... Found the issue. Will take me a few to fix it.

grandaor commented 6 months ago

don't worry :)

You're already my hero since you resolved https://github.com/immauss/openvas/issues/234

have a good night :)

grandaor commented 6 months ago

What is it? I'm curious :)

immauss commented 6 months ago

The cron job that generates the notus files blew up. You're getting bad data from my rsync server.

immauss commented 6 months ago

Ugh .... OK ... if you restart the container now, it should pull good data, and work. Sorry about that .... That's the first time that cron has run, so naturally there was something unanticipated .... which I still need to figure out.
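With the compose setup above, restarting is just:

docker compose restart openvas

or, for the plain docker run setup from earlier in the thread:

docker restart openvas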

@grandaor and @dustinbird Thank you for bringing this to my attention.

grandaor commented 6 months ago

I'm testing it immediately :)

grandaor commented 6 months ago

You rock, dude <3

openvas  | md main:MESSAGE:2024-05-03 20h44.40 utc:826: Greenbone Vulnerability Manager version 23.5.2 (DB revision 255)
openvas  | md manage:   INFO:2024-05-03 20h44.40 utc:826: Getting scanners.
openvas  | md main:MESSAGE:2024-05-03 20h44.43 utc:833: Greenbone Vulnerability Manager version 23.5.2 (DB revision 255)
openvas  | md manage:   INFO:2024-05-03 20h44.43 utc:833: Verifying scanner.
openvas  |
openvas  | ==> /usr/local/var/log/gvm/healthchecks.log <==
openvas  | Healthchecks completed with no issues.

grandaor commented 6 months ago

You have nothing to be sorry about; you were so quick to resolve it! Love you, Scott <3

immauss commented 6 months ago

@grandaor @dustinbird Thanks again for bringing this to my attention.

The problem was an issue with the cron job which creates the new notus feeds for Rocky. This has been resolved. I've also added error checking to validate the newly created notus feeds BEFORE the existing working feeds are replaced. SO ... hopefully ... this won't happen again. (But I'm sure you'll let me know if it does. 😉)
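That validate-before-replace pattern, sketched in shell; the generator command and destination path here are hypothetical, since the real cron script is server-side and not shown in this thread:

staging=$(mktemp -d)
generate_notus_feeds "$staging"    # hypothetical feed-generator command
for f in "$staging"/*.json; do
  # reject the whole batch if any file is not valid JSON
  python3 -c "import json,sys; json.load(open(sys.argv[1]))" "$f" || exit 1
done
rsync -a --delete "$staging"/ /srv/rsync/notus/    # hypothetical live path, replaced only after validation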

Thanks, Scott

dustinbird commented 6 months ago

Thank you so much for sorting this. I was away for the weekend, and when I logged in to look at your fix I found that a second box of ours had failed too, one week after the first. I have updated the Docker image on both, and now they seem to be working as they should.

Thank you so much for all the work on this.

dustinbird commented 6 months ago

Can we reopen this issue, please, or have another look at this? Since the latest Docker image was applied to the servers, fixing the interrupted-at-start issue, I now have some scans coming back as interrupted at 100%, which I never had before. Have you or anyone else encountered this since the latest image was applied?

We have a scan that takes a minute or so, where one scan box scans the other to make sure it is OK, but this is now giving two error messages after taking an incredibly long time rather than the expected one minute:

"Error exit, setting running task to Interrupted."