greenbone / gvmd

Greenbone Vulnerability Manager - The database backend for the Greenbone Community Edition
GNU Affero General Public License v3.0

Gvm-cli connections fail under openvassd load #347

Closed asmaack closed 5 years ago

asmaack commented 5 years ago

I'm using gvm-cli (gvm-cli 2.0.0.beta1. API version 1.0.0.dev1) to connect to openvasmd on the local server to start and manage tasks, download reports etc.

An openvassd scan will spawn a lot of processes, and as a result the load on my system will increase.

Once openvassd reaches a certain level of load, gvm-cli fails to connect to openvasmd and throws an error. If the load stays particularly high for some time, gvm-cli will not connect at all. This seems to be a persistent bug: it is easily reproducible and I have tested it on two different systems.

Using "gvm-cli socket -c --xml ..." or "gvm-cli socket --gmp-username foo --gmp-password foo --xml ..." will yield same results. The xml string itself has no effect on this behaviour.

Error message:

$ gvm-cli socket -c --xml ""
Traceback (most recent call last):
  File "/usr/local/bin/gvm-cli", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/gvmtools/cli.py", line 251, in main
    gvm.authenticate(args.gmp_username, args.gmp_password)
  File "/home/asgeir/.local/lib/python3.6/site-packages/gvm/protocols/gmpv7.py", line 198, in authenticate
    response = self._read()
  File "/home/asgeir/.local/lib/python3.6/site-packages/gvm/protocols/base.py", line 54, in _read
    response = self._connection.read()
  File "/home/asgeir/.local/lib/python3.6/site-packages/gvm/connections.py", line 275, in read
    data = self._socket.recv(BUF_SIZE)
ConnectionResetError: [Errno 104] Connection reset by peer

$ uptime
 11:03:24 up 4 days, 46 min, 2 users, load average: 32.38, 20.21, 8.70

Each time this happens, the openvasmd.log will print:

md manage:WARNING:2019-01-22 10h59.29 utc:16364: sql_exec_internal: sqlite3_step failed: interrupted
md manage:WARNING:2019-01-22 10h59.29 utc:16364: sqlv: sql_exec_internal failed

Somehow, I feel this must be a bug in openvasmd, possibly in sql_exec_internal() in src/sql_sqlite3.c or a related function?

Is it possible that openvasmd is not honoring timeouts, or that the timeouts in the code are too short? This problem happens consistently when openvassd comes under load; openvasmd itself is under no load at all.

It should be noted that, at the same time, the tasks.db database can be queried directly with any sqlite3 command and answers instantly, for example:

$ sudo sqlite3 /usr/local/var/lib/openvas/mgr/tasks.db "select * from tasks;"

So the sqlite database file (tasks.db) itself is not under load or inaccessible in any way (locally, or remotely over ssh) under these circumstances.
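
To double-check from Python as well, a quick sketch (same tasks.db path as above; the 2-second busy timeout is just so a held write lock would surface as an error instead of a hang):

import sqlite3
import time

db_path = '/usr/local/var/lib/openvas/mgr/tasks.db'

start = time.monotonic()
con = sqlite3.connect(db_path, timeout=2.0)   # short busy timeout on purpose
try:
    count = con.execute('SELECT count(*) FROM tasks;').fetchone()[0]
    print(f'{count} tasks, answered in {time.monotonic() - start:.3f}s')
finally:
    con.close()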

My system:

$ openvassd --version
OpenVAS Scanner 5.1.3

$ openvasmd --version
OpenVAS Manager 7.0.4
GIT revision 03563817-openvas-manager-7.0
Manager DB revision 184

$ gvm-cli --version
gvm-cli 2.0.0.beta1. API version 1.0.0.dev1

$ uname -a
Linux xxx 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

timopollmeier commented 5 years ago

It looks like openvasmd is considering the connection closed too soon. Starting openvasmd with the option --client-watch-interval 0 may help for now.

I think we could possibly make the closed connection detection more stable by trying again to peek at the request from the client at least once.
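
To illustrate the idea (gvmd itself is C, so this is only a Python sketch of the retry-the-peek logic, not the actual implementation):

import errno
import socket
import time

def client_connection_closed(sock, retries=1, delay=0.5):
    """Peek at the client socket; only report 'closed' if a retried peek also sees EOF."""
    for attempt in range(retries + 1):
        try:
            data = sock.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
        except BlockingIOError:
            return False        # nothing to read yet, but the peer is still there
        except OSError as e:
            if e.errno in (errno.ECONNRESET, errno.EPIPE):
                return True     # peer really went away
            raise
        if data:
            return False        # a request is waiting to be read
        if attempt < retries:
            time.sleep(delay)   # EOF seen once: wait briefly and peek again
    return True                 # EOF on every peek: treat the connection as closed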

asmaack commented 5 years ago

Okay, I started openvasmd with:

/usr/local/sbin/openvasmd --unix-socket=/usr/local/var/run/gvmd.sock --client-watch-interval=0

Running the same test case, I was not able to reproduce the problem. So, good - in this case anyway.

However, I'm not familiar with what the implications may be. From what I gather, this disables the periodic check for whether client connections (like gvm-cli, gsad) to openvasmd have been closed. Why does openvasmd need to check this periodically in the first place?

So, anyway, I'm starting the openvasmd daemon this way until there is a proper fix.

asmaack commented 5 years ago

I suspect setting --client-watch-interval=0 may have put my scanner into a state where openvasmd spawns multiple processes and no longer answers any requests from gvm-cli whatsoever.

This happens right after running the daily /usr/local/sbin/greenbone-nvt-sync, /usr/local/sbin/greenbone-certdata-sync and /usr/local/sbin/greenbone-scapdata-sync, possibly right after receiving the HUP signal and going into the NVT cache update, rebuild and reload. See the openvasmd log below.

$ ps axf
...
  586 ?        SL     0:33 openvasmd
 1152 ?        S      1:41  \_ openvasmd
20096 ?        S      0:00  \_ openvasmd: Reloading
20107 ?        S      0:35  |   \_ openvasmd: Updating
21007 ?        S      0:18  \_ openvasmd
21276 ?        S      0:13  \_ openvasmd
21363 ?        S      0:13  \_ openvasmd
21449 ?        S      0:13  \_ openvasmd
21537 ?        S      0:13  \_ openvasmd
21621 ?        S      0:13  \_ openvasmd
21706 ?        S      0:13  \_ openvasmd
...

From the openvasmd.log we can see that openvasmd has lost its ability to connect to or query the sqlite database file. Yet, at the same time, the sqlite tasks.db is in perfect shape and answers any query instantly. So the problem is the same, or similar.

Also, after this, openvasmd will try and fail to connect to sqlite every 10 sec.

openvasmd.log:

md   main:   INFO:2019-02-06 23h00.11 utc:20096:    internal NVT cache update
md   main:   INFO:2019-02-06 23h00.11 utc:20096: rebuild_nvt_cache_retry: Reloading NVT cache
md   main:   INFO:2019-02-06 23h00.11 utc:20097: update_or_rebuild_nvt_cache: Updating NVT cache
md   main:   INFO:2019-02-06 23h00.11 utc:20097:    Updating NVT cache.
md    otp:MESSAGE:2019-02-06 23h00.11 utc:20097: Waiting for scanner to load NVTs: 26800 / 49001.
md   main:   INFO:2019-02-06 23h00.21 utc:20107: update_or_rebuild_nvt_cache: Updating NVT cache
md   main:   INFO:2019-02-06 23h00.21 utc:20107:    Updating NVT cache.
md   main:MESSAGE:2019-02-06 23h00.28 utc:20302:    OpenVAS Manager version 7.0.4 (GIT revision d21044c1-openvas-manager-7.0) (DB revision 184)
md manage:   INFO:2019-02-06 23h00.28 utc:20302:    Checking alerts.
md manage:WARNING:2019-02-06 23h06.02 utc:586: sql_exec_internal: sqlite3_step failed: cannot start a transaction within a transaction
md manage:WARNING:2019-02-06 23h06.02 utc:586: sqlv: sql_exec_internal failed
md manage:WARNING:2019-02-06 23h06.02 utc:586: manage_schedule: manage_update_nvti_cache error (Perhaps the db went down?)
md manage:WARNING:2019-02-06 23h06.12 utc:586: sql_exec_internal: sqlite3_step failed: cannot start a transaction within a transaction
...
md manage:WARNING:2019-02-07 08h39.20 utc:586: sqlv: sql_exec_internal failed
md manage:WARNING:2019-02-07 08h39.20 utc:586: manage_schedule: manage_update_nvti_cache error (Perhaps the db went down?)
md manage:WARNING:2019-02-07 08h39.30 utc:586: sql_exec_internal: sqlite3_step failed: cannot start a transaction within a transaction
md manage:WARNING:2019-02-07 08h39.30 utc:586: sqlv: sql_exec_internal failed
md manage:WARNING:2019-02-07 08h39.30 utc:586: manage_schedule: manage_update_nvti_cache error (Perhaps the db went down?)
md manage:WARNING:2019-02-07 08h39.40 utc:586: sql_exec_internal: sqlite3_step failed: cannot start a transaction within a transaction
md manage:WARNING:2019-02-07 08h39.40 utc:586: sqlv: sql_exec_internal failed
md manage:WARNING:2019-02-07 08h39.40 utc:586: manage_schedule: manage_update_nvti_cache error (Perhaps the db went down?)
md manage:WARNING:2019-02-07 08h39.50 utc:586: sql_exec_internal: sqlite3_step failed: cannot start a transaction within a transaction
...

Any gvm-cli connection will fail with the same error as before:

$ gvm-cli socket -c --xml "<get_version/>"
Traceback (most recent call last):
  File "/usr/local/bin/gvm-cli", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/gvmtools/cli.py", line 251, in main
    gvm.authenticate(args.gmp_username, args.gmp_password)
  File "/usr/local/lib/python3.6/dist-packages/gvm/protocols/gmpv7.py", line 201, in authenticate
    response = self._read()
  File "/usr/local/lib/python3.6/dist-packages/gvm/protocols/base.py", line 54, in _read
    return self._connection.read()
  File "/usr/local/lib/python3.6/dist-packages/gvm/connections.py", line 116, in read
    data = self._read()
  File "/usr/local/lib/python3.6/dist-packages/gvm/connections.py", line 82, in _read
    return self._socket.recv(BUF_SIZE)
socket.timeout: timed out

asmaack commented 5 years ago

Any news on this? Any more data we could supply?

mattmundell commented 5 years ago

Note that we're moving away from SQLite, so anything SQLite related is getting lower priority.

asmaack commented 5 years ago

What will be replacing sqlite? Are you saying there will be no support for sqlite?

bjoernricks commented 5 years ago

Currently we don't have the development power to support two very different database backends in the long term. Because a full-blown™ database like postgres has become more and more important for business use cases, we are going to lower the priority of sqlite. The sqlite3 db is currently only required at Greenbone for sensors like the GSM 25V, and that requirement will be removed during this year.

asmaack commented 5 years ago

But surely you're not saying the sqlite3 option will be dropped from openvasmd/gvmd!? It's fast and convenient for small installations. I suspect very many people are using OpenVAS in just that way, i.e. with the openvasmd default of sqlite3.

bjoernricks commented 5 years ago

All I can say is that we are currently not able to fix all issues on sqlite3, and we have to rely on the community to care about it because we have to concentrate on the postgres db.

Personally I am still using sqlite, but this has already created some "can't reproduce" issues on @mattmundell's side. So the importance of sqlite for our business is medium and will get very low during this year.

I can't say I am a fan of lowering the priority of sqlite, but I even had to advocate for it. With our current software stack and the number of available developers (we are currently two people for each of the components gsa, scanner and gvmd) we can't support both databases at the same quality.

asmaack commented 5 years ago

If we do manage to look at the code, any hints as to what should be done or changed, besides what timopollmeier said: "I think we could possibly make the closed connection detection more stable by trying again to peek at the request from the client at least once."?

asmaack commented 5 years ago

The error/bug does not exist in gvmd 8 with postgres.

gittygoo commented 5 years ago

I just had the exact same problem with both sqlite and postgres... I started with sqlite and observed the same as the OP, then switched everything to postgres and noticed that if gvmd is running and the postgres service is restarted, we start getting the following every 10 seconds:

md manage:WARNING:2019-10-22 01h39.12 utc:6042: sql_exec_internal: PQexec failed: (7)
md manage:WARNING:2019-10-22 01h39.12 utc:6042: sql_exec_internal: SQL: BEGIN;
md manage:WARNING:2019-10-22 01h39.12 utc:6042: sqlv: sql_exec_internal failed
md manage:WARNING:2019-10-22 01h39.12 utc:6042: manage_schedule: manage_update_nvti_cache error (Perhaps the db went down?)

It seems like gvmd is not able to detect that postgres has restarted, and it just keeps complaining.

Restarting the gvmd service solves it, but that's not really a solution. What's the problem here, since this is clearly affecting both sqlite and postgres? We are out of options for a reliable solution.

We are using the GVM 10 family on Ubuntu Bionic.
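
A crude stopgap would be a watchdog that restarts gvmd whenever it stops answering; a sketch, assuming gvmd runs as a systemd service called gvmd and gvm-cli can reach the socket (credentials are placeholders):

import subprocess
import time

CHECK = ['gvm-cli', 'socket', '--gmp-username', 'foo', '--gmp-password', 'foo',
         '--xml', '<get_version/>']

def gvmd_responds(timeout=30):
    try:
        subprocess.run(CHECK, check=True, timeout=timeout,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

while True:
    if not gvmd_responds():
        # Service name is an assumption; adjust to your init setup.
        subprocess.run(['systemctl', 'restart', 'gvmd'], check=False)
    time.sleep(600)   # re-check every 10 minutes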

gquittet commented 3 years ago

@timopollmeier

Hello guys, I have the same issue with OpenVAS:

gvmd --version
Greenbone Vulnerability Manager 21.4.2

using

PostgreSQL 12.7 (Ubuntu 12.7-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit

Under heavy load (i.e. creating 300 tasks to scan 300 targets), I get this output after creating ~260 tasks and targets:

gmp.get_targets(filter_string=f"{name}") \
  File "/usr/local/lib/python3.8/dist-packages/gvm/protocols/gmpv208/entities/targets.py", line 282, in get_targets
    return self._send_xml_command(cmd)
  File "/usr/local/lib/python3.8/dist-packages/gvm/protocols/base.py", line 86, in _send_xml_command
    return self.send_command(xmlcmd.to_string())
  File "/usr/local/lib/python3.8/dist-packages/gvm/protocols/base.py", line 140, in send_command
    raise e
  File "/usr/local/lib/python3.8/dist-packages/gvm/protocols/base.py", line 137, in send_command
    response = self._read()
  File "/usr/local/lib/python3.8/dist-packages/gvm/protocols/base.py", line 63, in _read
    return self._connection.read()
  File "/usr/local/lib/python3.8/dist-packages/gvm/connections.py", line 142, in read
    data = self._read()
  File "/usr/local/lib/python3.8/dist-packages/gvm/connections.py", line 105, in _read
    return self._socket.recv(BUF_SIZE)
socket.timeout: timed out
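
For anyone wanting to reproduce the load, a rough sketch of that kind of bulk creation with python-gvm (credentials, host list and UUIDs are placeholders; the config, scanner and port list IDs shown are the usual GVM defaults):

from gvm.connections import UnixSocketConnection
from gvm.protocols.gmpv208 import Gmp
from gvm.transforms import EtreeTransform

# Placeholder UUIDs; substitute your own scan config, scanner and port list.
CONFIG_ID = 'daba56c8-73ec-11df-a475-002264764cea'     # "Full and fast"
SCANNER_ID = '08b69003-5fc2-4037-a479-93b440211c73'    # default OpenVAS scanner
PORT_LIST_ID = '33d0cd82-57c6-11e1-8ed1-406186ea4fc5'  # "All IANA assigned TCP"

connection = UnixSocketConnection()

with Gmp(connection, transform=EtreeTransform()) as gmp:
    gmp.authenticate('admin', 'admin')
    for i in range(300):
        name = f'bulk-{i}'
        target = gmp.create_target(name=name, hosts=[f'10.0.0.{i % 254 + 1}'],
                                   port_list_id=PORT_LIST_ID)
        gmp.create_task(name, CONFIG_ID, target.get('id'), SCANNER_ID)
        # Per the report above, the socket.timeout appears after ~260 of these round-trips.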

I think this issue is really the same as above, even if it's 2 years old.

Do I need to create a new one?