minvws / nl-kat-coordination

OpenKAT scans networks, finds vulnerabilities and creates accessible reports. It integrates the most widely used network tools and scanning software into a modular framework, accesses external databases such as shodan, and combines the information from all these sources into clear reports. It also includes lots of cat hair.
https://openkat.nl
European Union Public License 1.2

Mula has counterproductive timeout on random endpoint #1360

Closed praseodym closed 1 year ago

praseodym commented 1 year ago

Mula has a timeout of 5s with 5 retries when requesting the random OOI endpoint from Octopoes. For large organisations, Octopoes can take more than 5s to respond, so Mula times out and retries the request. Because Octopoes doesn't seem to cancel its request to xtdb, each retry leaves another query running in xtdb. This creates system load while the query results go unused, because Mula has already timed out.

Relevant log from Mula on kattest01:

[2023-07-07 13:10:27 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:32 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:37 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:43 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:50 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:55 +0000] [WARNING] [scheduler.schedulers.scheduler] Could not get random oois for organisation: Alexa Top 1000 [organisation_id=alexatop1000, scheduler_id=boefje-alexatop1000]
[2023-07-07 13:12:00 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:05 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:10 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:16 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:23 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:28 +0000] [WARNING] [scheduler.schedulers.scheduler] Could not get random oois for organisation: Alexa Top 1000 [organisation_id=alexatop1000, scheduler_id=boefje-alexatop1000]

jpbruinsslot commented 1 year ago

What is the typical response time for the random endpoint for large organisations? And would an optimization on the query side, from Octopoes to xtdb, be helpful?

dekkers commented 1 year ago

If you want to get random items from a list you first need to build the complete list of items. Building the list of items in this case means doing a join of all OOIs with all the scan profiles. I currently don't see how this can be implemented in a less expensive way. In my opinion the way to solve this is with a better scheduling algorithm that doesn't rely on requesting random OOIs from XTDB.

underdarknl commented 1 year ago

As mentioned earlier, the random call is there to avoid missing out on some data the scheduler did not know about. Because of this, there is no need to randomize the actual output, or even to randomly select any records. We can get the same results by just picking a random offset (within the total count of objects) and selecting the next X records from there.
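A minimal sketch of the random-offset idea, using an in-memory list as a stand-in for the stored objects. The function name `random_window` and its parameters are hypothetical and not part of Octopoes or Mula. Against a database this would translate to a count plus an offset/limit query over a stable ordering, and whether that is cheap in xtdb is not established in this thread.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def random_window(
    items: Sequence[T], amount: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to `amount` consecutive items starting at a random offset.

    Hypothetical helper: `items` stands in for the full, consistently ordered
    set of objects; nothing is shuffled and no per-record sampling is done.
    """
    rng = rng or random.Random()
    total = len(items)
    if total == 0 or amount <= 0:
        return []
    # Keep the window inside the list so a full batch comes back when possible.
    last_start = max(0, total - amount)
    offset = rng.randint(0, last_start)
    return list(items[offset:offset + amount])


if __name__ == "__main__":
    oois = [f"ooi-{i}" for i in range(1000)]  # made-up identifiers
    batch = random_window(oois, amount=50)
    print(len(batch), batch[0], batch[-1])
```

Clamping the offset to `total - amount` is a design choice in this sketch: it always returns a full batch, at the cost of items near the start and end of the ordering being picked slightly less often than those in the middle.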

I am not aware of any scheduling algorithms that can function with a partial set of data. The required functionality being:

praseodym commented 1 year ago

> What is the typical response time for the random endpoint for large organisations?

Looking at the kattest01 server, 12-15s seems normal.

Can we increase this timeout through an environment variable? It looks like no new tasks were scheduled since 13 June, almost a month ago.
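To make the cost of the current settings concrete, here is a rough back-of-the-envelope sketch. It assumes one initial attempt plus 5 retries (6 attempts), that each attempt starts exactly when the previous one times out (the small backoff visible in the log is ignored), and that Octopoes never cancels the xtdb query. The function names are made up for illustration.

```python
def peak_overlap(timeout_s: float, attempts: int, query_s: float) -> int:
    """Peak number of queries running at once under the assumptions above."""
    starts = [i * timeout_s for i in range(attempts)]
    peak = 0
    for t in starts:
        # Count queries that have started by time t and have not yet finished.
        running = sum(1 for s in starts if s <= t < s + query_s)
        peak = max(peak, running)
    return peak


def wasted_query_seconds(timeout_s: float, attempts: int, query_s: float) -> float:
    """Total query time whose result is thrown away because the client gave up."""
    if query_s <= timeout_s:
        return 0.0  # the response arrives in time, nothing is wasted
    return attempts * query_s


if __name__ == "__main__":
    for query_s in (12.0, 15.0):
        print(
            f"query={query_s}s",
            f"peak={peak_overlap(5.0, 6, query_s)}",
            f"wasted={wasted_query_seconds(5.0, 6, query_s)}s",
        )
```

Under these assumptions a 12-15s query with a 5s timeout never succeeds, while several copies of it run in xtdb at the same time during each scheduling round.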

jpbruinsslot commented 1 year ago

> > What is the typical response time for the random endpoint for large organisations?
>
> Looking at the kattest01 server, 12-15s seems normal.
>
> Can we increase this timeout through an environment variable? It looks like no new tasks were scheduled since 13 June, almost a month ago.

Yes, that can be implemented.

jpbruinsslot commented 1 year ago

> If you want to get random items from a list you first need to build the complete list of items. Building the list of items in this case means doing a join of all OOIs with all the scan profiles. I currently don't see how this can be implemented in a less expensive way. In my opinion the way to solve this is with a better scheduling algorithm that doesn't rely on requesting random OOIs from XTDB.

A suggested solution is being worked on here: https://github.com/minvws/nl-kat-coordination/issues/204

praseodym commented 1 year ago

I configured SCHEDULER_OCTOPOES_REQUEST_TIMEOUT=120 on kattest01 and now the scheduler is working again. Should we still document this setting?
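For documentation purposes, a hedged sketch of how such a setting could be read and applied. Only the variable name `SCHEDULER_OCTOPOES_REQUEST_TIMEOUT` and the value 120 come from this thread. The 5s default is inferred from the "read timeout=5" in the log, and the helper below is hypothetical, not Mula's actual configuration code.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "SCHEDULER_OCTOPOES_REQUEST_TIMEOUT"
DEFAULT_TIMEOUT_S = 5.0  # assumption: matches the "read timeout=5" seen in the log


def octopoes_request_timeout(env: Optional[Mapping[str, str]] = None) -> float:
    """Return the request timeout in seconds, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        return DEFAULT_TIMEOUT_S
    timeout = float(raw)  # raises ValueError on non-numeric input
    if timeout <= 0:
        raise ValueError(f"{ENV_VAR} must be positive, got {raw!r}")
    return timeout


if __name__ == "__main__":
    timeout = octopoes_request_timeout()
    # The value would then be passed to whichever HTTP client is in use,
    # for example the standard library's urllib.request.urlopen(url, timeout=timeout).
    print(f"Octopoes request timeout: {timeout}s")
```

With `SCHEDULER_OCTOPOES_REQUEST_TIMEOUT=120` set in the environment, this sketch returns 120.0 seconds. With the variable unset it falls back to the assumed 5-second default.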