Mula has counterproductive timeout on random endpoint

praseodym commented 1 year ago

Mula uses a timeout of 5s with 5 retries when requesting the random OOI endpoint from Octopoes. For large organisations, Octopoes can take more than 5s to respond, so Mula times out and retries the request. Because Octopoes doesn't seem to cancel its request to xtdb, each retry starts another query while the earlier ones are still running. This creates system load in xtdb for query results that are never used, because Mula has already timed out.

Relevant log from Mula on kattest01:

[2023-07-07 13:10:27 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:32 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:37 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:43 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:50 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:10:55 +0000] [WARNING] [scheduler.schedulers.scheduler] Could not get random oois for organisation: Alexa Top 1000 [organisation_id=alexatop1000, scheduler_id=boefje-alexatop1000]
[2023-07-07 13:12:00 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:05 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:10 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:16 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:23 +0000] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='octopoes_api', port=80): Read timed out. (read timeout=5)")': /alexatop1000/objects/random?amount=50&scan_level=1&scan_level=2&scan_level=3&scan_level=4
[2023-07-07 13:12:28 +0000] [WARNING] [scheduler.schedulers.scheduler] Could not get random oois for organisation: Alexa Top 1000 [organisation_id=alexatop1000, scheduler_id=boefje-alexatop1000]
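
To make the overlap concrete, here is a rough back-of-the-envelope sketch. It assumes each retry fires as soon as the previous attempt times out (backoff is ignored), that queries are never cancelled, and that every query takes a fixed 15s, in line with the response times reported later in this thread. The function name and parameters are made up for illustration and are not part of Mula or Octopoes.

```python
def max_concurrent_queries(timeout_s: float, attempts: int, query_s: float) -> int:
    """Peak number of overlapping queries if no query is ever cancelled.

    Assumes each retry starts the moment the previous attempt times out
    and that every query runs for exactly query_s seconds.
    """
    if query_s <= timeout_s:
        return 1  # the first attempt succeeds, so nothing overlaps
    starts = [i * timeout_s for i in range(attempts)]
    # The peak is always reached at the moment some query starts.
    return max(
        sum(1 for s in starts if s <= t < s + query_s)
        for t in starts
    )


# 5 retries means 6 attempts in total, as in the log above.
print(max_concurrent_queries(timeout_s=5, attempts=6, query_s=15))    # 3
print(max_concurrent_queries(timeout_s=120, attempts=6, query_s=15))  # 1
```

Under these assumptions, each scheduling round keeps up to three queries running at once and spends roughly 6 x 15s of query time on results that are thrown away.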

jpbruinsslot commented 1 year ago

What is the typical response time for the random endpoint for large organisations? And would optimizing the query from Octopoes to XTDB help?

dekkers commented 1 year ago

If you want to get random items from a list you first need to build the complete list of items. Building the list of items in this case means doing a join of all OOIs with all the scan profiles. I currently don't see how this can be implemented in a less expensive way. In my opinion the way to solve this is with a better scheduling algorithm that doesn't rely on requesting random OOIs from XTDB.

underdarknl commented 1 year ago

As mentioned earlier, the random call is there to avoid missing out on some data the scheduler did not know about. Because of this, there is no need to randomize the actual output, or even randomly select any records. We can get the same results by just picking a random offset (within the total count of objects) and selecting the next X records from there.
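
A minimal sketch of the random-offset idea, using an in-memory sequence as a stand-in for the object store. The function name is hypothetical. In a real datastore this would presumably become a count followed by an offset/limit query over a stable ordering, which is not shown here.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def window_at_random_offset(
    items: Sequence[T], amount: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to `amount` consecutive items starting at a random offset."""
    rng = rng or random.Random()
    if not items or amount <= 0:
        return []
    offset = rng.randrange(len(items))  # 0 <= offset < total count
    # Slicing stops at the end of the sequence, so a window that starts
    # near the end contains fewer than `amount` items.
    return list(items[offset : offset + amount])


objects = [f"ooi-{i}" for i in range(1000)]
batch = window_at_random_offset(objects, amount=50)
print(len(batch), batch[0])
```

Whether a short window near the end should wrap around to the start is a design choice this sketch leaves open.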

I am not aware of any scheduling algorithms that can function with a partial set of data. The required functionality is:

Create a reasonable list of jobs.
As jobs will come and go, there is no need to maintain a perfect schedule.
As there is no need for a complete list of jobs either, it seems wasteful to check every object against the scheduling rules.
Pause when the list of jobs is large enough and good enough for the job-workers to process for a while.
Attempt to gauge the quality of the current schedule by checking whether we can easily find jobs that would earn a position on our current list, replacing less interesting jobs. If so, keep adding objects; if not, pause until the job queue has some space again.
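
The requirements above could be sketched roughly as a bounded list that only accepts a candidate when it beats the least interesting job currently scheduled. Everything here is an assumption for illustration: the `score` function, the `capacity`, and the `min_ratio` pause threshold are hypothetical and do not come from Mula.

```python
import heapq
import itertools
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
_tie_breaker = itertools.count()  # keeps heap entries comparable on equal scores


def offer_batch(
    schedule: List[Tuple[float, int, T]],
    batch: Iterable[T],
    score: Callable[[T], float],
    capacity: int,
) -> int:
    """Offer candidates to a bounded schedule; return how many were accepted.

    `schedule` is a min-heap, so schedule[0] is the least interesting job.
    """
    accepted = 0
    for candidate in batch:
        entry = (score(candidate), next(_tie_breaker), candidate)
        if len(schedule) < capacity:
            heapq.heappush(schedule, entry)
            accepted += 1
        elif entry[0] > schedule[0][0]:
            heapq.heapreplace(schedule, entry)  # drop the least interesting job
            accepted += 1
    return accepted


def should_pause(accepted: int, batch_size: int, min_ratio: float = 0.1) -> bool:
    """Pause when a batch of candidates barely improved the schedule."""
    return batch_size > 0 and accepted / batch_size < min_ratio


schedule: list = []
first = offer_batch(schedule, [3, 1, 4, 1, 5], score=float, capacity=3)
second = offer_batch(schedule, [2, 2, 2], score=float, capacity=3)
print(first, second, should_pause(second, 3))
```

The second batch finds nothing better than what is already scheduled, which is the signal to stop sampling objects until the queue has space again.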

praseodym commented 1 year ago

What is the typical response time for the random endpoint for large organisations?

Looking at the kattest01 server, 12-15s seems normal.

Can we increase this timeout through an environment variable? It looks like no new tasks were scheduled since 13 June, almost a month ago.

jpbruinsslot commented 1 year ago

What is the typical response time for the random endpoint for large organisations?

Looking at the kattest01 server, 12-15s seems normal.

Can we increase this timeout through an environment variable? It looks like no new tasks were scheduled since 13 June, almost a month ago.

Yes, that can be implemented.

jpbruinsslot commented 1 year ago

If you want to get random items from a list you first need to build the complete list of items. Building the list of items in this case means doing a join of all OOIs with all the scan profiles. I currently don't see how this can be implemented in a less expensive way. In my opinion the way to solve this is with a better scheduling algorithm that doesn't rely on requesting random OOIs from XTDB.

A suggested solution is being worked on here: https://github.com/minvws/nl-kat-coordination/issues/204

praseodym commented 1 year ago

I configured SCHEDULER_OCTOPOES_REQUEST_TIMEOUT=120 on kattest01 and now the scheduler is working again. Should we still document this setting?
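
For documentation purposes, a minimal sketch of how such a setting might be read and validated. This is not Mula's actual settings code. The variable name comes from the comment above, while the helper name and the 5s default (taken from the "read timeout=5" in the log) are assumptions.

```python
import os
from typing import Mapping

ENV_VAR = "SCHEDULER_OCTOPOES_REQUEST_TIMEOUT"
DEFAULT_TIMEOUT_S = 5.0  # assumption: matches the "read timeout=5" in the log


def octopoes_request_timeout(env: Mapping[str, str] = os.environ) -> float:
    """Return the Octopoes request timeout in seconds."""
    raw = env.get(ENV_VAR)
    if raw is None:
        return DEFAULT_TIMEOUT_S
    timeout = float(raw)  # raises ValueError on non-numeric input
    if timeout <= 0:
        raise ValueError(f"{ENV_VAR} must be positive, got {raw!r}")
    return timeout


print(octopoes_request_timeout({ENV_VAR: "120"}))  # 120.0
```

Given the 12-15s response times observed on kattest01, any value comfortably above that range avoids the retry pile-up. 120 is the value reported to work.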

minvws / nl-kat-coordination

Mula has counterproductive timeout on random endpoint #1360