What is the typical response time for the random endpoint for large organisations? And would optimizing the query that Octopoes sends to XTDB help?
If you want to get random items from a list you first need to build the complete list of items. Building the list of items in this case means doing a join of all OOIs with all the scan profiles. I currently don't see how this can be implemented in a less expensive way. In my opinion the way to solve this is with a better scheduling algorithm that doesn't rely on requesting random OOIs from XTDB.
As mentioned earlier, the random call is there to avoid missing out on some data the scheduler did not know about. Because of this, there is no need to randomize the actual output, or even to select records at random. We can get the same results by picking a random offset (within the total count of objects) and selecting the next X records from there.
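The random-offset idea can be sketched roughly as follows. This is only an illustration under assumptions, not Octopoes code: `fetch_page` is a hypothetical callable standing in for whatever paginated query returns `limit` records starting at `offset`, and `total` is the total object count. Wrapping around to the start of the list when the window runs past the end is also an assumption, added so that every record has the same chance of being covered.

```python
import random
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def window_at_random_offset(
    fetch_page: Callable[[int, int], List[T]],  # hypothetical: (offset, limit) -> records
    total: int,
    limit: int,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Return up to `limit` consecutive records starting at a random offset."""
    if total <= 0 or limit <= 0:
        return []
    rng = rng or random.Random()
    limit = min(limit, total)
    # Pick a starting position anywhere within the total count of objects
    offset = rng.randrange(total)
    records = fetch_page(offset, limit)
    # Assumption: wrap around to the start if the window runs past the end
    missing = limit - len(records)
    if missing > 0:
        records = records + fetch_page(0, missing)
    return records


# Usage with an in-memory list standing in for the database
items = list(range(10))


def page(offset: int, limit: int) -> List[int]:
    return items[offset:offset + limit]


sample = window_at_random_offset(page, total=len(items), limit=3)
```

Whether an offset query is actually cheaper in XTDB than the current random selection is not established in this thread and would need to be measured.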
I am not aware of any scheduling algorithms that can function with a partial set of data. The required functionality being:
What is the typical response time for the random endpoint for large organisations?
Looking at the kattest01 server, 12-15s seems normal.
Can we increase this timeout through an environment variable? It looks like no new tasks were scheduled since 13 June, almost a month ago.
Yes, that can be implemented
Regarding a better scheduling algorithm that doesn't rely on requesting random OOIs from XTDB: a suggested solution is being worked on here: https://github.com/minvws/nl-kat-coordination/issues/204
I configured SCHEDULER_OCTOPOES_REQUEST_TIMEOUT=120 on kattest01 and now the scheduler is working again. Should we still document this setting?
Mula has a timeout of 5s with 5 retries when requesting the random OOI endpoint from Octopoes. For large organisations, Octopoes can take more than 5s to respond, causing Mula to time out and retry the request. Because Octopoes doesn't seem to cancel its request to XTDB, each retry leaves another query running in XTDB. These queries create system load, but their results are never used because Mula has already timed out.
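The pile-up described above can be illustrated with a small simulation. This is a sketch under assumptions, not Mula's actual client code: `call_with_attempts` and `slow_backend` are hypothetical names, "5 retries" is modelled as 5 attempts in total, and the 12s query duration is taken from the response times reported for kattest01. How Mula reads `SCHEDULER_OCTOPOES_REQUEST_TIMEOUT` internally is not shown in this thread; the `os.environ` lookup is only one plausible way to do it.

```python
import os
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Assumption: timeout in seconds from the environment, defaulting to the 5s described above
REQUEST_TIMEOUT = float(os.environ.get("SCHEDULER_OCTOPOES_REQUEST_TIMEOUT", "5"))


def call_with_attempts(request: Callable[[float], T], timeout: float, attempts: int) -> T:
    """Call `request(timeout)`, trying again on TimeoutError, at most `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return request(timeout)
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError(f"no response after {attempts} attempts of {timeout}s") from last_error


# Simulated backend: every call starts a query that needs 12s and is never cancelled
started_queries: List[float] = []


def slow_backend(timeout: float) -> str:
    started_queries.append(timeout)  # the query starts regardless of the client's patience
    if timeout < 12:
        raise TimeoutError  # the client gives up before the query finishes
    return "ooi"


try:
    call_with_attempts(slow_backend, timeout=5, attempts=5)
except TimeoutError:
    pass
wasted = len(started_queries)  # queries started whose results nobody read

started_queries.clear()
result = call_with_attempts(slow_backend, timeout=120, attempts=5)
```

With a 5s timeout, every attempt starts a query and none of the results is used; with a 120s timeout, the first attempt succeeds and only one query runs.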
Relevant log from Mula on kattest01: