status-im / nim-taskpools

Lightweight, energy-efficient, easily auditable threadpool
Other
101 stars 5 forks source link

Support more than 255 cores #15

Open mratsim opened 2 years ago

mratsim commented 2 years ago
NOT 2022-03-24 18:33:33.641+00:00 Starting metrics HTTP server               topics="beacnde" url=http://aa.bb.cc.dd:pppp/metrics
WRN 2022-03-24 18:33:33.641+00:00 Unknown constants in config file           unknowns="@[\"TERMINAL_BLOCK_HASH_ACTIVATION_EPOCH\"]"
/root/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/threads.nim(165) threadProcWrapper
/root/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/threads.nim(156) threadProcWrapStackFrame
/root/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/threads.nim(141) threadProcWrapDispatch
/root/nimbus-eth2/vendor/nim-taskpools/taskpools/taskpools.nim(165) workerEntryFn
/root/nimbus-eth2/vendor/nim-taskpools/taskpools/taskpools.nim(130) setupWorker
/root/nimbus-eth2/vendor/nim-taskpools/taskpools/instrumentation/contracts.nim(91) allocate
/root/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/assertions.nim(29) failedAssertImpl
/root/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/assertions.nim(22) raiseAssert
/root/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: /root/nimbus-eth2/vendor/nim-taskpools/taskpools/instrumentation/contracts.nim(91, 15) `
capacity <= TP_MaxWorkers` 
    Contract violated for pre-condition at sparsesets.nim:41
        capacity <= TP_MaxWorkers
    The following values are contrary to expectations:
        256 <= 255  [Worker 2 on taskpool 0x0000559184beab80]
 [AssertionError]
tsoj commented 9 months ago

As long as this (or some other) limit still exists, it would be nice if TP_MaxWorkers or some wrapper for it would be public.

(In my case the user selects the number of threads, so being able to tell the user the max allowed value would be preferable to an exception)

mratsim commented 9 months ago

After some mulling over, I have found a way to fix that and implemented the fix in Constantine's threadpool.

The limitation comes from the SparseSet data structure: https://github.com/status-im/nim-taskpools/blob/15e23ef1cf0860330dcc32f50fcce5f840031e28/taskpools/sparsesets.nim#L14-L38

When a threads run out of tasks in its queue, it needs to select other threads to try to steal from. The random sampling of the next victim needs to be fast, hence the data structure needs to be able to guarantee you pick only from unpicked threads (sampling without replacement)

With a naive set, you would need to uncompress it to a sequence of only unvisited threads before randomly picking from it. With a sparseset, there is no need. But sparsesets are quite heavy in terms of memory, 2n, n the number of threads, hence allowing up to 255 (a byte) threads in those to limit memory.

An easy way would be to change from a byte to an uint32, or another way is to find a function that guarantees that over a range [0, N) each number would be visited, and only once. As a bonus we would like this to have a non-trivial random distribution (i.e. not incrementing and decrementing 1) for better load-balance/distributing contention.

There are 2 such functions that are fast and straightforward to implement:

I've implementing the first one in Constantine's threadpool:

Usage: https://github.com/mratsim/constantine/blob/777cf55628ceee1a29a59c6e70cff7000ac0a195/constantine/threadpool/threadpool.nim#L655-L666

which would map to modify randomPick in nim-taskpools: https://github.com/status-im/nim-taskpools/blob/15e23ef1cf0860330dcc32f50fcce5f840031e28/taskpools/taskpools.nim#L200-L215

Now the state is only 2 integers, however many threads there are.