userver-framework / userver

Production-ready C++ Asynchronous Framework with rich functionality
https://userver.tech

feat engine: add improved WorkStealingTaskQueue #577

Closed egor-bystepdev closed 1 month ago

itrofimow commented 2 months ago

I've run some benchmarks with this PR applied, and so far it looks really good: given the benchmark from https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/C%2B%2B/userver, a 64-core VM with 2 NUMA nodes:

itrofimow@wsq-test:~/app$ lscpu 
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel Xeon Processor (Icelake)
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           2
    Stepping:            0
    BogoMIPS:            3990.62
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc 
                         cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fa
                         ult invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512b
                         w avx512vl xsaveopt xsavec xgetbv1 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear ar
                         ch_capabilities
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   2 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    128 MiB (32 instances)
  L3:                    32 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-31
  NUMA node1 CPU(s):     32-63
Vulnerabilities:         
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
  Srbds:                 Not affected
  Tsx async abort:       Not affected

and the application running with 24 worker threads, 7 ev threads and a single timer thread, pinned to the second NUMA node via taskset -c 32-63 ./wsq -c userver_configs/static_config.yaml, I'm seeing these results:


work-stealing-task-queue:

itrofimow@wsq-test:~/app$ taskset -c 1-30 twrk -c 256 -t 30 -D 3s -d 300s --pin-cpus http://localhost:8080/plaintext
Running 5m test @ http://localhost:8080/plaintext
  30 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   222.83us  561.04us  292.95ms   19.00us   97.82%
    Req/Sec    42.87k     1.31k    49.65k    29.94k    73.16%
  383857503 requests in 5.00m, 53.98GB read
Requests/sec: 1279524.66
Transfer/sec:    184.26MB


global-task-queue:

itrofimow@wsq-test:~/app$ taskset -c 1-30 twrk -c 256 -t 30 -D 3s -d 300s --pin-cpus http://localhost:8080/plaintext
Running 5m test @ http://localhost:8080/plaintext
  30 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   220.95us  220.71us   21.37ms   23.00us   97.75%
    Req/Sec    37.46k   617.45     41.17k    32.47k    70.04%
  335445125 requests in 5.00m, 47.17GB read
Requests/sec: 1118150.09
Transfer/sec:    161.02MB


which is an impressive ~14.4% throughput increase (1279524.66 / 1118150.09 ≈ 1.144). Keep in mind that the application in this benchmark is overtuned and not completely realistic, but still. I'm also not seeing any measurable difference between global-task-queue and current develop, so from a performance point of view this PR is a gem.
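
For readers unfamiliar with the technique, here's a minimal sketch of the idea behind work stealing, written purely for illustration (the class and names below are made up; the PR's actual WorkStealingTaskQueue is lock-free and far more involved). The point is that with a single global queue every worker contends on one shared structure for every push/pop, whereas with work stealing each worker mostly touches its own deque and only contends on the slow path, when it has run out of local work:

// Conceptual sketch only, NOT the queue added by this PR: the real
// implementation would use lock-free deques and parking/waking logic;
// this illustration uses a mutex per deque for brevity.
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

using Task = std::function<void()>;

class WorkStealingSketch {
 public:
  explicit WorkStealingSketch(std::size_t workers) : queues_(workers) {}

  // A worker pushes new tasks onto the back of its own deque.
  void Push(std::size_t self, Task task) {
    std::lock_guard lock(queues_[self].mutex);
    queues_[self].tasks.push_back(std::move(task));
  }

  // A worker first pops from its own deque (LIFO end, cache-warm), and
  // only when that runs dry does it try to steal the oldest task (FIFO
  // end) from a randomly chosen victim. Unlike a global queue, the
  // common path never touches a structure shared with other workers.
  std::optional<Task> Pop(std::size_t self) {
    {
      std::lock_guard lock(queues_[self].mutex);
      auto& own = queues_[self].tasks;
      if (!own.empty()) {
        Task task = std::move(own.back());
        own.pop_back();
        return task;
      }
    }
    thread_local std::minstd_rand rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> dist(0, queues_.size() - 1);
    const std::size_t start = dist(rng);
    for (std::size_t i = 0; i < queues_.size(); ++i) {
      auto& victim = queues_[(start + i) % queues_.size()];
      std::lock_guard lock(victim.mutex);
      if (!victim.tasks.empty()) {
        Task task = std::move(victim.tasks.front());
        victim.tasks.pop_front();
        return task;
      }
    }
    return std::nullopt;  // nothing anywhere; a real queue would park
  }

 private:
  struct PerWorker {
    std::mutex mutex;
    std::deque<Task> tasks;
  };
  std::vector<PerWorker> queues_;
};

The sketch trades away all of the real queue's subtlety (lock-free deque operations, waking idle workers) for readability, but it shows where the contention goes.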

Note that I haven't done any correctness or latency tests yet.