swow / swow

Coroutine-based cross-platform concurrent I/O engine.
Apache License 2.0
1.19k stars 113 forks source link

[bug] sleep阻塞未按预期完成 #160

Closed he426100 closed 1 year ago

he426100 commented 1 year ago

Describe the bug / 问题描述 完成sleep的时间比预期时间提前,测试了msleep、usleep、channel->pop都一样

To Reproduce / 如何复现

<?php

ini_set('display_errors', 'on');
ini_set('display_startup_errors', 'on');
ini_set('memory_limit', '1G');

error_reporting(E_ALL);
date_default_timezone_set('Asia/Shanghai');

// for只是加速复现,没有for也一样会抛异常
for ($n = 10_000; $n--;) {
    \Swow\Coroutine::run(static function () {
        while (1) {
            $s = microtime(true);
            try {
                (new \Swow\Channel)->pop(1000);
            } catch (\Swow\ChannelException) {
            }
            // usleep(1000_000);
            // sleep(1);
            $e = microtime(true);
            if (($diff = $e - $s) < 1) {
                throw new \Exception('current: ' . $s . ', now: ' . $e . ', diff: ' . $diff);
            }
        }
    });
}

\Swow\Sync\waitAll();

Expected behavior / 正确的行为 持续运行不报错

Outputs / 程序输出

Uncaught Exception: current: 1675083606.2175, now: 1675083606.9892, diff: 0.77170491218567

Runtime Environment / 运行环境 OS:

Linux DESKTOP-KB66NQK 5.15.83.1-microsoft-standard-WSL2 #1 SMP Wed Dec 14 23:16:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

PHP:

PHP 8.1.14 (cli) (built: Jan 18 2023 05:00:47) (ZTS)
Copyright (c) The PHP Group
Zend Engine v4.1.14, Copyright (c) Zend Technologies
    with Zend OPcache v8.1.14, Copyright (c), by Zend Technologies

Swow:

Swow

Status => enabled
Author => Swow Team
Link => https://github.com/swow/swow
Contact => Twosee <twosee@php.net>
Version => 1.1.1-dev ( ZTS )
Built => Jan 18 2023 04:59:43
Context => boost-context
Scheduler => libuv-event
SSL => OpenSSL 3.0.7 1 Nov 2022
cURL => libcurl/7.87.0 OpenSSL/3.0.7 zlib/1.2.13 brotli/1.0.9 zstd/1.5.2 libssh2/1.10.0 nghttp2/1.51.0

Additional context / 补充说明 测试这个是发现swow下crontab任务会提前执行,反复测试后推测sleep实际阻塞的时间不符合预期。swoole下也可以触发,需要很长时间,swow是很快就会出现,协程数量越多时间偏差越大.

dixyes commented 1 year ago

裸机linux / hyperv linux没法复现 怀疑跟wsl2有关,于是我装了个wsl2(ubuntu jammy)也没复现 ,稍晚点换个试试

石乐志 可以复现

he426100 commented 1 year ago

@dixyes codespace就可以复现 codespace环境

@he426100 ➜ /workspaces/test $ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.5 LTS
Release:        20.04
Codename:       focal
@he426100 ➜ /workspaces/test $ uname -a
Linux codespaces-0bc4db 5.4.0-1100-azure #106~18.04.1-Ubuntu SMP Mon Dec 12 21:49:35 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
@he426100 ➜ /workspaces/test $ box php -v
PHP 8.1.13 (cli) (built: Jan  3 2023 05:09:42) (ZTS)
Copyright (c) The PHP Group
Zend Engine v4.1.13, Copyright (c) Zend Technologies
    with Zend OPcache v8.1.13, Copyright (c), by Zend Technologies
@he426100 ➜ /workspaces/test $ box php --ri swow

Swow

Status => enabled
Author => Swow Team
Link => https://github.com/swow/swow
Contact => Twosee <twosee@php.net>
Version => 1.0.0 ( ZTS )
Built => Jan  3 2023 05:08:04
Context => boost-context
Scheduler => libuv-event
With => OpenSSL 3.0.7 1 Nov 2022, cURL 7.87.0

测试结果
image

test.log
test.log

dixyes commented 1 year ago

我理解错了demo 还以为要取消那两个sleep的注释

先调查一下

twose commented 1 year ago

https://github.com/YvetteLau/Step-By-Step/issues/21

可以先看下这个,原理类似,但目前有个问题是超时返回比预期早而不是比预期晚,还需要再看看

twose commented 1 year ago

https://stackoverflow.com/questions/42315493/why-is-libuv-timer-drifting

libuv 定时器本身就会出现提前超时的情况,在高负载运算或大量系统调用的情况下发生几率更高(如同一时间大量创建协程,包含了大量的系统调用、内存申请操作),但即便有误差,一般都在定时的大致范围内,如误差在1毫秒内(即0.001秒,显示为0.99x.....)。

issue 中的0.77170491218567 这个数字的误差有点大,怀疑与 WSL2 有关,暂时无法在其他系统上复现。

定时器的微小误差目前没有很好的解决方式,定时器从原理上来说就很难保证它的准确,node的定时器误差更是高达2~3ms。调研了 Golang 的 Sleep 误差很小且稳定,但 strace 后发现 Golang 直接使用了 futex,每开一个 goroutine 进行 Sleep 就会阻塞一个线程,开销非常大。

想要更精准的定时器,最好是参考上面贴出来的一些参考,自己手动校正。

he426100 commented 1 year ago

跟wsl2无关,codespace可以复现 更改后的测试代码

<?php

ini_set('display_errors', 'on');
ini_set('display_startup_errors', 'on');
ini_set('memory_limit', '2G');

error_reporting(E_ALL);
date_default_timezone_set('Asia/Shanghai');

$start = microtime(true);
$wr = new \Swow\Sync\WaitReference();
\Swow\Coroutine::run(static function () use ($wr) {
    // for只是加速复现,没有for也一样会出错
    for ($n = 200_000; $n--;) {
        \Swow\Coroutine::run(static function () {
            while (1) {
                $s = microtime(true);
                try {
                    (new \Swow\Channel)->pop(1000);
                } catch (\Swow\ChannelException) {
                }
                // usleep(1000_000);
                // sleep(1);
                $e = microtime(true);
                if (($diff = $e - $s) < 1) {
                    echo '[' . microdate('Y-m-d H:i:s') . '] [' . \Swow\Coroutine::getCurrent()->getId() . '] current: ' . $s . ', now: ' . $e . ', diff: ' . $diff, PHP_EOL;
                    break;
                }
            }
        });
    }
});
\Swow\Coroutine::run(static function () use ($wr, $start) {
    while (1) {
        sleep(60);
        echo '[' . microdate('Y-m-d H:i:s') . '] 已耗时:' . (microtime(true) - $start) . ' s,剩余coroutine: ' . \Swow\Coroutine::count(), PHP_EOL;
    }
});
\Swow\Sync\WaitReference::wait($wr);
echo '[' . microdate('Y-m-d H:i:s') . '] 总耗时:' . (microtime(true) - $start) . ' s', PHP_EOL;

function microdate(string $format = 'c'): string
{
    return \DateTime::createFromFormat('0.u00 U', microtime())->setTimezone(new \DateTimeZone(date_default_timezone_get()))->format($format);
}

测试结果,协程数量到20w的时候超时提前了0.9s

@he426100 ➜ /workspaces/test $ tail -n 10  swow.log
[2023-02-04 10:10:42] [174714] current: 1675476642.4662, now: 1675476642.5737, diff: 0.10745811462402
[2023-02-04 10:10:42] [174715] current: 1675476642.4663, now: 1675476642.5738, diff: 0.10751414299011
[2023-02-04 10:10:42] [174716] current: 1675476642.4663, now: 1675476642.5738, diff: 0.10757088661194
[2023-02-04 10:10:42] [174717] current: 1675476642.4663, now: 1675476642.5739, diff: 0.10762500762939
[2023-02-04 10:10:42] [174718] current: 1675476642.4663, now: 1675476642.5739, diff: 0.10767889022827
[2023-02-04 10:10:42] [174719] current: 1675476642.4663, now: 1675476642.574, diff: 0.10773301124573
[2023-02-04 10:10:42] [174720] current: 1675476642.4663, now: 1675476642.5741, diff: 0.1077880859375
[2023-02-04 10:10:42] [174721] current: 1675476642.4663, now: 1675476642.5741, diff: 0.10784292221069
[2023-02-04 10:11:15] 已耗时:60.032107114792 s,剩余coroutine: 2

完整日志 swow.log.zip

twose commented 1 year ago
<?php

use Swow\Coroutine;
use Swow\Sync\WaitReference;

date_default_timezone_set('Asia/Shanghai');

$start = microtime(true);
$wr = new WaitReference();
Coroutine::run(static function () use ($wr) {
    for ($i = 0; $i < 100_000_000; $i++) {
        // mock CPU blocking
    }
    $s = microtime(true);
    try {
        (new \Swow\Channel)->pop(1000);
    } catch (\Swow\ChannelException) {
    }
    $e = microtime(true);
    if (($diff = $e - $s) < 1) {
        logger('current: ' . $s . ', now: ' . $e . ', diff: ' . $diff);
    }
});
WaitReference::wait($wr);
logger('总耗时:' . (microtime(true) - $start) . ' s');

function logger(string $message, string $level = 'log')
{
    echo sprintf(
        '%s %s %s %s%s',
        '[' . microdate('Y-m-d H:i:s.v') . ']',
        '[' . $level . ']',
        '[' . Coroutine::getCurrent()->getId() . ']',
        $message,
        "\n"
    );
}

function microdate(string $format = 'c'): string
{
    return \DateTime::createFromFormat('0.u00 U', microtime())->setTimezone(new \DateTimeZone(date_default_timezone_get()))->format($format);
}

由于机器性能不够,长时间的 CPU 阻塞导致的定时器漂移。 底层没有一个皆大欢喜的解决办法,除非使用 Watchdog 模块进行调度,可以减少误差,其实就是引入外部监管来对阻塞进行修正。 要么就牺牲总体性能,总是从系统取最新的时间,而不是拿事件循环的缓存时间,这里需要修改 libuv 底层代码实现。

这种情况我觉得应该和 CPU阻塞 同等处理,如通过预热协程或协程池的方式缓解CPU负载,稍等我写几个demo。

twose commented 1 year ago

不带 Watchdog (误差极大)

image

带Watchdog (误差相对稳定)

image

只需要加一行

Watchdog::run(1_000_000, -1, 0); // 阻塞超过1ms时进行一次缓冲0ms的调度
twose commented 1 year ago

另一种方式是在协程创建后进行一次缓冲操作

image

这样程序将会表现出与 nodejs 一致的行为,因为缓冲后定时器起始时间会得到校正,但定时器会由于程序阻塞而延后执行。

let s = Date.now();
setTimeout(function () {
    let e1 = Date.now();
    console.log('1:', e1 - s);
}, 1000);
for (let i = 0; i < 2 * 1000 * 1000 * 1000; i++) {
}
setTimeout(function () {
    let e2 = Date.now();
    console.log('2:', e2 - s);
}, 1000);
image
twose commented 1 year ago

定时提前触发的原因在于,由于长时间的阻塞,定时器的初始的缓存时间没有办法得到及时更新,因此初始时间+timeout时间就会过小,但microtime(true)是从系统拿的,会实时更新。

根本问题在于机器性能并不足以支撑这么多协程同时运行,20万个协程同时运行,假设每个协程中的逻辑代码需要1微秒的执行时间,全部执行一遍也需要200ms,协程调度一次在几百纳秒左右,这还是在我机器上的性能,如果在低配的机器上,再加上一些系统调用的开销,误差会非常巨大。

而在我们机器上测不出来超大误差的原因是,我们的机器性能足够支撑这么多协程运行…… 但我们可以通过 for 循环模拟更长的死循环来重现这个问题。

he426100 commented 1 year ago

明白了,感谢解答