roadrunner-server / roadrunner

🤯 High-performance PHP application server, process manager written in Go and powered with plugins
https://docs.roadrunner.dev
MIT License
7.83k stars 407 forks source link

[🐛 BUG]: Streaming responses can experience a lock-up when the client disconnects early #1830

Closed segrax closed 8 months ago

segrax commented 8 months ago

No duplicates 🥲.

What happened?

When a browser disconnects early while receiving a response streamed from PHP, the worker / PHP can lock up.

The worker never recovers by itself

The chunk size heavily effects the chance of this occurring (eg. 512000 bytes is likely to lockup first try, where as 512 bytes could take 100s of attempts, and 51200 bytes will lockup within a few)

This is a dump of the last ~100 executed PHP functions/class methods executed by the php process, the final instruction is where the process is left stuck: fwrite(Unknown Type, '', 51216)

This is execution of "Spiral\RoadRunner\Http\HttpWorker::respondStream" (first instruction here is line 125: $body->next();)

[2023-12-28 06:51:09]: [BEGIN] Generator::next()
[2023-12-28 06:51:09]: [BEGIN] {closure}()
[2023-12-28 06:51:09]: [BEGIN] feof(Unknown Type)
[2023-12-28 06:51:09]: [END]   feof(): false
[2023-12-28 06:51:09]: [BEGIN] fread(Unknown Type, 51200)
[2023-12-28 06:51:09]: [END]   fread(): '�)�6�@JSUs)'$'V�<�\�ӌ���U%+�
                                                                     h�Ys�s4�   �=�}Oe��lh �jP�'
[2023-12-28 06:51:09]: [END]   {closure}(): '�)�6�@JSUs)'$'V�<�\�ӌ���U%+�
                                                                         h�Ys�s4�       �=�}Oe��lh �jP�'
[2023-12-28 06:51:09]: [END]   next(): NULL
[2023-12-28 06:51:09]: [BEGIN] Generator::valid()
[2023-12-28 06:51:09]: [END]   valid(): true
[2023-12-28 06:51:09]: [BEGIN] Generator::current()
[2023-12-28 06:51:09]: [END]   current(): '�)�6�@JSUs)'$'V�<�\�ӌ���U%+�
                                                                       h�Ys�s4� �=�}Oe��lh �jP�'
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::getPayload('Spiral\RoadRunner\Message\Command\StreamStop')
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::findPayload('Spiral\RoadRunner\Message\Command\StreamStop')
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::pullPayload()
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::haveToPing()
[2023-12-28 06:51:09]: [END]   haveToPing(): NULL
[2023-12-28 06:51:09]: [END]   pullPayload(): NULL
[2023-12-28 06:51:09]: [END]   findPayload(): NULL
[2023-12-28 06:51:09]: [END]   getPayload(): NULL
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Payload::__construct('�)�6�@JSUs)'$'V�<�\�ӌ���U%+�
                                                                                                   h�Ys�s4�     �=�}Oe��lh �jP�', NULL, false)
[2023-12-28 06:51:09]: [END]   __construct(): NULL
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::respond(Object(Spiral\RoadRunner\Payload))
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::send('�)�6�@JSUs)'$'V�<�\�ӌ���U%+�
                                                                                           h�Ys�s4�     �=�}Oe��lh �jP�', '', false)
[2023-12-28 06:51:09]: [BEGIN] Spiral\Goridge\Frame::__construct('�)�6�@JSUs)'$'V�<�\�ӌ���U%+�
                                                                                              h�Ys�s4�  �=�}Oe��lh �jP�', Array)
[2023-12-28 06:51:09]: [END]   __construct(): NULL
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::sendFrame(Object(Spiral\Goridge\Frame))
[2023-12-28 06:51:09]: [BEGIN] Spiral\Goridge\StreamRelay::send(Object(Spiral\Goridge\Frame))
[2023-12-28 06:51:09]: [BEGIN] Spiral\Goridge\Frame::packFrame(Object(Spiral\Goridge\Frame))
[2023-12-28 06:51:09]: [BEGIN] pack('CCL', 20, 0, 51200)
[2023-12-28 06:51:09]: [END]   pack(): ''
[2023-12-28 06:51:09]: [BEGIN] crc32('')
[2023-12-28 06:51:09]: [END]   crc32(): 3057198806
[2023-12-28 06:51:09]: [BEGIN] pack('LCCL*', 3057198806, 1, 0, 0)
[2023-12-28 06:51:09]: [END]   pack(): '�&9�'
[2023-12-28 06:51:09]: [END]   packFrame(): ''
[2023-12-28 06:51:09]: [BEGIN] error_clear_last()
[2023-12-28 06:51:09]: [END]   error_clear_last(): NULL
[2023-12-28 06:51:09]: [BEGIN] fwrite(Unknown Type, '', 51216)
[2023-12-28 06:51:09]: [END]   fwrite(): 51216
[2023-12-28 06:51:09]: [END]   send(): NULL
[2023-12-28 06:51:09]: [END]   sendFrame(): NULL
[2023-12-28 06:51:09]: [END]   send(): NULL
[2023-12-28 06:51:09]: [END]   respond(): NULL
[2023-12-28 06:51:09]: [BEGIN] Generator::next()
[2023-12-28 06:51:09]: [BEGIN] {closure}()
[2023-12-28 06:51:09]: [BEGIN] feof(Unknown Type)
[2023-12-28 06:51:09]: [END]   feof(): false
[2023-12-28 06:51:09]: [BEGIN] fread(Unknown Type, 51200)
[2023-12-28 06:51:09]: [END]   fread(): '�攠<y��C���6�Mh�
                                                         yH�f7��?�_i�CE<�@�Dm�"��$=fLzT��gI{�M��
ˤeq�(ZL�+���aQk'0T���F�P�X+��l찻6���7��d���5U[�d��`u�N����ٽ�c��Oq�Jj���@}J��Z��D�|�.�"��t        Z޹��/��Z<�=�
                                                                                         ���cd�����/D�W���Bԕ���ɨ�-�u@}�"��VC�:M�=}��<��l1�wZ�ւ�DS���϶�xQv������1���^��{���i8����^I�P�b&�        x���_�d�f-�a�؆KG��|D��T]���k�\em�[���b6��v_�Q×A��EY3���|�,�����~4p��
��0w�`�M���pc$n%з�ȇ*Q�C*~mN�&�����J��`����'
[2023-12-28 06:51:09]: [END]   {closure}(): '�攠<y��C���6�Mh�
                                                             yH�f7��?�_i�CE<�@�Dm�"��$=fLzT��gI{�M��
ˤeq�(ZL�+���aQk'0T���F�P�X+��l찻6���7��d���5U[�d��`u�N����ٽ�c��Oq�Jj���@}J��Z��D�|�.�"��t            Z޹��/��Z<�=�
                                                                                         ���cd�����/D�W���Bԕ���ɨ�-�u@}�"��VC�:M�=}��<��l1�wZ�ւ�DS���϶�xQv������1���^��{���i8����^I�P�b&�        x���_�d�f-�a�؆KG��|D��T]���k�\em�[���b6��v_�Q×A��EY3���|�,�����~4p��
��0w�`�M���pc$n%з�ȇ*Q�C*~mN�&�����J��`����'
[2023-12-28 06:51:09]: [END]   next(): NULL
[2023-12-28 06:51:09]: [BEGIN] Generator::valid()
[2023-12-28 06:51:09]: [END]   valid(): true
[2023-12-28 06:51:09]: [BEGIN] Generator::current()
[2023-12-28 06:51:09]: [END]   current(): '�攠<y��C���6�Mh�
                                                           yH�f7��?�_i�CE<�@�Dm�"��$=fLzT��gI{�M��
ˤeq�(ZL�+���aQk'0T���F�P�X+��l찻6���7��d���5U[�d��`u�N����ٽ�c��Oq�Jj���@}J��Z��D�|�.�"��t          Z޹��/��Z<�=�
                                                                                         ���cd�����/D�W���Bԕ���ɨ�-�u@}�"��VC�:M�=}��<��l1�wZ�ւ�DS���϶�xQv������1���^��{���i8����^I�P�b&�        x���_�d�f-�a�؆KG��|D��T]���k�\em�[���b6��v_�Q×A��EY3���|�,�����~4p��
��0w�`�M���pc$n%з�ȇ*Q�C*~mN�&�����J��`����'
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::getPayload('Spiral\RoadRunner\Message\Command\StreamStop')
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::findPayload('Spiral\RoadRunner\Message\Command\StreamStop')
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::pullPayload()
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::haveToPing()
[2023-12-28 06:51:09]: [END]   haveToPing(): NULL
[2023-12-28 06:51:09]: [END]   pullPayload(): NULL
[2023-12-28 06:51:09]: [END]   findPayload(): NULL
[2023-12-28 06:51:09]: [END]   getPayload(): NULL
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Payload::__construct('�攠<y��C���6�Mh�
                                                                                       yH�f7��?�_i�CE<�@�Dm�"��$=fLzT��gI{�M��
ˤeq�(ZL�+���aQk'0T���F�P�X+��l찻6���7��d���5U[�d��`u�N����ٽ�c��Oq�Jj���@}J��Z��D�|�.�"��t                                      Z޹��/��Z<�=�
                                                                                         ���cd�����/D�W���Bԕ���ɨ�-�u@}�"��VC�:M�=}��<��l1�wZ�ւ�DS���϶�xQv������1���^��{���i8����^I�P�b&�        x���_�d�f-�a�؆KG��|D��T]���k�\em�[���b6��v_�Q×A��EY3���|�,�����~4p��
��0w�`�M���pc$n%з�ȇ*Q�C*~mN�&�����J��`����', NULL, false)
[2023-12-28 06:51:09]: [END]   __construct(): NULL
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::respond(Object(Spiral\RoadRunner\Payload))
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::send('�攠<y��C���6�Mh�
                                                                               yH�f7��?�_i�CE<�@�Dm�"��$=fLzT��gI{�M��
ˤeq�(ZL�+���aQk'0T���F�P�X+��l찻6���7��d���5U[�d��`u�N����ٽ�c��Oq�Jj���@}J��Z��D�|�.�"��t                              Z޹��/��Z<�=�
                                                                                         ���cd�����/D�W���Bԕ���ɨ�-�u@}�"��VC�:M�=}��<��l1�wZ�ւ�DS���϶�xQv������1���^��{���i8����^I�P�b&�        x���_�d�f-�a�؆KG��|D��T]���k�\em�[���b6��v_�Q×A��EY3���|�,�����~4p��
��0w�`�M���pc$n%з�ȇ*Q�C*~mN�&�����J��`����', '', false)
[2023-12-28 06:51:09]: [BEGIN] Spiral\Goridge\Frame::__construct('�攠<y��C���6�Mh�
                                                                                  yH�f7��?�_i�CE<�@�Dm�"��$=fLzT��gI{�M��
ˤeq�(ZL�+���aQk'0T���F�P�X+��l찻6���7��d���5U[�d��`u�N����ٽ�c��Oq�Jj���@}J��Z��D�|�.�"��t                                 Z޹��/��Z<�=�
                                                                                         ���cd�����/D�W���Bԕ���ɨ�-�u@}�"��VC�:M�=}��<��l1�wZ�ւ�DS���϶�xQv������1���^��{���i8����^I�P�b&�        x���_�d�f-�a�؆KG��|D��T]���k�\em�[���b6��v_�Q×A��EY3���|�,�����~4p��
��0w�`�M���pc$n%з�ȇ*Q�C*~mN�&�����J��`����', Array)
[2023-12-28 06:51:09]: [END]   __construct(): NULL
[2023-12-28 06:51:09]: [BEGIN] Spiral\RoadRunner\Worker::sendFrame(Object(Spiral\Goridge\Frame))
[2023-12-28 06:51:09]: [BEGIN] Spiral\Goridge\StreamRelay::send(Object(Spiral\Goridge\Frame))
[2023-12-28 06:51:09]: [BEGIN] Spiral\Goridge\Frame::packFrame(Object(Spiral\Goridge\Frame))
[2023-12-28 06:51:09]: [BEGIN] pack('CCL', 20, 0, 51200)
[2023-12-28 06:51:09]: [END]   pack(): ''
[2023-12-28 06:51:09]: [BEGIN] crc32('')
[2023-12-28 06:51:09]: [END]   crc32(): 3057198806
[2023-12-28 06:51:09]: [BEGIN] pack('LCCL*', 3057198806, 1, 0, 0)
[2023-12-28 06:51:09]: [END]   pack(): '�&9�'
[2023-12-28 06:51:09]: [END]   packFrame(): ''
[2023-12-28 06:51:09]: [BEGIN] error_clear_last()
[2023-12-28 06:51:09]: [END]   error_clear_last(): NULL
[2023-12-28 06:51:09]: [BEGIN] fwrite(Unknown Type, '', 51216)

Version (rr --version)

2023.3.8 (build time: 2023-12-14T16:05:26+0000, go1.21.5), OS: linux, arch: amd64

How to reproduce the issue?

The worker.php below is almost identical to the example provided in the documentation, except its reading from a large (5gb) file (DIR . '/IMG_0192.MOV') -- size only important to the point of giving you enough time to cancel

  1. Put a large file with worker.php, named 'IMG_0192.MOV'
  2. Start RR
  3. Access with a browser (http://localhost:8080/)
  4. Start download
  5. Cancel download
  6. Restart download (may need to repeat 5 & 6 numerous times)
  7. it will now sit at 0 bytes, and eventually fail with a 500 (console will error with 'no free workers')

rr.yaml

version: '3'
rpc:
  listen: tcp://127.0.0.1:6001
server:
  command: "php worker.php"
http:
  address: "0.0.0.0:8080"
  access_logs: false
  middleware: [ ]
  pool:
    num_workers: 1

worker.php

<?php

require __DIR__ . '/vendor/autoload.php';

use Spiral\RoadRunner;

ini_set('display_errors', 'stderr');
require __DIR__ . "/vendor/autoload.php";

$worker = RoadRunner\Worker::create();
$http = new RoadRunner\Http\HttpWorker($worker);
$read = static function (): Generator {
    $handle = fopen(__DIR__ . '/IMG_0192.MOV', 'rb');

    if ($handle === false) {
        throw new \RuntimeException("Cannot open file");
    }

    try {
        while (!feof($handle)) {
            yield fread($handle, 512000);
        }
    } catch (Spiral\RoadRunner\Http\Exception\StreamStoppedException) {
        // Just stop sending data

        return;
    } finally {
        fclose($handle);
    }
};

try {
    while ($req = $http->waitRequest()) {
        $http->respond(200, $read());
    }
} catch (\Throwable $e) {
    $worker->error($e->getMessage());
}

Relevant log output

Attaching to rr_test-roadrunner-1
rr_test-roadrunner-1  | 2023-12-28T06:50:59+0000        DEBUG   rpc             plugin was started      {"address": "tcp://127.0.0.1:6001", "list of the plugins with RPC methods:": ["resetter", "app", "lock", "informer"]}
rr_test-roadrunner-1  | 2023-12-28T06:50:59+0000        DEBUG   server          worker is allocated     {"pid": 20, "internal_event_name": "EventWorkerConstruct"}
rr_test-roadrunner-1  | 2023-12-28T06:50:59+0000        DEBUG   http            http server was started {"address": "0.0.0.0:8080"}
rr_test-roadrunner-1  | [INFO] RoadRunner server started; version: 2023.3.8, buildtime: 2023-12-14T16:05:26+0000
rr_test-roadrunner-1  | [INFO] sdnotify: not notified
rr_test-roadrunner-1  | 2023-12-28T06:51:00+0000        DEBUG   server          stream mode     {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:02+0000        ERROR   http            write response error    {"start": "2023-12-28T06:51:00+0000", "elapsed": "1.118407309s", "error": "write tcp 172.19.0.2:8080->172.19.0.1:41228: write: broken pipe"}
rr_test-roadrunner-1  | 2023-12-28T06:51:02+0000        INFO    http            http log        {"status": 200, "method": "GET", "URI": "/", "remote_address": "172.19.0.1:41228", "read_bytes": 0, "write_bytes": 140087290, "start": "2023-12-28T06:51:00+0000", "elapsed": "1.118555007s"}
rr_test-roadrunner-1  | 2023-12-28T06:51:02+0000        DEBUG   server          stream was canceled, sending stop bit   {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:02+0000        DEBUG   server          stop bit was sent, waiting for the response     {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:02+0000        DEBUG   server          fsm transition error    {"error": "fsm_recognizer: can't transition from state: ready"}
rr_test-roadrunner-1  | 2023-12-28T06:51:02+0000        DEBUG   server          stream has ended        {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:02+0000        DEBUG   server          stream mode     {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:04+0000        ERROR   http            write response error    {"start": "2023-12-28T06:51:02+0000", "elapsed": "1.590039152s", "error": "write tcp 172.19.0.2:8080->172.19.0.1:41240: write: broken pipe"}
rr_test-roadrunner-1  | 2023-12-28T06:51:04+0000        INFO    http            http log        {"status": 200, "method": "GET", "URI": "/", "remote_address": "172.19.0.1:41240", "read_bytes": 0, "write_bytes": 201220090, "start": "2023-12-28T06:51:02+0000", "elapsed": "1.59017066s"}
rr_test-roadrunner-1  | 2023-12-28T06:51:04+0000        DEBUG   server          stream was canceled, sending stop bit   {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:04+0000        DEBUG   server          stop bit was sent, waiting for the response     {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:04+0000        DEBUG   server          fsm transition error    {"error": "fsm_recognizer: can't transition from state: ready"}
rr_test-roadrunner-1  | 2023-12-28T06:51:04+0000        DEBUG   server          stream has ended        {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:04+0000        DEBUG   server          stream mode     {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:06+0000        ERROR   http            write response error    {"start": "2023-12-28T06:51:04+0000", "elapsed": "1.356916787s", "error": "write tcp 172.19.0.2:8080->172.19.0.1:41244: write: broken pipe"}
rr_test-roadrunner-1  | 2023-12-28T06:51:06+0000        INFO    http            http log        {"status": 200, "method": "GET", "URI": "/", "remote_address": "172.19.0.1:41244", "read_bytes": 0, "write_bytes": 172701690, "start": "2023-12-28T06:51:04+0000", "elapsed": "1.357023368s"}
rr_test-roadrunner-1  | 2023-12-28T06:51:06+0000        DEBUG   server          stream was canceled, sending stop bit   {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:06+0000        DEBUG   server          stop bit was sent, waiting for the response     {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:06+0000        DEBUG   server          fsm transition error    {"error": "fsm_recognizer: can't transition from state: ready"}
rr_test-roadrunner-1  | 2023-12-28T06:51:06+0000        DEBUG   server          stream has ended        {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:06+0000        DEBUG   server          stream mode     {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:08+0000        ERROR   http            write response error    {"start": "2023-12-28T06:51:06+0000", "elapsed": "1.332638731s", "error": "write tcp 172.19.0.2:8080->172.19.0.1:52282: write: broken pipe"}
rr_test-roadrunner-1  | 2023-12-28T06:51:08+0000        INFO    http            http log        {"status": 200, "method": "GET", "URI": "/", "remote_address": "172.19.0.1:52282", "read_bytes": 0, "write_bytes": 171472890, "start": "2023-12-28T06:51:06+0000", "elapsed": "1.332837923s"}
rr_test-roadrunner-1  | 2023-12-28T06:51:08+0000        DEBUG   server          stream was canceled, sending stop bit   {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:08+0000        DEBUG   server          stop bit was sent, waiting for the response     {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:08+0000        DEBUG   server          fsm transition error    {"error": "fsm_recognizer: can't transition from state: ready"}
rr_test-roadrunner-1  | 2023-12-28T06:51:08+0000        DEBUG   server          stream has ended        {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:08+0000        DEBUG   server          stream mode     {"pid": 20}
rr_test-roadrunner-1  | 2023-12-28T06:51:09+0000        ERROR   http            write response error    {"start": "2023-12-28T06:51:08+0000", "elapsed": "891.974786ms", "error": "write tcp 172.19.0.2:8080->172.19.0.1:52296: write: broken pipe"}
rr_test-roadrunner-1  | 2023-12-28T06:51:09+0000        INFO    http            http log        {"status": 200, "method": "GET", "URI": "/", "remote_address": "172.19.0.1:52296", "read_bytes": 0, "write_bytes": 114982578, "start": "2023-12-28T06:51:08+0000", "elapsed": "892.117596ms"}
rr_test-roadrunner-1  | 2023-12-28T06:52:10+0000        ERROR   server          no free workers in the pool, wait timeout exceed        {"reason": "no free workers", "internal_event_name": "EventNoFreeWorkers", "error": "worker_watcher_get_free_worker: NoFreeWorkers:\n\tcontext deadline exceeded"}
rr_test-roadrunner-1  | 2023-12-28T06:52:10+0000        ERROR   http            execute {"start": "2023-12-28T06:51:10+0000", "elapsed": "1m0.001002103s", "error": "static_pool_exec: NoFreeWorkers:\n\tstatic_pool_exec:\n\tworker_watcher_get_free_worker:\n\tcontext deadline exceeded"}
rr_test-roadrunner-1  | 2023-12-28T06:52:10+0000        INFO    http            http log        {"status": 500, "method": "GET", "URI": "/", "remote_address": "172.19.0.1:52308", "read_bytes": 0, "write_bytes": 0, "start": "2023-12-28T06:51:10+0000", "elapsed": "1m0.001074533s"}
rustatian commented 8 months ago

Hey @segrax 👋 Thanks for the report 👍 I'll investigate the problem.

rustatian commented 8 months ago

Looks like, that worker returned to the pool in the ready state, but actually, by some reason, doesn't really located in the pool... Really interesting case.

rustatian commented 8 months ago

Ok, firstly, I found a problem why exec_ttl had no power in your case (will be fixed, from the RR side). Secondly, I found a problem why worker hangs on fwite (and this needs to be fixed from the PHP side).

rustatian commented 8 months ago

The fix will be released next Thursday.

segrax commented 8 months ago

That's great news! Appreciate the swift response @rustatian