fix: coro-channel attempts to resume running thread

Bilal2453 commented 2 months ago

Consider the following minimal-reproduction code:

local ssocket = {
  -- will finish the write instantly
  write = function(self, chunk, callback)
    callback()
    return true
  end,

  -- irrelevant, just to make the example work
  is_closing = function()
    return true
  end 
}

local write = require("coro-channel").wrapWrite(ssocket)
write("") -- it is important to provide any value here, because a nil value will close the socket instead of attempting a write

This will produce the following error

ncaught exception:
cannot resume running coroutine
stack traceback:
    /home/bilal/deps/coro-channel.lua:16: in function 'assertResume'
    /home/bilal/deps/coro-channel.lua:126: in function 'callback'
    /home/bilal/test.lua:5: in function 'write'
    /home/bilal/deps/coro-channel.lua:148: in function 'write'
    /home/bilal/test.lua:17: in function 'fn'
    [string "bundle:deps/require.lua"]:309: in function 'require'
    [string "bundle:/main.lua"]:128: in function <[string "bundle:/main.lua"]:20>
stack traceback:
    [C]: in function 'error'
    /home/bilal/deps/coro-channel.lua:16: in function 'assertResume'
    /home/bilal/deps/coro-channel.lua:126: in function 'callback'
    /home/bilal/test.lua:5: in function 'write'
    /home/bilal/deps/coro-channel.lua:148: in function 'write'
    /home/bilal/test.lua:17: in function 'fn'
    [string "bundle:deps/require.lua"]:309: in function 'require'
    [string "bundle:/main.lua"]:128: in function <[string "bundle:/main.lua"]:20>
stack traceback:
    [C]: in function 'error'
    [string "bundle:/deps/utils.lua"]:42: in function 'assertResume'
    [string "bundle:/init.lua"]:53: in function <[string "bundle:/init.lua"]:48>
    [C]: in function 'xpcall'
    [string "bundle:/init.lua"]:48: in function 'fn'
    [string "bundle:deps/require.lua"]:309: in function <[string "bundle:deps/require.lua"]:266>

This happens because the write call at coro-channel.lua#L148 returns immediately, the callback produced by wait() is also called immediately, this callback will attempt to resume the current thread, but the coroutine.yield line has not yet been executed (since write finished instantly).

This minimal example is basically what happens when a socket is wrapped with secure-socket , which will attempt to optimize empty writes by returning immediately (in the flush function over here it will execute the callback and return early).

Which leads to https/wss requests made by coro-net crash when code such as require'coro-http'.request("POST", "https://google.com", {}, "") is ran (secure-socket sees that we are trying to write an empty chunk for the payload, so it decides to just return early and immediately since such a write will be ignored by libuv anyways).

The solution suggested in this PR is to basically keep track of when the thread has yielded, and if the wait callback is called before we reach yield, then we don't bother with resuming current thread and just return, similar to what we did with coro-split.

fixes #303.

Bilal2453 commented 2 months ago

Also added a test to catch those cases.

Worth noting some... annoyances, while writing that test I noticed if the writer returns with return true, "this value should be ignored" coro-channel will return that the write has failed (the return of write() will be false indicating a failure) but at the same time it won't attempt to close the handle (since success is true). One part of the code thinks that it succeeded and doesn't attempt to close the socket, and the other part of the code simply checks if the second return is presented to treat it as an error.

Also, a write() could return failure indicated by either nil or false, it may returns either of the values depending on if the failure was picked up with the callback callback("this is an error") or with the write return return false, "this is an error".

Bilal2453 commented 2 months ago

As I expanded the tests I noticed another state-related bug, consider the following code:

local ssocket = {
  write = function(_, _, callback)
    callback('callback failed')
    return true
  end,

  is_closing = function()
    return false
  end 
}

local write = require("coro-channel").wrapWrite(ssocket)
assert(write(""))

The control flow of this goes something like this:

wait is called which resets and prepares the states and upvalues.
The wait() callback is instantly called.
err is set to "callback failed" since the callback failed.
write call now finishes, success is set to true (since the write call succeeded), err is set to nil.
the writer returns success, even though this actually failed.

Usually, the err value will be set by the write call itself (over in success, err = write(...), then it is reused for the callback returned error, but since in this case the callback is executed first this is no more the case.

This is now fixed by separating the call error from the callback error and tracked by two separate states.

Bilal2453 commented 2 weeks ago

This likely won't be the last fix we introduce for a problem like this. I also hate the complexity of this flow and it proved annoying as I got it wrong the first 3 attempts. What should the flow look like ideally if you happen to have an idea?

On Mon, 30 Sept 2024, 16:39 truemedian, @.***> wrote:

@.**** approved this pull request.

In general I dislike the state that this change introduces, which makes it significantly harder to reason about what is happening in the control flow through these functions.

However if this fixes the bug I'm fine with introducing it and writing it off for "clean up later".

— Reply to this email directly, view it on GitHub https://github.com/luvit/lit/pull/329#pullrequestreview-2337592261, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJDIIYGIZBFGFYZ3SKW2ETDZZFIARAVCNFSM6AAAAABL6MZ5LOVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDGMZXGU4TEMRWGE . You are receiving this because you authored the thread.Message ID: @.***>

Bilal2453 commented 1 week ago

@squeek502 should be ready for merge.

luvit / lit

fix: coro-channel attempts to resume running thread #329