tuncer:send timeout - Githubissues

meox commented 1 year ago

I don't understand why I got sporadic {error, timeout}, is this a known issue?

msantos commented 1 year ago

No, I haven't seen any reports of timeouts. It sounds as if a gen_server:call may be timing out.

Checking the code, sends should not timeout. For the send/2 code path:

send/2 has a timeout set to infinity
depending on the tuntap device active mode, either:
- {active, false}: the data is written directly to the non-blocking file descriptor using write(2)
- {active, true | integer()}: the data is written to a port holding the fd using erlang:port_command/2
according to the docs, the only error returned by port_command/2 is badarg

Is it possible another gen_server:call is timing out?

msantos commented 1 year ago

The scheduler can suspend the caller to apply backpressure if using active mode and a lot of data is being written to the port. This could cause a call upstream to timeout.

The busy check can be bypassed using the force option to port_command/3. Apparently erlang distribution drivers set this flag so it makes sense for tunctl to do so too.

diff --git a/src/tuncer.erl b/src/tuncer.erl
index 6bda60e..7dbe7b0 100644
--- a/src/tuncer.erl
+++ b/src/tuncer.erl
@@ -744,7 +744,7 @@ handle_call({send, Data}, _From, #state{port = false, fd = FD} = State) ->
     {reply, Reply, State};
 handle_call({send, Data}, _From, #state{port = Port} = State) ->
     Reply =
-        try erlang:port_command(Port, Data) of
+        try erlang:port_command(Port, Data, [force]) of
             true ->
                 ok
         catch

meox commented 1 year ago

yes I'm using active mode!

I tried but I'm receiving not_sup:

Screenshot 2023-08-31 at 23 14 31

msantos commented 1 year ago

yes I'm using active mode!

I tried but I'm receiving not_sup:

Thanks for testing. Sorry, there must be a problem with the tuncer tests: the tests passed for me with the change, I'll look into it.

Another option is bypassing both the gen_server and the port for writes: continue using active mode for reads but switch to writing directly to the device fd:

FD = tuncer:getfd(Ref), % calls into the gen_server
tuncer:write(FD, Data).

meox commented 1 year ago

just to be more precise: I'm using it from a common test

msantos commented 1 year ago

Busy Port Limits

open_port/2 has an option to disable the busy limits. The docs lists the fd driver as supported.

(Diff updated to include busy_limits_msgq)

diff --git a/src/tuncer.erl b/src/tuncer.erl
index 6bda60e..b44e13a 100644
--- a/src/tuncer.erl
+++ b/src/tuncer.erl
@@ -845,7 +845,7 @@ code_change(_OldVsn, State, _Extra) ->
 %%% Internal functions
 %%--------------------------------------------------------------------
 set_mode(active, FD) ->
-    open_port({fd, FD, FD}, [stream, binary]);
+    open_port({fd, FD, FD}, [stream, binary, {busy_limits_port, disabled}, {busy_limits_msgq, disabled}]);
 set_mode(passive, Port) ->
     try erlang:port_close(Port) of
         true ->

Tests and `port_command([force])`

About the tuncer tests and notsup, the tests are using {active,false} only. I'll add some tests covering active mode. Note: with active mode enabled, I still wasn't able to reproduce the notsup but using {busy_limits_port, disabled} looks like a better fix.

meox commented 1 year ago

it's no clear for me if there is some case where it's not safe to disable those limits

msantos commented 1 year ago

it's no clear for me if there is some case where it's not safe to disable those limits

Good point. My theory is tuntap devices have the same issues as writing a distribution driver. There can be a sort of priority inversion where the application is waiting because the scheduler is applying backpressure to the process performing the I/O. Similar to distribution drivers, the busy limits should be disabled either using port_command([force]) or open_port({fd, FD, FD}, [stream, binary, {busy_limits_port, disabled}, {busy_limits_msgq, disabled}]).

My guess is that the {error, timeout} you see is caused by a, e.g., gen_server:call/2 somewhere before the call to send. The default timeout is 5 seconds which is signifcant latency for a network. Disabling the busy limits might reduce the latency. But note other processes in the chain, like the tuncer gen_server, may also be suspended.

Without code to reproduce or a stack trace, that's just speculation. So I'll plan to do the following:

add tests covering active mode
write a stress test to see if
- the {error, timeout} is reproducible
- disabling the busy limits has any effect
if disabling busy limits is useful, decide whether to enable or disable by default either by a new option or allowing open_port options to be passed in directly to create/2

meox commented 1 year ago

good! thanks a lot for your support

msantos / tunctl

tuncer:send timeout #25

Busy Port Limits

Tests and `port_command([force])`

msantos / tunctl

tuncer:send timeout #25

Busy Port Limits

Tests and port_command([force])

Tests and `port_command([force])`