msantos / tunctl

Erlang TUN/TAP interface
BSD 3-Clause "New" or "Revised" License
79 stars 23 forks source link

tuncer:send timeout #25

Open meox opened 1 year ago

meox commented 1 year ago

I don't understand why I got sporadic {error, timeout}, is this a known issue?

msantos commented 1 year ago

No, I haven't seen any reports of timeouts. It sounds as if a gen_server:call may be timing out.

Checking the code, sends should not timeout. For the send/2 code path:

Is it possible another gen_server:call is timing out?

msantos commented 1 year ago

The scheduler can suspend the caller to apply backpressure if using active mode and a lot of data is being written to the port. This could cause a call upstream to timeout.

The busy check can be bypassed using the force option to port_command/3. Apparently erlang distribution drivers set this flag so it makes sense for tunctl to do so too.

diff --git a/src/tuncer.erl b/src/tuncer.erl
index 6bda60e..7dbe7b0 100644
--- a/src/tuncer.erl
+++ b/src/tuncer.erl
@@ -744,7 +744,7 @@ handle_call({send, Data}, _From, #state{port = false, fd = FD} = State) ->
     {reply, Reply, State};
 handle_call({send, Data}, _From, #state{port = Port} = State) ->
     Reply =
-        try erlang:port_command(Port, Data) of
+        try erlang:port_command(Port, Data, [force]) of
             true ->
                 ok
         catch
meox commented 1 year ago

yes I'm using active mode!

I tried but I'm receiving not_sup:

Screenshot 2023-08-31 at 23 14 31

msantos commented 1 year ago

yes I'm using active mode!

I tried but I'm receiving not_sup:

Screenshot 2023-08-31 at 23 14 31

Thanks for testing. Sorry, there must be a problem with the tuncer tests: the tests passed for me with the change, I'll look into it.

Another option is bypassing both the gen_server and the port for writes: continue using active mode for reads but switch to writing directly to the device fd:

FD = tuncer:getfd(Ref), % calls into the gen_server
tuncer:write(FD, Data).
meox commented 1 year ago

just to be more precise: I'm using it from a common test

msantos commented 1 year ago

Busy Port Limits

open_port/2 has an option to disable the busy limits. The docs lists the fd driver as supported.

(Diff updated to include busy_limits_msgq)

diff --git a/src/tuncer.erl b/src/tuncer.erl
index 6bda60e..b44e13a 100644
--- a/src/tuncer.erl
+++ b/src/tuncer.erl
@@ -845,7 +845,7 @@ code_change(_OldVsn, State, _Extra) ->
 %%% Internal functions
 %%--------------------------------------------------------------------
 set_mode(active, FD) ->
-    open_port({fd, FD, FD}, [stream, binary]);
+    open_port({fd, FD, FD}, [stream, binary, {busy_limits_port, disabled}, {busy_limits_msgq, disabled}]);
 set_mode(passive, Port) ->
     try erlang:port_close(Port) of
         true ->

Tests and port_command([force])

About the tuncer tests and notsup, the tests are using {active,false} only. I'll add some tests covering active mode. Note: with active mode enabled, I still wasn't able to reproduce the notsup but using {busy_limits_port, disabled} looks like a better fix.

meox commented 1 year ago

it's no clear for me if there is some case where it's not safe to disable those limits

msantos commented 1 year ago

it's no clear for me if there is some case where it's not safe to disable those limits

Good point. My theory is tuntap devices have the same issues as writing a distribution driver. There can be a sort of priority inversion where the application is waiting because the scheduler is applying backpressure to the process performing the I/O. Similar to distribution drivers, the busy limits should be disabled either using port_command([force]) or open_port({fd, FD, FD}, [stream, binary, {busy_limits_port, disabled}, {busy_limits_msgq, disabled}]).

My guess is that the {error, timeout} you see is caused by a, e.g., gen_server:call/2 somewhere before the call to send. The default timeout is 5 seconds which is signifcant latency for a network. Disabling the busy limits might reduce the latency. But note other processes in the chain, like the tuncer gen_server, may also be suspended.

Without code to reproduce or a stack trace, that's just speculation. So I'll plan to do the following:

meox commented 1 year ago

good! thanks a lot for your support