msantos / procket

Erlang interface to low level socket operations
http://blog.listincomprehension.com/search/label/procket
BSD 3-Clause "New" or "Revised" License

Memory leak(?) #23

Closed: negativ closed this issue 9 years ago

negativ commented 9 years ago

This Erlang module accepts connections on a Unix-domain socket and handles each new client in an infinite loop (handle/1) in a new process. Each incoming message is an Erlang term {Function, Args} in ETF, prefixed with 4 bytes indicating the payload size. The client generates ~50 requests per second (each 50-100 bytes long).

When I run the server with 4-5 clients, its memory grows steadily (~2.5-3 MB per hour) and is never freed. When I replace all the procket code with gen_tcp, everything is fine and there are no memory leaks.

Is there a logic error in my code, or a memory leak in procket?

-module(hawk_uds).

%% API
-export([start_link/0,
         init/0]).
-export([accept/1, handle/1]).
-define(UDS_PATH, <<"/var/hawk/hawk.uds">>).

start_link() ->
  {ok, spawn_link(?MODULE, init, [])}.

init() ->
  file:delete(?UDS_PATH),
  {ok, Fd} = procket:socket(unix, stream, 0),
  Len = byte_size(?UDS_PATH),
  Sun = <<(procket:sockaddr_common(1, Len))/binary, ?UDS_PATH/binary, 0:((procket:unix_path_max() - Len) * 8)>>,
  ok = procket:bind(Fd, Sun),
  ok = procket:listen(Fd),
  accept(Fd).

accept(Fd) ->
  case procket:accept(Fd) of
    {error, eagain} ->
      timer:sleep(50); %% is there some scheduler-friendly accept/1 ?
    {ok, ClientFd} ->
      spawn_opt(?MODULE, handle, [ClientFd], [{fullsweep_after, 0}])
  end,
  accept(Fd).

handle(Sock) ->
  {ok, PackLen} = read_packet_len(Sock),
  {ok, Binary} = read_packet(Sock, PackLen),

  {Func, Args} = _Req = binary_to_term(Binary),

  Ret = erlang:apply(hawk, Func, Args),
  BinRet = term_to_binary(Ret),
  BinSize = byte_size(BinRet),

  case procket:write(Sock, <<BinSize:32/big-unsigned-integer, BinRet/binary>>) of
    {error, _} ->
      procket:close(Sock),
      erlang:exit({reason, connection_refused});
    _ ->
      ok
  end,

  handle(Sock).

read_packet_len(Fd) ->
  {ok, <<Len:32/big-unsigned-integer>>} = read_packet(Fd, 4, <<>>), {ok, Len}.

read_packet(Fd, Len) ->
  read_packet(Fd, Len, <<>>).

read_packet(_Fd, 0, Bin) ->
  {ok, Bin};
read_packet(Fd, Len, Bin) ->
  case procket:read(Fd, Len) of
    {error, eagain} ->
      {ok, read} = inert:poll(Fd),
      read_packet(Fd, Len, Bin);
    {ok, <<>>} ->
      procket:close(Fd),
      erlang:exit({reason, connection_refused});
    {ok, Data} when Len == erlang:byte_size(Data) ->
      {ok, <<Bin/binary, Data/binary>>}; %% complete packet; keep any earlier partial reads
    {ok, Data} ->
      read_packet(Fd, Len - byte_size(Data), <<Bin/binary, Data/binary>>)
  end.
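
For reference, the client frames each request roughly like this (a minimal sketch with a placeholder request; the real client is not shown here, and this reuses read_packet_len/1 and read_packet/2 from above):

call(Sock, Func, Args) ->
  %% ETF-encode the request and prepend the 4-byte big-endian size,
  %% which is the framing handle/1 expects on the server side.
  Req = term_to_binary({Func, Args}),
  ReqSize = byte_size(Req),
  case procket:write(Sock, <<ReqSize:32/big-unsigned-integer, Req/binary>>) of
    {error, Reason} ->
      {error, Reason};
    _ ->
      {ok, RespLen} = read_packet_len(Sock),
      {ok, Resp} = read_packet(Sock, RespLen),
      {ok, binary_to_term(Resp)}
  end.
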
msantos commented 9 years ago

@negativ thanks for letting me know about this and for the working code!

I suspect that if there's a leak, it will be in inert. I'll run some stress tests. Is your client doing many connect/send/disconnect cycles, or staying connected?

With procket, it's possible to leak fds. You might want to check:

lsof -p <pidofbeam>

In your example, inert needs to be started before being used. Otherwise, inert will return {error, closed}, the process will crash, and the fd will leak.

inert can also be used to block on the listening socket before the accept.

Here are some quick changes; I'm echoing back the data for testing:

--- hawk_uds.erl.orig   2015-07-22 15:53:29.216640062 -0400
+++ hawk_uds.erl    2015-07-22 15:53:59.206640055 -0400
@@ -12,4 +12,5 @@

 init() ->
+  inert:start(),
   file:delete(?UDS_PATH),
   {ok, Fd} = procket:socket(unix, stream, 0),
@@ -22,10 +23,17 @@

 accept(Fd) ->
+  case inert:poll(Fd) of
+    {ok,read} -> ok;
+    Error ->
+      procket:close(Fd),
+      erlang:exit({accept, Error})
+  end,
   case procket:accept(Fd) of
     {error, eagain} ->
-      timer:sleep(50); %% is there some scheduler-friendly accept/1 ?
+      ok;
     {ok, ClientFd} ->
       spawn_opt(?MODULE, handle, [ClientFd], [{fullsweep_after, 0}])
   end,
+    error_logger:info_report({accept, 3}),
   accept(Fd).

@@ -35,8 +43,9 @@
   {ok, Binary} = read_packet(Sock, PackLen),

-  {Func, Args} = _Req = binary_to_term(Binary),
+%  {Func, Args} = _Req = binary_to_term(Binary),

-  Ret = erlang:apply(hawk, Func, Args),
-  BinRet = term_to_binary(Ret),
+%  Ret = erlang:apply(hawk, Func, Args),
+%  BinRet = term_to_binary(Ret),
+  BinRet = Binary,
   BinSize = byte_size(BinRet),

@@ -52,5 +61,10 @@

 read_packet_len(Fd) ->
-  {ok, <<Len:32/big-unsigned-integer>>} = read_packet(Fd, 4, <<>>), {ok, Len}.
+  case read_packet(Fd, 4, <<>>) of
+    {ok, <<Len:32/big-unsigned-integer>>} -> {ok, Len};
+    _Error ->
+      procket:close(Fd),
+      erlang:exit({reason, read_packet_len})
+  end.

@@ -64,6 +78,11 @@
   case procket:read(Fd, Len) of
     {error, eagain} ->
-      {ok, read} = inert:poll(Fd),
-      read_packet(Fd, Len, Bin);
+      case inert:poll(Fd) of
+        {ok, read} ->
+          read_packet(Fd, Len, Bin);
+        _ ->
+          procket:close(Fd),
+          ok
+      end;
     {ok, <<>>} ->
       procket:close(Fd),
@@ -72,4 +91,7 @@
       {ok, Data};
     {ok, Data} ->
-      read_packet(Fd, Len - byte_size(Data), <<Bin/binary, Data/binary>>)
+      read_packet(Fd, Len - byte_size(Data), <<Bin/binary, Data/binary>>);
+    Error ->
+        error_logger:info_report({read_packet, Error}),
+        procket:close(Fd)
   end.
negativ commented 9 years ago

inert is started by the top-level supervisor. This code is part of a huge project, so I just posted an example of the code that causes the memory leak. The client connects to the server and stays connected, without reconnects, for ~12 hours. There are no fd leaks; the code is well tested and OK.

At first I thought my code was leaking refc binaries, so I did some crazy work rewriting the server logic. =) Finally, I decided to use gen_tcp in the same way as the inert + procket combination, and all the problems just went away. =)

msantos commented 9 years ago

So far I haven't been able to reproduce this. I'm testing with two echo servers:

https://gist.github.com/msantos/ba8c9e443da4058ac830

https://gist.github.com/msantos/fb0accb7ce0d2e657f86

https://gist.github.com/msantos/c9b6c459e10f44ab90c2

I started the echo servers and 2 clients:

% Erlang VM 1: gen_tcp
1> xecho:listen(9999).

% Erlang VM 2: procket/inert
1> iecho:listen(8888).

% Erlang VM 3: port 9999, 10 clients, 1 ms between requests, run forever
1> xt:start(9999, 10, 1, -1).

% Erlang VM 4: port 8888, 10 clients, 1 ms delay, run forever
1> xt:start(8888, 10, 1, -1).

After connecting, the client will send 100 bytes of data, wait for the response and sleep for 1 ms in a loop.
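
In outline, each client does something like the following (a sketch only; the actual code is in the xt gist above):

%% Connect once, then loop: send 100 bytes, wait for the echo, sleep.
client(Port, Delay) ->
  {ok, Sock} = gen_tcp:connect("localhost", Port,
                               [binary, {packet, raw}, {active, false}]),
  Data = binary:copy(<<"x">>, 100),
  loop(Sock, Data, Delay).

loop(Sock, Data, Delay) ->
  ok = gen_tcp:send(Sock, Data),
  {ok, _Echo} = gen_tcp:recv(Sock, byte_size(Data)),
  timer:sleep(Delay),
  loop(Sock, Data, Delay).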

I used pidstat to get a general idea of the size of the VMs. The result after running for a few hours:

# 23927 = xecho: gen_tcp
# 23983 = iecho: procket/inert
$ pidstat -I 60 -ru -p 23927 -p 23983

Average:      UID       PID    %usr %system  %guest    %CPU   CPU  Command
Average:     1000     23927   17.40   12.39    0.00    7.97     -  beam.smp
Average:     1000     23983   13.74   13.31    0.00    7.24     -  beam.smp

Average:      UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
Average:     1000     23927      0.05      0.00  778960  35336   0.22  beam.smp
Average:     1000     23983      0.00      0.00  784128  30032   0.18  beam.smp

I am going to run a few more tests, but let me know if you can think of anything I should try.

Have you tried using recon to see where the memory is going?

https://github.com/ferd/recon
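
For example, to see which processes hold the most memory and whether refc binaries are piling up:

1> recon:proc_count(memory, 10).  % top 10 processes by memory
2> recon:bin_leak(10).            % processes that drop the most binaries after a GC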

msantos commented 9 years ago

Same test with hawk_uds.erl modified to echo back the packets over a Unix socket:

# Erlang VM 1
1> hawk_uds:start_link("/tmp/t.s").

# Erlang VM 2: 50 clients, 1 ms delay between sends, run forever
1> xu:start("/tmp/t.s", 50, 1, -1).
msantos commented 9 years ago

Memory usage is stable with both the TCP and Unix socket servers.

Regarding your code, my guess is that a process mailbox is blowing up, probably from an unhandled error message. Use sys:get_status/1 or recon to see what is going on.
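
For example, to find the processes with the largest mailboxes (Pid standing in for a suspect process):

1> recon:proc_count(message_queue_len, 10).
2> erlang:process_info(Pid, message_queue_len).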

negativ commented 9 years ago

The problem is not a growing mailbox, because I tested a version that spawns a new process on every iteration of handle/1. I'll try to reproduce the problem with your test modules today.

negativ commented 9 years ago

OK, I found that the problem was leaked procket fds. At some point the supervisor stops inert, and all the clients that use procket + inert go to hell. =)

Sorry for the incorrect report.