Multiple interfaces peer problem

Hi Alejandro, I'm trying to use partisan on a development system with several lxd/lxc containers that use a bridge to connect with the host and the internet. In my dev scenario, one peer ('clockwork@max-a5') is on the host system with an IP address of 192.168.178.29 and a second peer ('eusebio@lora-dev') is in a container with an IP address of 10.131.114.208. Using disterl, the two nodes successfully ping each other. In the application, the two nodes exchange the partisan node_spec using an API call, and the server returns something like:

#{name => 'clockwork@max-a5',
  channels =>
      #{undefined =>
            #{monotonic => false,parallelism => 1,compression => false},
        data =>
            #{monotonic => false,parallelism => 1,compression => false},
        partisan_membership =>
            #{monotonic => false,parallelism => 1,compression => true}},
  listen_addrs => [#{port => 10201,ip => {127,0,0,1}}]}

Note that the listen address is 127.0.0.1.

When I try to connect them using partisan_pluggable_peer_service_manager, I get an ok as the return of the partisan_peer_service:join/1 call. However, a subsequent partisan_peer_service:connections/0 call returns an empty list:

(eusebio@lora-dev)9> partisan_peer_service:connections().
{ok,[]}

and the two nodes are not communicating over partisan. As partisan is using gen_tcp to implement that connectivity, I did a test from the shell of the nodes:

(clockwork@max-a5)3> {ok, Ls} = gen_tcp:listen(10205, [inet, binary, {ip, {192,168,178,29}}]).
{ok,#Port<0.64>}
(clockwork@max-a5)4> gen_tcp:accept(Ls).  % <- suspend waiting for a client

(eusebio@lora-dev)16> gen_tcp:connect('max-a5', 10205, [inet], 5000).
{ok,#Port<0.36>} % client connected!

(clockwork@max-a5)3>
{ok,#Port<0.65>} % server connected

Using the IP address for the server, all works as expected. Using the partisan settings, i.e. the host address 127.0.0.1, it fails with a connection refused error:

(clockwork@max-a5)5> {ok, Lsl} = gen_tcp:listen(10206, [inet, binary, {ip, {127,0,0,1}}]).
{ok,#Port<0.66>}
(clockwork@max-a5)6> gen_tcp:accept(Lsl). % suspend

(eusebio@lora-dev)18> gen_tcp:connect('max-a5', 10206, [inet], 5000).
{error,econnrefused}

Note that if the two peers are on the same host, then there is no issue using the 127.0.0.1 address.
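The difference between the two tests comes down to which local address the listening socket is bound to: a socket bound to 127.0.0.1 only accepts connections that arrive over the loopback interface. A minimal sketch for checking the bound address from the shell follows; the module and function names are made up for illustration, and only standard gen_tcp and inet calls are used.

```erlang
-module(bind_check).
-export([bound_address/1]).

%% Hypothetical helper: open a listening socket bound to IP on an
%% ephemeral port (port 0) and report the address it was bound to.
bound_address(IP) ->
    {ok, L} = gen_tcp:listen(0, [inet, binary, {ip, IP}]),
    {ok, {Addr, _Port}} = inet:sockname(L),
    ok = gen_tcp:close(L),
    Addr.
```

Calling bind_check:bound_address({127,0,0,1}) reports the loopback address, which a peer in another container cannot reach; binding to the bridge-facing address or to {0,0,0,0} avoids that.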

To see if changing the partisan configuration would fix things, I tried this setting on the host:

   {partisan, [
        {peer_port, 10201},
        {peer_host, {192,168,178,29}},
        {pid_encoding, false},
        {ref_encoding, false},
        {remote_ref_format, improper_list},
        {channels, [{data, #{parallelism => 1}}]},
        {partisan_peer_service_manager, partisan_pluggable_peer_service_manager}
        ]}

I was hoping that the underlying gen_tcp would behave the same way. However, in this configuration partisan crashes like this:

=CRASH REPORT==== 1-Oct-2023::10:49:30.165230 ===
  crasher:
    initial call: supervisor:partisan_acceptor_socket_pool_sup/1
    pid: <0.888.0>
    registered_name: []
    exception error: no function clause matching 
                     partisan_acceptor_socket_pool_sup:socket(#{port => 10201,
                                                                host =>
                                                                    {192,168,
                                                                     178,29}}) (/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl, line 77)
      in function  partisan_acceptor_socket_pool_sup:'-init/1-lc$^0/1-0-'/1 (/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl, line 64)
      in call from partisan_acceptor_socket_pool_sup:init/1 (/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl, line 64)
      in call from supervisor:init/1 (supervisor.erl, line 330)
      in call from gen_server:init_it/2 (gen_server.erl, line 962)
      in call from gen_server:init_it/6 (gen_server.erl, line 917)
    ancestors: [partisan_sup,<0.871.0>]
    message_queue_len: 0
    messages: []
    links: [<0.872.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 376
    stack_size: 28
    reductions: 179
  neighbours:

=SUPERVISOR REPORT==== 1-Oct-2023::10:49:30.178979 ===
    supervisor: {local,partisan_sup}
    errorContext: start_error
    reason: {function_clause,
                [{partisan_acceptor_socket_pool_sup,socket,
                     [#{port => 10201,host => {192,168,178,29}}],
                     [{file,
                          "/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl"},
                      {line,77}]},
                 {partisan_acceptor_socket_pool_sup,'-init/1-lc$^0/1-0-',1,
                     [{file,
                          "/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl"},
                      {line,64}]},
                 {partisan_acceptor_socket_pool_sup,init,1,
                     [{file,
                          "/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl"},
                      {line,64}]},
                 {supervisor,init,1,[{file,"supervisor.erl"},{line,330}]},
                 {gen_server,init_it,2,[{file,"gen_server.erl"},{line,962}]},
                 {gen_server,init_it,6,[{file,"gen_server.erl"},{line,917}]},
                 {proc_lib,init_p_do_apply,3,
                     [{file,"proc_lib.erl"},{line,241}]}]}
    offender: [{pid,undefined},
               {id,partisan_acceptor_socket_pool_sup},
               {mfargs,{partisan_acceptor_socket_pool_sup,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,20000},
               {child_type,supervisor}]

and line 77 of partisan_acceptor_socket_pool_sup.erl

socket(#{ip := IP, port := Port}) ->
    #{
        id => {partisan_acceptor_socket, IP, Port},
        start => {partisan_acceptor_socket, start_link, [IP, Port]}
    }.

makes me think that the ip key in the map is replaced by the host key, so ip is missing, which causes the crash. This looks like a configuration issue in lines 461-467 of partisan_config:

    %% Setup default listen addr.
    %% This will be part of the partisan:node_spec() which is the map
    DefaultAddr0 = #{port => get(peer_port)},

    DefaultAddr =
        case get(peer_host) of
            undefined ->
                DefaultAddr0#{ip => get(peer_ip)};
            Host ->
                DefaultAddr0#{host => Host}
        end,

If you agree, I would like to submit a PR where I replace line 466 with

      DefaultAddr0#{ip => Host, host => Host}

just to try to not break things. What I still can't explain is why the join call returns ok but the connection is not established. This might be a much worse issue.
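The function_clause error itself can be reproduced without Partisan. The sketch below is a stand-alone illustration, not Partisan's code: the module name is hypothetical and the function only mirrors the shape of the socket/1 clause quoted above, which requires both an ip and a port key.

```erlang
-module(listen_addr_demo).
-export([socket_id/1]).

%% Mirrors the head of socket/1 quoted above: the := operator only
%% matches maps that contain BOTH the ip and port keys.
socket_id(#{ip := IP, port := Port}) ->
    {partisan_acceptor_socket, IP, Port}.
```

A map such as #{port => 10201, host => {192,168,178,29}} has no ip key, so no clause matches and the call fails with function_clause, which is the same failure shape as in the crash report.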

Hi Massimo,

First of all thanks for trying out Partisan.

So based on the above I think that peer_host is broken. But you should be able to get it going using peer_ip in the meantime:

   {partisan, [
        {peer_port, 10201},
        {peer_ip, {192,168,178,29}},
        {pid_encoding, false},
        {ref_encoding, false},
        {remote_ref_format, improper_list},
        {channels, [{data, #{parallelism => 1}}]},
        {partisan_peer_service_manager, partisan_pluggable_peer_service_manager}
        ]}

I think the reason why Partisan defaults to the loopback interface address is that inet:getaddr("max-a5", inet) cannot resolve to {192,168,178,29}.
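One way to check what the hostname resolves to on the host node is to call the standard inet:getaddr/2 function for both address families. This is only a diagnostic sketch; the module name is hypothetical, and whether Partisan performs exactly this lookup internally is an assumption.

```erlang
-module(resolve_check).
-export([lookup/1]).

%% Hypothetical helper: return what the local resolver gives for
%% Host in the IPv4 (inet) and IPv6 (inet6) families.
lookup(Host) ->
    [{Family, inet:getaddr(Host, Family)} || Family <- [inet, inet6]].
```

Running resolve_check:lookup("max-a5") on the host shows whether the name maps to the bridge-facing address or to a loopback address, for example because of an /etc/hosts entry.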

Looking into this I also noticed we are not supporting IP resolution for local or IPv6 (just IPv4), so if you don't mind I will tackle that and the peer_host issue in my next commit.

Hi Massimo,

After thoroughly examining the API and configuration options, it is clear that some improvements are needed.

Solution

The peer_host option will be deprecated, as it has never really worked.

Currently, it is not possible, via configuration, to pass multiple listened_addr() objects. Therefore, the best option would be to introduce a new option called listen_addrs that gives the user full control.

However, if listen_addrs is undefined, we could build it from peer_ip and peer_port. If those are also undefined, we can extract the IP address from the host by examining the host component of the Erlang nodename and generate a random port (which we already do).
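The fallback order described above could look roughly like the following sketch. It is not Partisan's implementation: the module, the plain map standing in for the application environment, and the per-key fallback for peer_ip and peer_port are all assumptions made for illustration.

```erlang
-module(listen_addr_fallback).
-export([resolve/2]).

%% Hypothetical sketch. Config is a plain map standing in for the
%% partisan application environment; Node is the Erlang node name.
resolve(Config, Node) ->
    case maps:get(listen_addrs, Config, undefined) of
        undefined ->
            IP = case maps:get(peer_ip, Config, undefined) of
                     undefined -> ip_from_node(Node);
                     Configured -> Configured
                 end,
            Port = case maps:get(peer_port, Config, undefined) of
                       undefined -> random_port();
                       P -> P
                   end,
            [#{ip => IP, port => Port}];
        Addrs ->
            Addrs
    end.

%% Resolve the host part of name@host to an IPv4 address.
ip_from_node(Node) ->
    [_Name, Host] = string:split(atom_to_list(Node), "@"),
    {ok, IP} = inet:getaddr(Host, inet),
    IP.

%% Ask the OS for a free port by binding to port 0, then release it.
random_port() ->
    {ok, S} = gen_tcp:listen(0, []),
    {ok, Port} = inet:port(S),
    ok = gen_tcp:close(S),
    Port.
```

With this shape, an explicit listen_addrs always wins, and the node name is only consulted when nothing else is configured.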

Using the new listen_addrs option, you could configure Partisan in the following way:

{partisan, [
    ...
    {listen_addrs, [
            "127.0.0.1:12345",
            #{ip => {127, 0, 0, 1}, port => 12345},
            #{ip => "127.0.0.1", port => "12345"},
            {{127, 0, 0, 1}, 12345},
            {"127.0.0.1", "12345"}
    ]}
]}.

Notice I added the same address 5 times, but partisan_config will deduplicate them, leaving just #{ip => {127, 0, 0, 1}, port => 12345}.
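To make the deduplication concrete, here is a minimal sketch of how the five forms above could be normalised to a single map. It is an illustration under assumptions, not the partisan_config code: the module and function names are hypothetical, and only IPv4 "Host:Port" strings are handled.

```erlang
-module(listen_addr_sketch).
-export([normalise/1, normalise_all/1]).

%% Normalise every entry, then sort and drop duplicates.
normalise_all(Addrs) ->
    lists:usort([normalise(A) || A <- Addrs]).

normalise(#{ip := IP, port := Port}) ->
    #{ip => to_ip(IP), port => to_port(Port)};
normalise({IP, Port}) ->
    #{ip => to_ip(IP), port => to_port(Port)};
normalise(Str) when is_list(Str) ->
    %% "IP:Port" form; split on the last colon (IPv4 only here).
    case string:split(Str, ":", trailing) of
        [IP, Port] -> #{ip => to_ip(IP), port => to_port(Port)};
        _ -> error({badarg, Str})
    end.

to_ip(IP) when is_tuple(IP) -> IP;
to_ip(IP) when is_list(IP) ->
    {ok, Addr} = inet:parse_address(IP),
    Addr.

to_port(Port) when is_integer(Port) -> Port;
to_port(Port) when is_list(Port) -> list_to_integer(Port).
```

The key point is that both the address and the port are converted to the tuple and integer forms before anything is handed to gen_tcp, which expects an integer port.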

The table below shows the result for different combinations of the vm.args -name flag and the partisan name and peer_ip configuration options, while running on clockwork@max-a5 and when listen_addrs is not defined.

Case	vm.args -name	name option	peer_ip option	→ partisan:node()	→ NodeSpec#listen_addrs.ip
1	undefined	undefined	undefined	aaf5b484-6067-11ee-84a9-fc3b1385cd4e@192.168.178.20	{192,168,178,29}
2	undefined	undefined	{127,0,0,1}	aaf5b484-6067-11ee-84a9-fc3b1385cd4e@127.0.0.1	{127,0,0,1}
3	undefined	clockwork@max-a5	undefined	clockwork@max-a5	{192,168,178,29}
4	clockwork@max-a5	ignored if defined	undefined	clockwork@max-a5	{192,168,178,29}
5	clockwork@127.0.0.1	ignored if defined	undefined	clockwork@127.0.0.1	{127,0,0,1}
6	clockwork@max-a5	ignored if defined	undefined	clockwork@max-a5	{192,168,178,29}
7	clockwork@192.168.178.29	ignored if defined	undefined		{192,168,178,29}
8	clockwork@max-a5	ignored if defined	127.0.0.1	clockwork@max-a5	{127,0,0,1}
9	clockwork@127.0.0.1	ignored if defined	127.0.0.1	clockwork@127.0.0.1	{127,0,0,1}

Peer Discovery

You can use the partisan_peer_discovery_agent, which currently offers the dns and list strategies, or you can implement your own custom behavior.

Here is an example of the list agent (partisan_peer_discovery_list):

To instruct eusebio@lora-dev to connect with clockwork on startup, add the following to the sys.config file.

{partisan, [
    ...

    {peer_discovery, #{
        enabled => true,
        type => partisan_peer_discovery_list,
        initial_delay => 5000,
        polling_interval => 30000,
        timeout => 5000,
        config => #{
            addresses => [
                {'clockwork@max-a5', 10201}
            ]
        }
    }}
]}.
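Once discovery is enabled, one way to confirm from the shell that the peers actually connected is to poll partisan_peer_service:connections/0 until it stops returning an empty list. The helper below is a hypothetical sketch: the module name, interval, and retry count are illustrative, and the shape of a non-empty result is assumed to be {ok, List}, by analogy with the {ok,[]} shown earlier in this thread.

```erlang
-module(wait_connected).
-export([wait/2]).

%% Hypothetical helper: poll until at least one connection shows up
%% or the retries run out.
wait(_IntervalMs, 0) ->
    {error, timeout};
wait(IntervalMs, Retries) when Retries > 0 ->
    case partisan_peer_service:connections() of
        {ok, []} ->
            timer:sleep(IntervalMs),
            wait(IntervalMs, Retries - 1);
        {ok, Conns} ->
            {ok, Conns}
    end.
```

For example, wait_connected:wait(1000, 40) would cover the 5000 ms initial_delay and the first 30000 ms polling interval from the configuration above.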

I have already implemented the changes mentioned above and am conducting further tests before committing. Please let me know if everything makes sense and if it provides a better experience.

Hi Alejandro, the new approach works pretty well! The only problem I had was when specifying a single address in the form:

        {listen_addrs, [
                #{ip => "192.168.178.29", port => "10200"}
            ]},

which still crashed partisan:

                  {partisan,
                   {{error,
                     {shutdown,
                      {failed_to_start_child,
                       partisan_acceptor_socket_pool_sup,
                       {shutdown,
                        {failed_to_start_child,
                         {partisan_acceptor_socket,"192.168.178.29","10200"},
                         {function_clause,
                          [{inet_tcp,getserv,
                            ["10200"],
                            [{file,"inet_tcp.erl"},{line,58}]},
                           {gen_tcp,listen,2,
                            [{file,"gen_tcp.erl"},{line,279}]},
                           {partisan_acceptor_socket,init,1,
                            [{file,
                              "/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket.erl"},
                             {line,54}]},
                           {gen_server,init_it,2,
                            [{file,"gen_server.erl"},{line,962}]},
                           {gen_server,init_it,6,
                            [{file,"gen_server.erl"},{line,917}]},
                           {proc_lib,init_p_do_apply,3,
                            [{file,"proc_lib.erl"},{line,241}]}]}}}}}},

However, this works fine:

        {listen_addrs, [
                #{ip => {192, 168, 178, 29}, port => 10200}
            ]},

Re peer discovery, I'm going to test the partisan_peer_discovery_dns method, as it seems more flexible than the static list used by partisan_peer_discovery_list: the port(s) used with partisan might change depending on the security setup (firewall, reverse proxy and HA) of a multi-data-center network.

Thank you very much for your support!

Notice I haven't committed those changes yet 😊. I'm doing some extra checks, but I should be able to commit in the next couple of hours. I'll let you know.

With these changes, Partisan will accept the different listen address formats mentioned above.

BTW, not sure you've noticed, but v5 offers OTP compatibility at the cost of using the partisan_ prefix for the typical module names, e.g. partisan_gen_server (the exception being partisan_gen_supervisor).

They all use partisan_monitor (the API is available via the partisan module, which mimics the erlang module).

@mcesaro I have just published v5.0.0-rc.8 with the changes above.

Hi Alejandro, all my tests passed! Thanks.

lasp-lang / partisan

Multiple interfaces peer problem #250
