lasp-lang / partisan

High-performance, high-scalability distributed computing for the BEAM.
https://partisan.dev
Apache License 2.0

Multiple interfaces peer problem #250

Closed mcesaro closed 1 year ago

mcesaro commented 1 year ago

Hi Alejandro, I'm trying to use partisan on a development system with several lxd/lxc containers that use a bridge to connect to the host and the internet. In my dev scenario, one peer ('clockwork@max-a5') is on the host system with IP address 192.168.178.29, and a second peer ('eusebio@lora-dev') is in a container with IP address 10.131.114.208. Using disterl, the two nodes successfully ping each other. In the application, the two nodes exchange the partisan node_spec using an API call, and the server returns something like:

#{name => 'clockwork@max-a5',
  channels =>
      #{undefined =>
            #{monotonic => false,parallelism => 1,compression => false},
        data =>
            #{monotonic => false,parallelism => 1,compression => false},
        partisan_membership =>
            #{monotonic => false,parallelism => 1,compression => true}},
  listen_addrs => [#{port => 10201,ip => {127,0,0,1}}]}

Note that the listen address is 127.0.0.1.

When I try to connect them using partisan_pluggable_peer_service_manager, I get ok as the return value of the partisan_peer_service:join/1 call. However, a subsequent partisan_peer_service:connections/0 call returns an empty list

(eusebio@lora-dev)9> partisan_peer_service:connections().
{ok,[]}

and the two nodes are not communicating over partisan. As partisan is using gen_tcp to implement that connectivity, I did a test from the shell of the nodes:

(clockwork@max-a5)3> {ok, Ls} = gen_tcp:listen(10205, [inet, binary, {ip, {192,168,178,29}}]).
{ok,#Port<0.64>}
(clockwork@max-a5)4> gen_tcp:accept(Ls).  % <- suspend waiting for a client

(eusebio@lora-dev)16> gen_tcp:connect('max-a5', 10205, [inet], 5000).
{ok,#Port<0.36>} % client connected!

(clockwork@max-a5)3>
{ok,#Port<0.65>} % server connected

Using the server's IP address, everything works as expected. Using the partisan settings, i.e. the listen address 127.0.0.1, it fails with a connection refused error:

(clockwork@max-a5)5> {ok, Lsl} = gen_tcp:listen(10206, [inet, binary, {ip, {127,0,0,1}}]).
{ok,#Port<0.66>}
(clockwork@max-a5)6> gen_tcp:accept(Lsl). % suspend

(eusebio@lora-dev)18> gen_tcp:connect('max-a5', 10206, [inet], 5000).
{error,econnrefused}

Note that if the two peers are on the same host, then there is no issue using the 127.0.0.1 address.

To see if changing the partisan configuration would fix things, I tried this setting on the host:
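A socket bound to the loopback address only accepts connections that originate on the same host, which is consistent with the econnrefused seen from the container. As a minimal sketch (the module name non_loopback_addrs is made up for illustration and is not part of Partisan), the standard inet:getifaddrs/0 call can list the IPv4 addresses that a remote peer could actually reach:

```erlang
-module(non_loopback_addrs).
-export([list/0, from_ifaddrs/1]).

%% Returns the IPv4 addresses of all interfaces that are up and not loopback.
list() ->
    {ok, IfAddrs} = inet:getifaddrs(),
    from_ifaddrs(IfAddrs).

%% Pure helper, so the filtering can be checked without touching the network.
%% IfAddrs has the shape returned by inet:getifaddrs/0.
from_ifaddrs(IfAddrs) ->
    [Addr || {_IfName, Opts} <- IfAddrs,
             usable(proplists:get_value(flags, Opts, [])),
             %% Keep only IPv4 (4-tuple) addresses; other options are skipped.
             {addr, Addr = {_, _, _, _}} <- Opts].

usable(Flags) ->
    lists:member(up, Flags) andalso not lists:member(loopback, Flags).
```

Calling non_loopback_addrs:list() on the host should include the bridge-facing address if that interface is up; any of the returned addresses is a candidate for the listen address.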

   {partisan, [
        {peer_port, 10201},
        {peer_host, {192,168,178,29}},
        {pid_encoding, false},
        {ref_encoding, false},
        {remote_ref_format, improper_list},
        {channels, [{data, #{parallelism => 1}}]},
        {partisan_peer_service_manager, partisan_pluggable_peer_service_manager}
        ]}

I was hoping that the underlying gen_tcp call would behave the same way. However, with this configuration partisan crashes like this:

=CRASH REPORT==== 1-Oct-2023::10:49:30.165230 ===
  crasher:
    initial call: supervisor:partisan_acceptor_socket_pool_sup/1
    pid: <0.888.0>
    registered_name: []
    exception error: no function clause matching 
                     partisan_acceptor_socket_pool_sup:socket(#{port => 10201,
                                                                host =>
                                                                    {192,168,
                                                                     178,29}}) (/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl, line 77)
      in function  partisan_acceptor_socket_pool_sup:'-init/1-lc$^0/1-0-'/1 (/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl, line 64)
      in call from partisan_acceptor_socket_pool_sup:init/1 (/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl, line 64)
      in call from supervisor:init/1 (supervisor.erl, line 330)
      in call from gen_server:init_it/2 (gen_server.erl, line 962)
      in call from gen_server:init_it/6 (gen_server.erl, line 917)
    ancestors: [partisan_sup,<0.871.0>]
    message_queue_len: 0
    messages: []
    links: [<0.872.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 376
    stack_size: 28
    reductions: 179
  neighbours:

=SUPERVISOR REPORT==== 1-Oct-2023::10:49:30.178979 ===
    supervisor: {local,partisan_sup}
    errorContext: start_error
    reason: {function_clause,
                [{partisan_acceptor_socket_pool_sup,socket,
                     [#{port => 10201,host => {192,168,178,29}}],
                     [{file,
                          "/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl"},
                      {line,77}]},
                 {partisan_acceptor_socket_pool_sup,'-init/1-lc$^0/1-0-',1,
                     [{file,
                          "/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl"},
                      {line,64}]},
                 {partisan_acceptor_socket_pool_sup,init,1,
                     [{file,
                          "/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket_pool_sup.erl"},
                      {line,64}]},
                 {supervisor,init,1,[{file,"supervisor.erl"},{line,330}]},
                 {gen_server,init_it,2,[{file,"gen_server.erl"},{line,962}]},
                 {gen_server,init_it,6,[{file,"gen_server.erl"},{line,917}]},
                 {proc_lib,init_p_do_apply,3,
                     [{file,"proc_lib.erl"},{line,241}]}]}
    offender: [{pid,undefined},
               {id,partisan_acceptor_socket_pool_sup},
               {mfargs,{partisan_acceptor_socket_pool_sup,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,20000},
               {child_type,supervisor}]

and line 77 of partisan_acceptor_socket_pool_sup.erl

socket(#{ip := IP, port := Port}) ->
    #{
        id => {partisan_acceptor_socket, IP, Port},
        start => {partisan_acceptor_socket, start_link, [IP, Port]}
    }.

makes me think that the ip key in the map is replaced by the host key, so ip is missing, which causes the crash. This looks like a configuration issue in lines 461-467 of partisan_config:

    %% Setup default listen addr.
    %% This will be part of the partisan:node_spec() which is the map
    DefaultAddr0 = #{port => get(peer_port)},

    DefaultAddr =
        case get(peer_host) of
            undefined ->
                DefaultAddr0#{ip => get(peer_ip)};
            Host ->
                DefaultAddr0#{host => Host}
        end,

If you agree, I would like to submit a PR where I replace line 466 with

      DefaultAddr0#{ip => Host, host => Host}

just to try not to break things. What I still can't explain is why the join call returns ok even though the connection is not established. This might be a much worse issue.
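The function_clause crash described above can be reproduced in isolation. The sketch below is a standalone module (socket_clause_demo and try_socket/1 are hypothetical names, not Partisan code) with a single clause that, like the one at line 77, requires both the ip and port keys:

```erlang
-module(socket_clause_demo).
-export([socket/1, try_socket/1]).

%% Same map pattern as the clause quoted from partisan_acceptor_socket_pool_sup:
%% both ip and port must be present for the clause to match.
socket(#{ip := IP, port := Port}) ->
    {IP, Port}.

%% Wraps socket/1 so the failure is returned as a value instead of crashing.
try_socket(Addr) ->
    try socket(Addr) of
        Result -> {ok, Result}
    catch
        error:function_clause -> {error, function_clause}
    end.
```

A map holding only host and port yields {error, function_clause}, while a map that carries both ip and host (as in the proposed one-line change) matches, because extra keys are ignored by map patterns.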

aramallo commented 1 year ago

Hi Massimo,

First of all thanks for trying out Partisan.

So based on the above I think that peer_host is broken. But you should be able to get it going using peer_ip in the meantime.

   {partisan, [
        {peer_port, 10201},
        {peer_ip, {192,168,178,29}},
        {pid_encoding, false},
        {ref_encoding, false},
        {remote_ref_format, improper_list},
        {channels, [{data, #{parallelism => 1}}]},
        {partisan_peer_service_manager, partisan_pluggable_peer_service_manager}
        ]}

I think the reason why Partisan defaults to the loopback interface address is that inet:getaddr("max-a5", inet) cannot resolve to {192,168,178,29}.
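Whether a hostname resolves to a loopback address can be checked directly from the shell with the standard resolver call inet:getaddr/2. A minimal sketch follows; the module and function names are hypothetical, and treating all of 127.0.0.0/8 as loopback is an assumption made here for illustration:

```erlang
-module(resolve_check).
-export([is_loopback/1, resolves_to_loopback/1]).

%% Any address in 127.0.0.0/8 is considered loopback in this sketch.
is_loopback({127, _, _, _}) -> true;
is_loopback(_) -> false.

%% Resolves Host (string, atom or IP tuple) to IPv4 and reports whether
%% the result is a loopback address.
resolves_to_loopback(Host) ->
    case inet:getaddr(Host, inet) of
        {ok, IP} -> {ok, IP, is_loopback(IP)};
        {error, Reason} -> {error, Reason}
    end.
```

Running resolve_check:resolves_to_loopback("max-a5") on the host shows what the name resolves to there; a loopback result could come, for example, from a local hosts-file entry.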

Looking into this I also noticed we are not supporting IP resolution for local or IPv6 (just IPv4), so if you don't mind I will tackle that and the peer_host issue in my next commit.

aramallo commented 1 year ago

Hi Massimo,

After thoroughly examining the API and configuration options, it is clear that some improvements are needed.

Solution

The peer_host option will be deprecated, as it has never really worked.

Currently, it is not possible, via configuration, to pass multiple listened_addr() objects. Therefore, the best option would be to introduce a new option called listen_addrs that gives the user full control.

However, if listen_addrs is undefined, we could build it from peer_ip and peer_port. If those are also undefined, we can extract the IP address from the host by examining the host component of the Erlang nodename and generate a random port (which we already do).
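The fallback order described above can be sketched as follows. This is not Partisan's implementation: the module name, the Opts map and the way a free port is obtained (binding to port 0 and asking the OS which port was assigned) are all assumptions made for illustration.

```erlang
-module(listen_addrs_fallback).
-export([resolve/2, host_of/1]).

%% Opts: hypothetical map holding the already-read configuration values.
%% Node: the Erlang node name, e.g. 'clockwork@max-a5'.

%% 1. An explicit, non-empty listen_addrs wins.
resolve(#{listen_addrs := [_ | _] = Addrs}, _Node) ->
    Addrs;
%% 2. Otherwise build a single address from peer_ip and peer_port.
resolve(#{peer_ip := IP} = Opts, _Node) when IP =/= undefined ->
    [#{ip => IP, port => port(Opts)}];
%% 3. Otherwise resolve the host part of the node name.
resolve(Opts, Node) ->
    {ok, IP} = inet:getaddr(host_of(Node), inet),
    [#{ip => IP, port => port(Opts)}].

port(Opts) ->
    case maps:get(peer_port, Opts, undefined) of
        undefined -> free_port();
        Port -> Port
    end.

%% Assumption: let the OS pick a free port by binding to port 0.
%% The port could be taken by another process after the socket is closed.
free_port() ->
    {ok, Socket} = gen_tcp:listen(0, []),
    {ok, Port} = inet:port(Socket),
    ok = gen_tcp:close(Socket),
    Port.

%% Host component of a node name, as a string.
host_of(Node) ->
    [_Name, Host] = string:split(atom_to_list(Node), "@"),
    Host.
```

With this ordering, a user who sets nothing still gets an address derived from the node name, while listen_addrs remains the only way to declare more than one address.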

Using the new listen_addrs option, you could configure Partisan in the following way:

{partisan, [
    ...
    {listen_addrs, [
            "127.0.0.1:12345",
            #{ip => {127, 0, 0, 1}, port => 12345},
            #{ip => "127.0.0.1", port => "12345"},
            {{127, 0, 0, 1}, 12345},
            {"127.0.0.1", "12345"}
  ]}
]}.

Notice that I added the same address 5 times, but partisan_config will dedup them, leaving just #{ip => {127, 0, 0, 1}, port => 12345}.

The table below shows the result for different combinations of the vm.args -name flag and the partisan name and peer_ip configuration options, while running on clockwork@max-a5 and when listen_addrs is not defined.
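To illustrate how the five notations can collapse into one entry, here is a minimal normalisation sketch. It is not the partisan_config code; the module name listen_addrs_sketch and its helpers are hypothetical, and it handles IPv4 "ip:port" strings only.

```erlang
-module(listen_addrs_sketch).
-export([normalise/1]).

%% Converts every supported notation to #{ip => tuple(), port => integer()}
%% and removes duplicates (lists:usort/1 sorts and dedups).
normalise(Addrs) ->
    lists:usort([to_map(Addr) || Addr <- Addrs]).

to_map(#{ip := IP, port := Port}) ->
    #{ip => ip(IP), port => port(Port)};
to_map({IP, Port}) ->
    #{ip => ip(IP), port => port(Port)};
to_map(Str) when is_list(Str) ->
    %% Split at the last ":" so the port is always the final component.
    [IP, Port] = string:split(Str, ":", trailing),
    #{ip => ip(IP), port => port(Port)}.

ip(IP) when is_tuple(IP) -> IP;
ip(Str) when is_list(Str) ->
    {ok, IP} = inet:parse_address(Str),
    IP.

port(Port) when is_integer(Port) -> Port;
port(Str) when is_list(Str) -> list_to_integer(Str).
```

Passing the five-element list from the configuration above to listen_addrs_sketch:normalise/1 yields a single map, because all entries become equal once the IP is a tuple and the port is an integer.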

| Case | vm.args -name | name option | peer_ip option | partisan:node() | NodeSpec#listen_addrs.ip |
|------|---------------|-------------|----------------|-----------------|--------------------------|
| 1 | undefined | undefined | undefined | aaf5b484-6067-11ee-84a9-fc3b1385cd4e@192.168.178.20 | {192,168,178,29} |
| 2 | undefined | undefined | {127,0,0,1} | aaf5b484-6067-11ee-84a9-fc3b1385cd4e@127.0.0.1 | {127,0,0,1} |
| 3 | undefined | clockwork@max-a5 | undefined | clockwork@max-a5 | {192,168,178,29} |
| 4 | clockwork@max-a5 | ignored if defined | undefined | clockwork@max-a5 | {192,168,178,29} |
| 5 | clockwork@127.0.0.1 | ignored if defined | undefined | clockwork@127.0.0.1 | {127,0,0,1} |
| 6 | clockwork@max-a5 | ignored if defined | undefined | clockwork@max-a5 | {192,168,178,29} |
| 7 | clockwork@192.168.178.29 | ignored if defined | undefined | | {192,168,178,29} |
| 8 | clockwork@max-a5 | ignored if defined | 127.0.0.1 | clockwork@max-a5 | {127,0,0,1} |
| 9 | clockwork@127.0.0.1 | ignored if defined | 127.0.0.1 | clockwork@127.0.0.1 | {127,0,0,1} |

Peer Discovery

You can use the partisan_peer_discovery_agent, which currently offers the dns and list strategies, or you can implement your own custom behavior.

Here is an example of the list agent (partisan_peer_discovery_list):

To instruct eusebio@lora-dev to connect with clockwork on startup, add the following to the sys.config file.

{partisan, [
    ...

    {peer_discovery, #{
        enabled => true,
        type => partisan_peer_discovery_list,
        initial_delay => 5000,
        polling_interval => 30000,
        timeout => 5000,
        config => #{
            addresses => [
                {'clockwork@max-a5', 10201}
            ]
        }
    }}
]}.

I have already implemented the changes mentioned above and am conducting further tests before committing. Please let me know if everything makes sense and if it provides a better experience.

mcesaro commented 1 year ago

Hi Alejandro, the new approach works pretty well! The only problem I had was when specifying a single address in the form:

        {listen_addrs, [
                #{ip => "192.168.178.29", port => "10200"}
            ]},

which still crashed partisan:

                  {partisan,
                   {{error,
                     {shutdown,
                      {failed_to_start_child,
                       partisan_acceptor_socket_pool_sup,
                       {shutdown,
                        {failed_to_start_child,
                         {partisan_acceptor_socket,"192.168.178.29","10200"},
                         {function_clause,
                          [{inet_tcp,getserv,
                            ["10200"],
                            [{file,"inet_tcp.erl"},{line,58}]},
                           {gen_tcp,listen,2,
                            [{file,"gen_tcp.erl"},{line,279}]},
                           {partisan_acceptor_socket,init,1,
                            [{file,
                              "/home/max/work/clockwork/_build/default/lib/partisan/src/partisan_acceptor_socket.erl"},
                             {line,54}]},
                           {gen_server,init_it,2,
                            [{file,"gen_server.erl"},{line,962}]},
                           {gen_server,init_it,6,
                            [{file,"gen_server.erl"},{line,917}]},
                           {proc_lib,init_p_do_apply,3,
                            [{file,"proc_lib.erl"},{line,241}]}]}}}}}},

However, this works fine:

        {listen_addrs, [
                #{ip => {192, 168, 178, 29}, port => 10200}
            ]},

Re peer discovery, I'm going to test the partisan_peer_discovery_dns method, as it seems more flexible than the static list used by partisan_peer_discovery_list: the port(s) used with partisan might change depending on the security setup (firewall, reverse proxy and HA) of a multi-data-center network.

Thank you very much for your support!

aramallo commented 1 year ago

Notice I haven't committed those changes yet 😊. I'm doing some extra checks, but I should be able to commit in the next couple of hours. I'll let you know.

With these changes, Partisan will allow the different formats for the listen address mentioned above.

aramallo commented 1 year ago

BTW, not sure if you've noticed, but v5 offers OTP compatibility at the cost of using the partisan_ prefix for the typical module names, e.g. partisan_gen_server, with partisan_gen_supervisor as the exception.

They all use partisan_monitor (API available via partisan module that mimics the erlang module).

aramallo commented 1 year ago

@mcesaro I have just published v5.0.0-rc.8 with the changes above.

mcesaro commented 1 year ago

Hi Alejandro, all my tests passed! Thanks.