travelping / hello

Erlang RPC server framework
MIT License
31 stars 20 forks source link

improve subscription reliability #90

Open rixmann opened 8 years ago

rixmann commented 8 years ago

When replying asynchronously from a hello server, the server context disappears (due to network failure). For subscriptions to work reliable this mechanism has to become more robust.

currently on the server side this error message is observed when trying to push a message to a client where the server context was removed by hello:

Mar 09 17:53:02 vlx129-tpb pcs[186]: ** (MatchError) no match of right hand side value: :invalid_identity
Mar 09 17:53:02 vlx129-tpb pcs[186]: 17:53:02.749 [error] GenServer #PID<0.1693.0> terminating
Mar 09 17:53:02 vlx129-tpb pcs[186]: Supervisor hello_listener_supervisor had child {hello_zmq_listener,{ex_uri,"zmq-tcp",
                                                                 {ex_uri_authority,undefined,"172.30.3.10",27001},
                                                                 [],undefined,undefined}} started with hello_zmq_listener:start_link({ex_uri,"zmq-tcp",{ex_uri_authority,undefined,"172.30.3.10",27001},[],undefined,undefined}) at <0.1693.0> exit with reason no match
Mar 09 17:53:02 vlx129-tpb pcs[186]: CRASH REPORT Process <0.1693.0> with 0 neighbours exited with reason: no match of right hand value invalid_identity in hello_zmq_listener:handle_info/2 line 99 in gen_server:terminate/7 line 804
Mar 09 17:53:02 vlx129-tpb pcs[186]: gen_server <0.1693.0> terminated with reason: no match of right hand value invalid_identity in hello_zmq_listener:handle_info/2 line 99

in case of network failure it would be ideal if hello could cash the messages to be send and only discard the context after the connection is long dead (to be determined by heartbeats?).

the client must also identify the dead connection after a similar timeout, then crash the hello client (which is to be restarted by a supervisor).

surik commented 8 years ago

Here is a fix for this crash: https://github.com/travelping/hello/commit/bb7943cd4f14e4254fb8200adb5d4e063424c096. Please update hello.

I will look more into reconnecting client after context disappeared.

thz commented 8 years ago

but that is just changing crash->errorlog, while not touching the reason

surik commented 8 years ago

@thz yes, I just noticed that here isn't the last hello.

hwinkel commented 8 years ago

hhm. was such a robust behavior not one of the reasons to have something like hello. The underlaying tools like zeroMQ or Erlang distribution or lately http2 give you already the building blocks. I.e. ZeroMQ would reconnect sockets after TCP failure and Erlang Node Communication kann also be configured with more aggressive connectivity checks. And as far i know someone has implemented hello echos above the transport for exactly this reasons.

surik commented 8 years ago

Hello doesn't have any subscription mechanism. Here is just a notify method which allows to send message from server to client. In this case subscription mechanism is a part of application layer. As improvements for implementing good pubsub over hello with reused socket I may suggest add feedback for notify method. If happened something wrong you will know about that and be able to remove subscription in your application.