rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.
Other
813 stars 96 forks source link

Use of erlang send causing badarg crashes #261

Closed adrianroe closed 2 years ago

adrianroe commented 2 years ago

Erlang send is used for IPC to named instances on remote machines https://github.com/rabbitmq/ra/blob/main/src/ra_server_proc.erl#L1462

This causes badarg exceptions to be raised if any of the addressed instances are not present.

It is fairly easy to trigger - we have code with bootstrap nodes manager-a, manager-b and witness and then a dynamically added node encoder-a (there's also an encoder-b, but you can trigger the issue with just one dynamic node).

Start all bootstrap nodes Start encoder-a and make it the leader (not sure if make leader is required, but it is what our test does) Kill encoder-a Restarting encoder-a triggers a badarg error 4 times out of 5.

The error does not seem to occur if all of the nodes (including encoder-a) are bootstrap nodes.

We have experimented with putting a try / catch around the send block and then returning a noconnect response, but that causes crashes on https://github.com/rabbitmq/ra/blob/main/src/ra_server_proc.erl#L1201 instead

We have also tried folding over the To members in the send function to send the messages one by one to see what impact that has (there are always multiple destinations in the To field when the error occurs), but that results in a leader_saw_append_entries_rpc_in_same_term error instead

We came across the above as part of our testing for the changes we are making for https://github.com/rabbitmq/ra/issues/251 but the issue also occurs on current main without any transfer-leader changes...

adrianroe commented 2 years ago

badargStackTrace.txt

adrianroe commented 2 years ago

I guess we could look to see if the message was of the form {'$gen_cast',...} and then return ok instead of noconnect. I'll try that and see how things fair...

adrianroe commented 2 years ago

Feeling a bit like whack a mole as that (returning ok instead) makes other errors pop up instead. I'll stop guessing :)

lukebakken commented 2 years ago

Thanks for the report. The most helpful thing at this point would be to provide a repo with code we can clone, compile, and run to see this behavior so we don't have to guess how to reproduce it.

adrianroe commented 2 years ago

Thanks. I'll try to get a repro in straight Erlang - we write our code in Purerl (Purescript that compiles to erlang) which probably isn't what you are after!

kjnilsson commented 2 years ago

Ok I'm not 100% sure what is going on here but this comment "We have also tried folding over the To members in the send function" makes me think that something isn't used right and that To in this case is a list of pids or similar. This is not supported. To needs to be a pid() or a registered name or {RegisteredName, node()}.

kjnilsson commented 2 years ago

Sending a message over distribution to a non existent pid or {RegName, node()} will not result in an error. It is true that sending a message to a non existent locally registered name will error and may be something we need to handle but that is the special case.

adrianroe commented 2 years ago

Interesting about the remote case. I have to confess, when I saw To being a list in the stack trace I assumed it was an undocumented feature of erlang:send that you were using, perhaps for performance reasons. It did not occur to me that that was the cause of the badarg!

adrianroe commented 2 years ago

...hopefully it's clear to you in the stack trace attachment that the To is a list. Don't think I'm reading it that wrong!

michaelklishin commented 2 years ago

@adrianroe that list can be passed in from your own code. We don't guess on this team. Please provide an executable way to reproduce.

adrianroe commented 2 years ago

Ok - deeply embarrassed. It is indeed my bug. 1000 apologies.

I have been bitten in the past by erlang:send can return badarg and looked for a clever answer to a stupid bug (a transpose of 2 arguments in the purescript to erlang FFI code). @kjnilsson - would love to know where the remote calls not being able to return badarg comes from (other than maybe the erlang:send nif code? or doing it and seeing that it does not crash!)

Many thanks for your help and an amazing library.

michaelklishin commented 2 years ago

erlang:send/2 cannot return [many kind of] errors for remote pids for two reasons:

The only way to know if your message has reached your peer is when you receive a confirmation with a matching "reference" of some kind. This is true for Erlang, RabbitMQ, TCP implementations and many other systems.

michaelklishin commented 2 years ago

For erlang:send/2 to reason about named processes, it'd have to somehow discover (and maintain) a list of registered names on other nodes, which would run into all of the above issues plus the consistency vs. availability design decision which language and runtime developers cannot make for you.

For local processes, verifying if a pid or local name exists is trivial. So I guess they do as much as they practically can.

adrianroe commented 2 years ago

Completely agree - my surprise is probably more the badarg locally as it is still a 2-stage process (find the pid of the locally registered process then send it a message - by which time the pid might not be running!), so absence of a badarg still does not mean your message got to the target. It getting to the target obviously does not tell you whether it is ever processed etc...

It's more that the docs aren't clear to me The function fails with a badarg run-time error if Dest is an atom name, but this name is not registered. I totally get why they don't provide that feedback remotely, but had interpreted atom name to include anything in a name registry (be that local or remote). Also completely agree that not waiting for any remote name issues makes most sense - but as above, why even do that locally.

Huge thanks once again for your help.