rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

`ra_log_cache_key_not_found` exception exit occurred #416

Closed: sile closed this issue 2 months ago

sile commented 5 months ago

Describe the bug

The following exception was raised when processing a consistent query:

exception exit: {ra_log_cache_key_not_found,15}
      in function  ra_log_cache:fetch/2 (_build/default/lib/ra/src/ra_log_cache.erl, line 68)
      in call from ra_log:'-resend_from0/2-fun-0-'/3 (_build/default/lib/ra/src/ra_log.erl, line 932)
      in call from ra_log:'-resend_from0/2-lists^foldl/2-0-'/3 (_build/default/lib/ra/src/ra_log.erl, line 931)
      in call from ra_log:resend_from/2 (_build/default/lib/ra/src/ra_log.erl, line 915)
      in call from ra_log:handle_event/2 (_build/default/lib/ra/src/ra_log.erl, line 456)
      in call from ra_server:handle_follower/2 (_build/default/lib/ra/src/ra_server.erl, line 1123)
      in call from ra_server_proc:handle_follower/2 (_build/default/lib/ra/src/ra_server_proc.erl, line 1090)
      in call from ra_server_proc:follower/3 (_build/default/lib/ra/src/ra_server_proc.erl, line 794)
      in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1395)

Reproduction steps

I am unable to provide reproduction steps because the exception occurs very rarely, and it happened while running test code for our non-open-source product. (Feel free to ignore this issue if you think there is insufficient information.)

Instead, here is a rough outline of the test scenario:

  1. Construct a cluster of 5 nodes named a, b, c, d, and e.
  2. Regularly monitor the availability of the cluster by running a consistent query.
  3. Divide the nodes into two groups: {a,b} forms one group, while {c,d,e} forms the other group.
    • We use a custom Erlang distribution module that emulates a very slow network: when a certain flag is enabled, communication between the connected nodes is severely limited.
  4. Restore the split cluster to its normal state.

The exception mentioned above appears to have occurred on a node in the majority group ({c,d,e}) while a consistent query (for the health check) was running, just after step 3.
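For readers who want to picture the health check described in the scenario, here is a minimal sketch, not the reporter's actual test code. The module name `ra_health_check`, the functions `check/3` and `check_with/2`, the no-op query fun, and the 100 ms retry delay are all hypothetical. The only ra call used is `ra:consistent_query/3`. How a crashed server process surfaces to the caller (an exit, an error tuple, or a timeout) is an assumption here, so the sketch treats all three as "unavailable" and retries.

```erlang
-module(ra_health_check).
-export([check/3, check_with/2]).

%% Hypothetical helper, not part of ra.
%% ServerId is a ra server id such as {Name, Node}.
%% The query fun ignores the machine state; we only care whether
%% the cluster can answer a consistent query at all.
check(ServerId, Timeout, Retries) ->
    QueryFun = fun(_MachineState) -> ok end,
    check_with(fun() -> ra:consistent_query(ServerId, QueryFun, Timeout) end,
               Retries).

%% Runs the zero-arity Query fun, retrying up to Retries more times.
%% Returns ok on success, or {unavailable, LastReason} otherwise.
check_with(Query, Retries) when is_function(Query, 0), Retries >= 0 ->
    Result =
        try Query() of
            {ok, _Reply, _Leader} -> ok;
            Other -> {unavailable, Other}   % e.g. {error, _} or {timeout, _}
        catch
            %% Assumption: a crashed or restarting server may show up
            %% as an exit in the calling process.
            exit:Reason -> {unavailable, {exit, Reason}}
        end,
    case Result of
        ok ->
            ok;
        _ when Retries > 0 ->
            timer:sleep(100),               % arbitrary delay before retrying
            check_with(Query, Retries - 1);
        _ ->
            Result
    end.
```

A call such as `ra_health_check:check({my_cluster, node()}, 5000, 3)` would then report `ok` once any attempt succeeds. Separating `check_with/2` from the ra call keeps the retry logic testable without a running cluster.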

Expected behavior

No exception exit occurs.

Additional context

No response

kjnilsson commented 4 months ago

We have seen this error a couple of times but have not been able to track it down. After the process has crashed and restarted it should be ok typically. Was this the case here?

sile commented 4 months ago

I see. Thank you for your response.

> After the process has crashed and restarted it should be ok typically. Was this the case here?

Yes, our test case almost passed (everything except the crash log check at the end of the test case) even though the exception occurred. So it seems this exception did not cause a critical problem.

kjnilsson commented 2 months ago

I am very confident that #428 will fix this issue.

sile commented 2 months ago

Great!