
Wday merge - exposed issues #21

Open martinsumner opened 3 weeks ago

martinsumner commented 3 weeks ago

Some issues exposed by merging in wday tests.

group.sh

The group test script now stops at the first failed test, rather than continuing and reporting the list of tests and their results at the end.

location

The location change broke backwards compatibility by limiting the claim configuration to a function name only, where previously it had been {Module, Function}. Anyone using this config, for example to select choose_claim_v3, would now be broken (including the workday test). As an additional complication, the claim functions moved from the riak_core_claim module, which was renamed to riak_core_membership_claim during refactoring.

The algorithm_supported function needs to be changed:

%% Check across the current riak_core_claim_swapping module, the renamed
%% riak_core_membership_claim module, and the legacy riak_core_claim module,
%% so that both old and new claim configurations remain supported.
-spec algorithm_supported(ClaimAlgorithm :: atom()) -> boolean().
algorithm_supported(ClaimAlgorithm) ->
    erlang:function_exported(riak_core_claim_swapping, ClaimAlgorithm, 3)
        orelse erlang:function_exported(riak_core_membership_claim, ClaimAlgorithm, 3)
        orelse erlang:function_exported(riak_core_claim, ClaimAlgorithm, 3).

Also the configuration of the claim function:

{riak_core, [
    {choose_claim_fun, ClaimAlgorithm},
    ...
]}

verify_handoff_zlib

Support for the legacy encode_zlib encoding was removed as part of Riak 2.9.0, but the test continued to pass because it had never set the encoding configuration correctly. Setting the encoding to encode_zlib correctly causes handoff to fail.
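
For reference, what "setting the encoding correctly" amounts to is roughly the following - a hedged sketch assuming the riak_core override_capability mechanism; the exact keys here are an assumption, not a confirmed fix:

%% Hedged sketch only: build app config that forces the handoff encoding
%% capability, to be applied with rt:update_app_config/2 before the node
%% is (re)started, e.g. rt:update_app_config(Node, handoff_encoding_config(encode_zlib)).
handoff_encoding_config(Encoding) ->
    [{riak_core,
        [{override_capability,
            [{handoff_data_encoding, [{use, Encoding}]}]}]}].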

The test should be removed, along with the defaulting of the capability fetch to the legacy value (or indeed the capability check should be removed altogether). There can be issues when nodes have just started, with capability checks falling back to default values even where no cluster member requires the legacy setting.

verify_riak_stats

Organisation-specific versions of the tests may add a number of additional stats. A specific rt_config item of organisation should exist so that organisation-specific stats can be checked for inclusion, without other organisations failing on missing keys.

e.g.

organisation_stats() ->
    %% Return the organisation-specific stats to check for, based on the
    %% organisation rt_config item; default to no additional stats.
    case rt_config:get(organisation, undefined) of
        nhse ->
            nhse_stats();
        bet365 ->
            bet365_stats();
        _ ->
            []
    end.
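
The organisation item would then just be an entry in the rt_config used for the run; for example (a hypothetical entry, assuming the usual riak_test config file layout):

%% In the riak_test config file (e.g. ~/.riak_test.config):
{default, [
    {organisation, nhse}
    %% ... other rt_config items ...
]}.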

verify_2i_limit

The wday branch fails due to a change to the secondary_index_tests stream_loop/1 function to error on the "Got a wat" message, rather than loop. This change seems like the right thing to do.

The issue is that a previous HTTP query has a straggling message that then leads to a new PB stream receiving a rogue {Ref, done} message prior to receiving any genuine data.

This looks like an error in the riak erlang http client: it should be waiting for a done message before closing its receive loop.

How to fix this in the riak erlang client isn't obvious though. Even when streaming isn't requested, ibrowse streaming is used to fetch the results (and in turn streaming index queries within riak). This is then processed with the help of the mochiweb_multipart module, for added complexity. In this case, as a full response has been received, there is a {done, Continuation} followed by a done; but a done might not always follow in other circumstances.

The riak erlang http client probably needs radical changes rather than a simple fix. I think the use of stream_index to service get_index is wrong: using stream_index generates extra work which might not be desired. It would be preferable for get_index to do a simple httpc:request, or at least an ibrowse:send_req without stream_to (ibrowse) and without stream in the GET options.
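
As a rough illustration of the direction only (the helper name and response handling here are assumptions, not a proposed patch), a non-streaming get_index might look something like:

%% Hypothetical sketch: fetch a secondary index result in one request,
%% rather than via ibrowse streaming plus mochiweb_multipart parsing.
%% index_url/4 is an assumed helper building .../buckets/B/index/I/K.
get_index_simple(Rhc, Bucket, Index, Key) ->
    Url = index_url(Rhc, Bucket, Index, Key),
    case ibrowse:send_req(Url, [{"Accept", "application/json"}], get) of
        {ok, "200", _Headers, Body} ->
            {struct, Props} = mochijson2:decode(Body),
            {ok, proplists:get_value(<<"keys">>, Props, [])};
        {ok, Status, _Headers, Body} ->
            {error, {Status, Body}};
        {error, Reason} ->
            {error, Reason}
    end.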

basic_command_line

Fails when testing riak console on OTP 26 (with the node down). There are no obvious issues when running riak console manually.

This passes on OTP 26 on nhse-develop-3.4, but not when merged with wday-develop-3.2 - whether using the rt:interact or the original rt:console form of the test.

??

nextgenrepl_bouncingtomb

The test was not waiting for a full restart after the config change - i.e. it waited for the node to be pingable, but not for the riak_kv service to start, and so there were insufficient vnodes available when the config change/restart was followed by a coverage query. There appears to be no behaviour change between the wday and nhse branches, but the additional wait resolves the issue.
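
The additional wait is essentially the following - a sketch using the standard rt helpers; the exact surrounding test code differs:

%% After the config change and restart, wait for the riak_kv service to
%% be registered - not just for the node to respond to ping - before
%% issuing any coverage query.
restart_and_wait(Node) ->
    rt:stop_and_wait(Node),
    rt:start(Node),
    rt:wait_until_pingable(Node),
    ok = rt:wait_for_service(Node, riak_kv).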

nextgenrepl_rtq_peerdiscovery

After forcing a change (in Cluster B), there is a delayed check that the change leads to a change of peers in Clusters A and C when peer discovery is manually prompted. However, if automated peer discovery has occurred before the manual prompt, the manual prompt may lead to no changes - and so fewer than length(Nodes) nodes will report a change of peer. Auto discovery kicks in every 60 seconds, so this should be intermittent at worst, but the timings of the test seem to regularly align to trigger it.

Some stability is gained by removing the wait after the change, and also by making sure the initial max time has elapsed before commencing the test (so that the next tick will be at a random interval within the larger ?STD_MAX time). Being strict on equality is probably not necessary either.

verify_mr_prereduce_node_down

There is a potential {error, timeout} when doing a PUT during the node stop. The test uses stop rather than stop_and_wait (and there is no wait for handoffs), so preflists could still be changing; forwarding and re-forwarding need looking at. As this isn't a test of PUT, this is bypassed by waiting for now.
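
If a less arbitrary fix is wanted later, one option (a sketch only, not what the test currently does) would be to stop the node cleanly and let handoff settle before the PUT load:

%% Hedged alternative: stop the node and wait for transfers to settle on
%% the remaining nodes, so preflists are stable when the PUTs are issued.
stop_and_settle(NodeToStop, RemainingNodes) ->
    rt:stop_and_wait(NodeToStop),
    rt:wait_until_transfers_complete(RemainingNodes).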

verify_bitcask_expiry_stats

This fails when gathering the stat (which is perhaps not available until a future merge). The organisation config is being used to bypass the stat check for now.

replication2

It isn't clear why this is failing now when it didn't before. The issue occurs fairly consistently in the part of the test that fails the cluster manager leader in each cluster.

handle_cast(start_fullsync, #state{socket=undefined} = State) ->
    %% not connected yet...
    {noreply, State#state{pending_fullsync = true}};

I suspect something has changed in the timing of the test (perhaps the stops in ClusterA and ClusterB are closer together) to mean this now fails, but it appears to be a genuine issue. The riak_repl2_fscoordinator appears to assume that starting a connection request will always result in a success or failure response, when instead it can get no response at all. There then appears to be nothing to prompt the system out of this state.

The obvious fix is to fail any requests that are cancelled, at the point they are cancelled (in riak_core_connection_mgr), and this appears to make the test work.
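
Roughly, the idea is the following - a sketch only, where the shape of the client spec and the helper name are hypothetical rather than the actual riak_core_connection_mgr internals:

%% When a pending connection request is cancelled, notify the requesting
%% module through its connect_failed/3 callback rather than dropping the
%% request silently, so callers such as riak_repl2_fscoordinator see a
%% failure and can retry, instead of waiting indefinitely.
fail_cancelled_request({{Proto, _Vsns}, {_TcpOpts, Module, Args}}) ->
    Module:connect_failed(Proto, {error, cancelled}, Args).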

The test then intermittently fails when checking real-time replication during a node stop. This does PUT, and stops the node being PUT to during the stream of PUTs. It then counts how many PUTs fail, and how many GETs fail for the same key-space in the other cluster - in effect, expecting every PUT that succeeded to be found in the other cluster.

However, I'm not sure this is right by design. The repl being queued is done via a postcommit hook, and the postcommit hook is fired after the reply to the client from the FSM, so it remains possible that a write could succeed despite the node stopping before the write was added to the replication queue. Hence the read errors on the sink could exceed the write errors on the source.

This might be very unlikely, but with changes to OTP versions the possibility of the client winning the race could increase. The postcommit state is triggered by a zero timeout, so it could be interrupted by a shutdown message.

replication2_ssl

This test will pass on OTP 24, but not on OTP 26. This is because the default tls_options used within riak_core_ssl_util are passed to both the client and server side, even when the options are client-only or server-only; OTP 26 will error rather than ignore such options.

https://github.com/nhs-riak/riak_core/pull/12
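
Presumably the fix splits the options by role rather than passing one set to both sides. A minimal sketch of that idea (not the actual riak_core_ssl_util code):

%% Keep the options common to both roles together, and only add
%% client-only options (e.g. server_name_indication) on the connect side
%% and server-only options (e.g. fail_if_no_peer_cert) on the listen
%% side, since OTP 26 errors on misplaced options rather than ignoring them.
common_tls_opts(CertFile, KeyFile, CAFile) ->
    [{certfile, CertFile}, {keyfile, KeyFile},
     {cacertfile, CAFile}, {verify, verify_peer}].

client_tls_opts(CertFile, KeyFile, CAFile, Host) ->
    common_tls_opts(CertFile, KeyFile, CAFile) ++
        [{server_name_indication, Host}].

server_tls_opts(CertFile, KeyFile, CAFile) ->
    common_tls_opts(CertFile, KeyFile, CAFile) ++
        [{fail_if_no_peer_cert, true}].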

.... [work ongoing] .....

martinsumner commented 3 weeks ago

Groups passing so far:

Groups blocked

Groups to do

tburghart commented 2 weeks ago

Helpful comments, so much left to update ...

tburghart commented 2 weeks ago

I like the idea of an org_stats (name TBD) config property, though maybe with additive and subtractive variants ... OTOH, it would be good to get all of our [collective] updates pushed to reduce or eliminate the disparity. Still, the config options could alleviate much of the pain during development cycles. For consideration, they may apply to more than just verify_riak_stats - maybe a scoping mechanism would be worthwhile:

    {<test-name-atom>, [
        {add_stats, [
            <stat-name-atom> ...
        ]},
        {sub_stats, [
            <stat-name-atom> ...
        ]}
    ]}

Within the scope of the outer <test-name-atom>, the test should be able to define whatever it wants; the sketch above just uses what I expect will be relevant to verify_riak_stats. I'm working on more focused per-test context available through riak_test_runner:metadata/? or a similar API; I'll consider something like this as well.

martinsumner commented 1 week ago

As part of this testing, I spotted an issue with group.sh. Within group.sh there is a while read ... done loop over "$groupfile". However, entries in the group files need a trailing newline to be included, and in some of the files (e.g. groups/ensemble) there is no newline after the final entry.

I think in the past I've spotted this in some group files, and added a newline, but we shouldn't rely on remembering to do this:

{ cat "$groupfile"; echo; } | while read t; do
    if [ -n "$t" ]; then
        cp "$TEST_EBIN/$t.beam" "$BASE_DIR/group_tests/$GROUP"
    fi
done