The case when storage and router iproto fiber is cancelled

filonenko-mikhail commented 2 years ago

Hello,

There is a case where something happens and the storage fiber gets cancelled (e.g. by a cartridge hot reload or any other fiber killer).

A snippet that reproduces the problem:

vshard = require('vshard')
netbox = require('net.box')

cfg = {
    memtx_memory = 100 * 1024 * 1024,
    bucket_count = 3,
    rebalancer_disbalance_threshold = 10,
    rebalancer_max_receiving = 100,
    sharding = {
        ['cbf06940-0790-498b-948d-042b62cf3d29'] = {
            replicas = {
                ['8a274925-a26d-47fc-9e1b-af88ce939412'] = {
                    uri = 'storage:storage@127.0.0.1:3301',
                    name = 'storage_1_a',
                    master = true
                },
            },
        },
    },
}

vshard.storage.cfg(cfg, '8a274925-a26d-47fc-9e1b-af88ce939412')
box.schema.user.grant('storage', 'super', nil, nil, {if_not_exists=true})

vshard.router.cfg(cfg)
vshard.router.bootstrap()

local log = require('log')
local fiber = require('fiber')
rc, err = vshard.router.callrw(1, 'box.info')
assert(rc ~= nil)
log.info(rc)
--log.info(fiber.info())

c = netbox.connect('127.0.0.1:3301', {user="storage", password="storage"})
log.info('before netbox call')
log.info(c:call('box.info'))

for id, f in pairs(fiber.info()) do 
    if f.name:endswith('(net.box)') then
        fiber.kill(fiber.find(id))
    end
end

rc, err = vshard.router.callrw(1, 'box.info')
assert(rc == nil)
log.info(rc)

rc, err = vshard.router.callrw(1, 'box.info')
assert(rc == nil)

log.info('after netbox call')
-- c:call() takes the function name as a string, not a table.
local ok, res = pcall(c.call, c, 'box.info')
if not ok then
    log.info(res)
end

c = netbox.connect('127.0.0.1:3301', {user="storage", password="storage"})
log.info('after netbox call with reloaded connection')
log.info(c:call('box.info'))

package.loaded['vshard'] = nil
local vshard = require('vshard')
rc, err = vshard.router.callrw(1, 'box.info')
log.info(rc)
assert(rc ~= nil, tostring(err))

require('console').start() os.exit(0)

The question is: how can the netbox connection under vshard.router be restarted? Or can this be handled on the vshard side?

Serpentian commented 2 years ago

Actually, the router and storage are not reloaded when we do something like this:

package.loaded['vshard'] = nil
local vshard = require('vshard')

Since the user expects everything to be reloaded, I suppose we should implement an atomic reload of the whole vshard.
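To illustrate why clearing a single package.loaded entry is not a full reload, here is a minimal plain-Lua sketch. It assumes the top-level module mostly re-exports its submodules, which keep their own package.loaded entries. The module names demo and demo.router are hypothetical stand-ins, not the real vshard modules, and vshard may additionally preserve state across reloads in ways this sketch does not model.

```lua
-- Hypothetical stand-ins for a top-level module and one of its submodules.
local load_count = 0

package.preload['demo.router'] = function()
    load_count = load_count + 1
    return {generation = load_count}
end

package.preload['demo'] = function()
    -- The top-level module only re-exports the submodule.
    return {router = require('demo.router')}
end

local first = require('demo')

-- Drop only the top-level entry, as in the snippet above.
package.loaded['demo'] = nil
local second = require('demo')

-- The wrapper table is new, but the submodule came from the cache.
assert(first ~= second)
assert(first.router == second.router)
assert(load_count == 1)

-- Dropping the submodule entry as well forces it to be loaded again.
package.loaded['demo'] = nil
package.loaded['demo.router'] = nil
local third = require('demo')
assert(third.router ~= first.router)
assert(load_count == 2)
```

In this model, the second require returns the same router table, so any connections it holds would be untouched.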

As for restoring fibers after they have been explicitly killed, we can do that in replicaset.rebind_replicasets. This will restore the connection when the router is reloaded. The other solution is to add a check for whether the connection's fiber is dead right here: https://github.com/tarantool/vshard/blob/dd70cfb2c5ec36ab7d5355b0024e5f6d21bb8f9f/vshard/replicaset.lua#L173-L177 Since this method is invoked in replicaset_master_call, the fibers will be restored there too.

Gerold103 commented 2 years ago

Most replicaset methods, such as rebind_replicasets(), are internal; people shouldn't use them in their own code. A proper fix has two parts. 1) Make the core netbox report its worker fiber state as closed if the fiber is cancelled. I suspect it is currently reported as error_reconnect or something similar, which is misleading, because it is not reconnecting anymore. Alternatively, make netbox spawn a new fiber if the current one is cancelled. 2) replicaset_connect_to_replica() can check whether the state == error_reconnect (or whatever the name is) and then also check the fiber state somehow (I don't know if the worker fiber state is reachable at all). If the fiber is dead or cancelled, create a new connection. Users shouldn't need to bother with any of this.
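The "create a new connection if the old one is dead" step could be sketched roughly as follows. This is not vshard's implementation: because it is unclear whether the worker fiber state is reachable, the sketch swaps in a ping with a timeout as a liveness probe. The helper ensure_connection and the stub fake_conn are hypothetical names; the only net.box calls assumed are conn:ping({timeout = ...}) and conn:close(), and whether ping reliably fails after the worker fiber is killed is itself an assumption.

```lua
-- Hypothetical helper: return a usable connection, replacing a dead one.
-- `conn` must expose ping/close like a net.box connection.
-- `connect` is a caller-supplied factory, e.g. one wrapping netbox.connect.
-- Returns the connection and a flag telling whether it was replaced.
local function ensure_connection(conn, connect, timeout)
    if conn ~= nil then
        -- pcall guards against ping raising instead of returning false.
        local ok, alive = pcall(conn.ping, conn, {timeout = timeout})
        if ok and alive then
            return conn, false
        end
        -- Best effort: release the broken connection before replacing it.
        pcall(conn.close, conn)
    end
    return connect(), true
end

-- Stub connection so the sketch runs without a Tarantool instance.
local function fake_conn(alive)
    return {
        closed = false,
        ping = function(self, opts) return alive end,
        close = function(self) self.closed = true end,
    }
end

local dead = fake_conn(false)
local fresh, replaced = ensure_connection(dead, function()
    return fake_conn(true)
end, 0.5)
assert(replaced and fresh ~= dead and dead.closed)
```

Inside Tarantool, the factory would presumably be something like function() return netbox.connect(uri, opts) end. Doing the check inside vshard, as proposed above, would keep this out of user code.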

tarantool / vshard

The case when storage and router iproto fiber is cancelled #341