threefoldtech / 0-stor_v2

Apache License 2.0

Zstor continues trying to connect to zdb removed from config after hot reload #124

Open scottyeager opened 1 week ago

scottyeager commented 1 week ago

I found that when a backend is removed by destroying the zdb deployment (which removes the namespace from a still-active zdb), zstor continues trying to contact the old backend after it has been replaced in the config file and a SIGUSR1 has been issued to hot reload the config. Furthermore, zstor does not perform a repair to restore the expected shard count after the config is reloaded.
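
For anyone reproducing this, the reload trigger can be scripted. This is only a minimal sketch and not part of zstor: it sends SIGUSR1, the signal named in this report, to a process by PID. The `hot_reload` helper name is hypothetical, and how you find the zstor PID (for example with `pidof`) is left as an assumption about your setup.

```python
import os
import signal


def hot_reload(pid):
    """Send SIGUSR1 to the process with the given PID.

    Hypothetical helper: it only delivers the signal. Whether the target
    process then rereads its config is up to that process.
    """
    os.kill(pid, signal.SIGUSR1)


# Usage (assumption: 12345 is the PID of the running zstor daemon):
#     hot_reload(12345)
```

The same thing can be done from a shell with `kill -USR1 <pid>`.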

Here are the steps to reproduce:

  1. Deploy zstor with some backends on the grid
  2. Delete one zdb deployment (namespace is removed)
  3. Hot reload a config that replaces the deleted namespace with a new one
  4. Zstor continues trying to use the old namespace and also doesn't use the new one (no rebuilding of files)

2024-11-07 18:25:35 +00:00: INFO Reloading config after receiving SIGUSR1
2024-11-07 18:25:35 +00:00: DEBUG Config actor reloading running config from "/etc/zstor-default.toml"
2024-11-07 18:25:35 +00:00: DEBUG Config actor finished reloading running config
2024-11-07 18:25:36 +00:00: WARN Failed to get ns_info from [2a02:1802:5e:0:c477:bbff:fe20:f86b]:9900 18-702278-data2007: ZDB at [2a02:1802:5e:0:c477:bbff:fe20:f86b]:9900 18-702278-data2007, error operation READ caused by Namespace: not found
2024-11-07 18:25:36 +00:00: WARN Failed to get ns_info from [2a02:1802:5e:0:c477:bbff:fe20:f86b]:9900 18-702278-meta2007: ZDB at [2a02:1802:5e:0:c477:bbff:fe20:f86b]:9900 18-702278-meta2007, error operation READ caused by Namespace: not found
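
To check quickly which backends zstor is still polling after a reload, the WARN lines can be scanned for address and namespace pairs. This is a rough sketch, assuming each log message sits on a single line and follows the `Failed to get ns_info from <address> <namespace>:` shape seen above. The `failing_backends` name and the regular expression are illustrative only and are not taken from the zstor code.

```python
import re

# Matches e.g. "WARN Failed to get ns_info from [addr]:9900 some-namespace:"
# (pattern inferred from the log lines in this report, not from zstor's source).
NS_INFO_FAILURE = re.compile(
    r"WARN Failed to get ns_info from (\[[^\]]+\]:\d+) ([^\s:]+):"
)


def failing_backends(log_text):
    """Return the set of (address, namespace) pairs that failed ns_info lookups."""
    return {match.groups() for match in NS_INFO_FAILURE.finditer(log_text)}
```

Comparing the returned set with the backends listed in the reloaded config shows whether any removed namespace is still being contacted.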

I can see from running the status command that no data is written to the new backends. In this case I replaced both a data backend and a metadata backend.

What I would expect to happen in this case is for zstor to stop trying to reach the backends removed from config and rebuild both data and metadata on the newly provided backends. Restarting zstor also doesn't trigger the repair process for some reason.
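
The expected behavior amounts to diffing the backend list before and after the reload: stop contacting whatever was removed, and rebuild onto whatever was added. Here is a minimal sketch of that comparison. It is not zstor's implementation (zstor is written in Rust); the `Backend` type, the `diff_backends` function and the replacement address and namespace in the usage lines are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Hypothetical stand-in for one zdb backend entry in the config."""
    address: str
    namespace: str


def diff_backends(old, new):
    """Return (removed, added) between two config generations, keeping order."""
    old_set, new_set = set(old), set(new)
    removed = [b for b in old if b not in new_set]  # should no longer be contacted
    added = [b for b in new if b not in old_set]    # candidates for rebuilt shards
    return removed, added


# Usage with one backend from the logs above and a made-up replacement:
old = [Backend("[2a02:1802:5e:0:c477:bbff:fe20:f86b]:9900", "18-702278-data2007")]
new = [Backend("[2001:db8::1]:9900", "example-replacement-namespace")]
removed, added = diff_backends(old, new)
```

With an empty `removed` and `added`, a reload would be a no-op. Otherwise the `removed` entries are the ones that should disappear from the WARN logs.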

iwanbk commented 2 days ago

it should be fixed by https://github.com/threefoldtech/0-stor_v2/pull/126 as well

confirmed:

iwanbk commented 2 days ago

"What I would expect to happen in this case is for zstor to stop trying to reach the backends removed from config and rebuild both data and metadata on the newly provided backends. Restarting zstor also doesn't trigger the repair process for some reason."

Oh, I just realized that there are a few separate things in the description above.

@scottyeager I suggest you create new issues for those things if you have more info on them. If you don't, I could create them for you.

scottyeager commented 1 day ago

Hi @iwanbk,

Thanks for looking into this. Indeed there could be a couple extra issues from this one. I wanted to test after this one is resolved and see if the other issues still arise in the course of normal use.

Let me do some testing with the new code and I'll open more detailed issues for any remaining problems.

iwanbk commented 15 hours ago

Hi @scottyeager, I assigned this issue to you for checking, and for this:

"Indeed there could be a couple extra issues from this one"

scottyeager commented 1 hour ago

The core issue has been resolved and zstor no longer tries to connect to the deleted namespaces after a config reload. I do still see a bunch of "WARN Failed to delete removed metric by label: Error: missing labels" log lines after the config reload, but those eventually stop.

Leaving this open for now in case I need to split off some new issues, since I didn't have a chance to test the other behaviors yet.