Open scottyeager opened 1 week ago
I ran through this test again and observed slightly different behavior. This time the rebuilding never happened no matter how long I waited.
Basically it goes on like this forever:
2024-11-23 00:42:23 +00:00: DEBUG checking data dir size
2024-11-23 00:42:23 +00:00: DEBUG Directory "/data/data/zdbfs-data/" is within size limits (1048992526 < 2684354560)
2024-11-23 00:43:23 +00:00: DEBUG Starting metastore key iteration with prefix /zstor-meta/meta/
2024-11-23 00:43:23 +00:00: DEBUG checking data dir size
2024-11-23 00:43:23 +00:00: DEBUG Terminating scan: No: more data
2024-11-23 00:43:23 +00:00: DEBUG Directory "/data/data/zdbfs-data/" is within size limits (1048992526 < 2684354560)
2024-11-23 00:44:23 +00:00: DEBUG checking data dir size
2024-11-23 00:44:23 +00:00: DEBUG Directory "/data/data/zdbfs-data/" is within size limits (1048992526 < 2684354560)
2024-11-23 00:45:23 +00:00: DEBUG checking data dir size
2024-11-23 00:45:23 +00:00: DEBUG Directory "/data/data/zdbfs-data/" is within size limits (1048992526 < 2684354560)
2024-11-23 00:45:47 +00:00: INFO Triggering repair after receiving SIGUSR2
2024-11-23 00:45:47 +00:00: DEBUG Starting metastore key iteration with prefix /zstor-meta/meta/
2024-11-23 00:45:47 +00:00: DEBUG Terminating scan: No: more data
2024-11-23 00:46:23 +00:00: DEBUG checking data dir size
2024-11-23 00:46:23 +00:00: DEBUG Directory "/data/data/zdbfs-data/" is within size limits (1048992526 < 2684354560)
2024-11-23 00:47:23 +00:00: DEBUG checking data dir size
While reading the logs, the line Terminating scan: No: more data gave me an idea:
maybe the issue is somehow linked to the empty backend causing this error and interrupting the repair process.
So I changed the test to look like this:
As soon as new data is written into the fresh backend, the rebuilding kicks off as expected. At this point the metadata also finally gets rebuilt.
I'm beginning to suspect that this issue is really not a separate issue from https://github.com/threefoldtech/0-stor_v2/issues/131. So far I've always been replacing both a metadata and a data backend at the same time, so I can't say that this behavior is specific to data backend replacements.
> I'm beginning to suspect that this issue is really not a separate issue from https://github.com/threefoldtech/0-stor_v2/issues/131. So far I've always been replacing both a metadata and a data backend at the same time, so I can't say that this behavior is specific to data backend replacements.
yes, should be the same
After more checking and getting more familiar with the code, I found that it is a different issue.
I did the following steps:
Then I waited for the repair subsystem to kick in and restore the expected shard count using the new empty backend. I understand from here:
https://github.com/threefoldtech/0-stor_v2/blob/cd24f423488293230dec137629d2ad270f666ba4/zstor/src/actors/repairer.rs#L11
that the repair cycle should run every ten minutes. However, it was exactly 40 minutes before any rebuilds were triggered:
From here the logs contain many more similar rebuild messages. In fact, the rebuilding seems to continue forever until the backends are full, but that's another issue.