quantcast / qfs

Quantcast File System

re-balancing chunks bug #237

Open shtephan opened 5 years ago

shtephan commented 5 years ago

Re-replication triggered by utilization re-balancing doesn't check whether the destination chunk server already has the chunk stored on it. If the chunk initially had 2 replicas, one on the re-balance source chunk server and one on the re-replication destination chunk server, then after re-balancing the file will only have one replica, on the destination server. I can provide additional logs from my environment to prove it if needed.

mikeov commented 4 years ago

I'd guess the implicit assumption here is that file replication is greater than 1, possibly 2. If replication is 1 then, if I read the description correctly, the system works as expected. If replication is greater than 1 then it isn't. Though in both cases re-replication would not occur; the replication check might only need to delete extra replicas in the case where the actual replication is greater than the target.

shtephan commented 4 years ago

Our problem is that we lose redundancy in the case described above for files with replication > 1. After the re-balancing process ends, a file with replication=2 may have only one replica, because re-balancing tries to place the re-balanced chunk on a server that already holds an existing replica of it. So, in my opinion, when a chunk is re-balanced, the servers already holding that chunk must be excluded from the list of possible destinations for it.
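
A minimal sketch of the exclusion check being proposed, using hypothetical types rather than the actual LayoutManager API, just to make the idea concrete:

    // Illustrative only: hypothetical types, not the real QFS meta server code.
    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <set>
    #include <vector>

    struct ChunkServer {
        int               id;
        std::set<int64_t> hostedChunks; // chunk ids currently stored on this server
        bool Hosts(int64_t chunkId) const { return hostedChunks.count(chunkId) != 0; }
    };

    // Return re-balance destination candidates that do not already hold the chunk,
    // so moving a replica there can never collapse two replicas into one.
    std::vector<ChunkServer*> RebalanceCandidates(
        const std::vector<ChunkServer*>& allServers, int64_t chunkId)
    {
        std::vector<ChunkServer*> candidates;
        std::copy_if(allServers.begin(), allServers.end(),
                     std::back_inserter(candidates),
                     [chunkId](const ChunkServer* s) { return !s->Hosts(chunkId); });
        return candidates;
    }

Under such a rule, a chunk whose only other replica already sits on the candidate destination simply has no valid re-balance target, so the move is skipped instead of collapsing two replicas into one.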

shtephan commented 4 years ago

A similar issue happens with evacuation. We evacuated our entire cluster to another one that temporarily joined it in order to upgrade disks, and in this move we lost several hundred files, just from the move itself. Maybe someone can have a look at the algorithm that selects the target server when a chunk is evacuated.

botsbreeder commented 4 years ago

I will add some more details to the issue above: we were evacuating around a hundred disks, and chunks were being re-replicated. The problem is that after replicating a chunk, the metaserver decided which replica to drop based on free space. The disks we were evacuating kept gaining free space, so this is what was happening: copy a chunk from an evacuating disk to a new location -> delete the chunk from the new location (based on the free-space criterion). The files were RS 6,3 encoded, which means replication is 1, so once such a chunk is deleted there is no copy left; the only way to restore it is RS recovery, but we were losing chunks faster than recovery could keep up. So within a short period we were left with RS-encoded files full of "holes": about 20% of chunks were lost, and in many places we lost more than 3 chunks in a 9-chunk group, so RS was unable to recover them.
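
To make that failure mode concrete, here is a small, self-contained sketch (hypothetical names, not the QFS code) of a replica-pruning policy driven purely by space utilization, and why it discards the wrong replica while a disk is being drained:

    // Illustrative sketch with hypothetical names, not the QFS code: a
    // replica-pruning policy that always discards the replica on the most
    // utilized server. While a disk is being drained it looks emptier and
    // emptier, so its replica is kept and the freshly copied one is deleted.
    #include <iostream>
    #include <vector>

    struct Replica {
        const char* server;
        double      spaceUtilization; // 0.0 (empty) .. 1.0 (full)
    };

    // Pick the replica to drop based purely on free space (assumes a non-empty list).
    const Replica& PickReplicaToDiscard(const std::vector<Replica>& replicas) {
        const Replica* worst = &replicas.front();
        for (const Replica& r : replicas) {
            if (r.spaceUtilization > worst->spaceUtilization) {
                worst = &r;
            }
        }
        return *worst;
    }

    int main() {
        // The evacuating disk has been mostly drained, so it looks nearly empty;
        // the new destination is an ordinary, fairly full disk.
        const std::vector<Replica> replicas = {
            { "evacuating-server", 0.10 },
            { "new-destination",   0.90 },
        };
        std::cout << "discarding replica on "
                  << PickReplicaToDiscard(replicas).server << "\n";
        // Prints "new-destination": for an RS-encoded chunk (replication 1) the
        // only surviving copy now sits on the disk that is about to be emptied.
        return 0;
    }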

mikeov commented 4 years ago

In general it might not be trivial to interpret trace messages and draw the correct conclusions. If I remember right, there are multiple reasons for a chunk deletion to be issued and for the corresponding trace messages to appear in the meta server debug trace log.

For example, chunk replica deletion could be issued due to a replication or recovery failure. In such a case, chunk replica deletion can be issued in order to ensure that a possibly partial replica gets deleted.

I'd recommend starting by inspecting the chunk server log, looking for “media / disk” IO failures. Inspecting the chunk and meta server counters available in the meta server web UI might also help to diagnose the problem.

I do not recall encountering a problem similar to the one described here in the last few years on any of the actively used file systems with a few petabytes per day of read and write load (append with replication 2 and RS 6+3), or in end-to-end testing with and without failure injection.

It is possible that the problem is due to the specifics of the system configuration or use pattern.

For example, the file system presently has no continuous background verification mechanism that checks all chunk replicas. The existing “chunk directory” / “disk” health check only involves periodically creating, writing, reading back, and deleting a small file. It is conceivable that the lack of such a mechanism might manifest itself as noticeable replication / recovery failures due to latent, undetected media read failures.
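
As an aside, here is a rough sketch of that kind of “create, write, read back, delete a small file” health check (hypothetical code, not the actual QFS implementation); note that it only exercises a freshly written probe file, so latent read errors in existing chunk replicas would go undetected:

    // Rough sketch of a chunk-directory probe (hypothetical, not QFS code):
    // create, write, read back, verify, and delete a small file.
    #include <cstdio>
    #include <cstring>
    #include <string>

    bool ProbeChunkDirectory(const std::string& chunkDir) {
        const std::string path = chunkDir + "/.healthcheck.tmp";
        const char payload[] = "chunkdir-probe";
        char readBack[sizeof(payload)] = {};

        std::FILE* f = std::fopen(path.c_str(), "wb");
        if (!f || std::fwrite(payload, 1, sizeof(payload), f) != sizeof(payload)) {
            if (f) std::fclose(f);
            return false;                  // create / write failed
        }
        std::fclose(f);

        f = std::fopen(path.c_str(), "rb");
        const bool ok = f &&
            std::fread(readBack, 1, sizeof(readBack), f) == sizeof(payload) &&
            std::memcmp(payload, readBack, sizeof(payload)) == 0;
        if (f) std::fclose(f);

        std::remove(path.c_str());         // delete the probe file
        return ok;                         // false => mark the directory unhealthy
    }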

Another possibility that comes to mind is that the replication / recovery parameters do not match the HW configuration's capabilities / IO bandwidth, resulting in IO failures / timeouts. Though in such a case, after replication / recovery activity stops, recent enough code should be able to re-discover the replicas that became unavailable due to IO load.

If the problem is due to a meta server replica / layout management bug, then perhaps the ideal way would be to try to reproduce the problem in a simple / small (3 chunk server nodes, for example) / controlled test setup. Existing test scripts can be used to create such a setup, for example src/test-scripts/run_endurance_mc.sh.

avodaniel commented 4 years ago

I created a separate repository based on 2.2.1 (https://github.com/hagrid-the-developer/qfs/) that tries to simulate the issue and also to fix it (though probably only partially).

It contains a Dockerfile that runs the build and part of the tests. It runs recoverytest.sh for 3 chunk servers, not 2. Finally it runs test-lost-chunks.sh, which:

- Creates 25 files.
- Runs evacuation of 127.0.0.1:20400.
- Waits some time (100 s).
- Prints the files in the chunk directories.

I tried to run docker build . -t qfstest:v2.2.1 with the commit https://github.com/hagrid-the-developer/qfs/commit/3528d7e81dcdae7d922b69c72754000d5549e77c (which tries to fix the issue) disabled. I was investigating chunk 131217 from file abc-010.xyz. The related logs are in the subdirectory failure (https://github.com/hagrid-the-developer/qfs/tree/fix-rebalancing-issue/failure); from failure/metaserver-drf.err:

      07-29-2020 17:51:39.752 INFO - (LayoutManager.cc:11084) starting re-replication: chunk: 131217 from: 127.0.0.1 20400 to: 127.0.0.1 20401 reason: evacuation
      ...
      07-29-2020 17:51:39.774 INFO - (LayoutManager.cc:11920) replication done: chunk: 131217 version: 1 status: 0 server: 127.0.0.1 20401 OK replications in flight: 9
      ...
      07-29-2020 17:51:39.774 DEBUG - (LayoutManager.cc:11503) re-replicate: chunk: <32,131217> version: 1 offset: 0 eof: 10485760 replicas: 2 retiring: 0 target: 1 rlease: 0 hibernated: 0 needed: -1
      07-29-2020 17:51:39.774 INFO - (LayoutManager.cc:12470) <32,131217> excludes: srv:  other: 3 all: 3 rack: other: 2 all: 2 keeping: 127.0.0.1 20400 20400 0.983188 discarding: 127.0.0.1 20401 20401 0.98691
      ...
      07-29-2020 17:51:39.774 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20400 chunk: 131217 version: 1 removed: 1
      ...
      07-29-2020 17:51:39.775 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20401 chunk: 131217 version: 1 removed: 1

It seems that the chunk was evacuated, but after the evacuation finished and the chunk was removed from the evacuated server, LayoutManager::CanReplicateChunkNow() was called and decided that the chunk had too many replicas and that the replica on the new destination should be removed, because the evacuated server had a lot of free space. Unfortunately, there is a short time window in which it is difficult to distinguish an evacuated server from a normal server. That's why, in commit https://github.com/hagrid-the-developer/qfs/commit/3528d7e81dcdae7d922b69c72754000d5549e77c, I tried to introduce another set of chunks: those that should be or are being evacuated but haven't been removed yet.
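
A minimal sketch of that idea (a separate set of chunks that are being evacuated but not yet removed), with hypothetical names rather than the actual LayoutManager code:

    // Illustrative sketch with hypothetical names, not the actual LayoutManager code.
    #include <cstdint>
    #include <unordered_set>

    class EvacuationTracker {
    public:
        void EvacuationStarted(int64_t chunkId)  { mInFlight.insert(chunkId); }
        // Called once the replica on the evacuated server has really been removed.
        void EvacuationFinished(int64_t chunkId) { mInFlight.erase(chunkId); }

        // Replica-count enforcement should treat an in-flight chunk as having one
        // replica fewer than observed, because the source copy is about to go away;
        // otherwise it may delete the freshly written destination replica instead.
        int EffectiveReplicaCount(int64_t chunkId, int observedReplicas) const {
            return mInFlight.count(chunkId) != 0 ? observedReplicas - 1
                                                 : observedReplicas;
        }

    private:
        std::unordered_set<int64_t> mInFlight;
    };

With such a set, the "too many replicas" decision in the short window after replication completes would not count the replica on the evacuating server, so the destination copy would be kept.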

The issue doesn't happen every time; on my laptop running macOS it happened about 50% of the time.

mikeov commented 4 years ago

Thank you for your report.

Here is a likely fix: https://github.com/quantcast/qfs/commit/1b5a3f5fb3bf5c490344de9e11180c7ead35478f

avodaniel commented 4 years ago

@mikeov I tried to apply the change https://github.com/quantcast/qfs/commit/1b5a3f5fb3bf5c490344de9e11180c7ead35478f to my branch (https://github.com/hagrid-the-developer/qfs/commit/b3963db37a18eed874bed4a9c5e1caa861946bbe), and it seems that the issue is still present. It also seems that wait is always 0. The related logs from one test run are in https://github.com/hagrid-the-developer/qfs/tree/fix-rebalancing-issue/failure.002; the important part of metaserver-drf.err:

    08-05-2020 20:26:43.563 INFO - (LayoutManager.cc:11084) starting re-replication: chunk: 131217 from: 127.0.0.1 20400 to: 127.0.0.1 20402 reason: evacuation
    ...
    08-05-2020 20:26:43.592 DEBUG - (ChunkServer.cc:2199) 127.0.0.1 20402 cs-reply: -seq: 1805093697390353841 log: 0 0 3891 status: 0  replicate chunk: 131217 version: 1 file: 32 fileSize: -1 path:  recovStripes: 0 seq: 1805093697390353841 from: 127.0.0.1 20400 to: 127.0.0.1 20402
    ...
    08-05-2020 20:26:43.593 INFO - (LayoutManager.cc:11936) replication done: chunk: 131217 version: 1 status: 0 server: 127.0.0.1 20402 OK replications in flight: 9
    08-05-2020 20:26:43.593 DEBUG - (LayoutManager.cc:11517) re-replicate: chunk: <32,131217> version: 1 offset: 0 eof: 10485760 replicas: 2 retiring: 0 target: 1 rlease: 0 wait: 0 hibernated: 0 needed: -1
    08-05-2020 20:26:43.593 INFO - (LayoutManager.cc:12486) <32,131217> excludes: srv:  other: 3 all: 3 rack: other: 3 all: 3 keeping: 127.0.0.1 20400 20400 0.889967 discarding: 127.0.0.1 20402 20402 0.894062
    ...
    08-05-2020 20:26:43.594 DEBUG - (LayoutManager.cc:3741) CLIF done: status: 0  down: 0 log-chunk-in-flight: 127.0.0.1 20400 logseq: 0 0 3954 type: STL chunk: -1 version: 0 remove: 1 chunk-stale-notify: sseq: -1 size: 1 ids: 131217
    08-05-2020 20:26:43.594 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20400 chunk: 131217 version: 1 removed: 1
    ...
    08-05-2020 20:26:43.595 DEBUG - (LayoutManager.cc:3741) CLIF done: status: 0  down: 0 log-chunk-in-flight: 127.0.0.1 20402 logseq: 0 0 3957 type: DEL chunk: 131217 version: 0 remove: 1 meta-chunk-delete: chunk: 131217 version: 0 staleId: 0
    08-05-2020 20:26:43.595 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20402 chunk: 131217 version: 1 removed: 1

mikeov commented 4 years ago

Thank you for your report. Here is a likely fix: https://github.com/quantcast/qfs/commit/432924b4c842eed8cadf5514505dae7b68673245

avodaniel commented 4 years ago

Thanks @mikeov, it seems it really helped. We will do more tests on a larger cluster.

mikeov commented 4 years ago

Great. Thank you for the update.

avodaniel commented 4 years ago

It seems that chunks really don't get lost now. But sometimes evacuation is really slow. Usually all chunks are moved quickly, and then it takes a very long time to transfer the last chunk: there are long delays between copies of its pieces (like 5 minutes). And sometimes it takes a long time until the evacuation file is changed to evacuation.done. Is this expected? It is inconvenient but acceptable. We use QFS from master up to and including commit cb3e89811413add728acb809bd6918fe84687c4f.