lizardfs / lizardfs

LizardFS is an Open Source Distributed File System licensed under GPLv3.
http://lizardfs.com
GNU General Public License v3.0

Data loss while restarting a chunkserver during a rebalance/replicate #297

Open eigood opened 9 years ago

eigood commented 9 years ago

So, this is a bit long. We've had this bug happen 3 times: once long ago, then twice within a week.

I had originally noticed the problem when restarting a chunkserver: chunks were lost. At the time I wasn't fully aware of what was going on. I was annoyed, but not too worried, as the missing data was in throw-away files, so it was no big deal.

Then, recently, we came to the realization that combining 4 drives into a raid0 was a poor decision on our part. It increased single-thread performance, but it meant that when a drive died (which they always do), we would have to replicate much more data to become safe again (we use goal=2). So, I had gone about splitting each raid0 into 4 separate directories for mfshdd.cfg.

The general procedure for this is to mark the directory for removal (prefix it with "*" in mfshdd.cfg), let moosefs duplicate all the chunks, then remove the drive. However, this replication is *slow*, and hurts performance. The writes are placed randomly onto the other nodes, and all these writes starve the cluster. So, I decided to do a plain rsync of the chunk folders, spreading them out manually onto spare free space around the cluster. This seemed to work well.

Since no single drive had sufficient space to hold the mount point I was trying to remove, I had to split the chunk folders up: [0-7]* went to one machine, and [8-F]* to another. After the initial rsync, while the chunkserver was still active, I ran it again (much less was copied this time). Then, on the other nodes now holding the duplicate copies, I added those folders to their mfshdd.cfg, but marked them for removal as I did so. This would keep new data from being written to those locations (in theory). I also marked the original location for removal, then waited until no chunk was down to a single copy. Once that point was reached, I stopped the original chunkserver. While that original chunkserver was stopped, moosefs (we aren't yet upgraded to lizardfs) continued to rebalance those half-copied, to-be-removed locations.
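
Concretely, the manual split looked roughly like the sketch below; the host names (nodeA, nodeB) and paths are placeholders rather than our real layout:

```bash
# Placeholder hosts/paths; a rough sketch of the manual chunk split described above.
SRC=/mnt/raid0/mfschunks            # directory being evacuated on the original chunkserver

# First pass while the original chunkserver is still running
rsync -a "$SRC"/[0-7]* nodeA:/mnt/spare/mfschunks/
rsync -a "$SRC"/[8-F]* nodeB:/mnt/spare/mfschunks/

# Second, much smaller pass to pick up chunks that changed during the first copy
rsync -a "$SRC"/[0-7]* nodeA:/mnt/spare/mfschunks/
rsync -a "$SRC"/[8-F]* nodeB:/mnt/spare/mfschunks/

# On nodeA and nodeB: add the side-loaded directory to mfshdd.cfg already marked
# for removal with "*" so no new chunks get written there:
#   */mnt/spare/mfschunks
# On the original server: mark the old directory for removal the same way:
#   */mnt/raid0/mfschunks
# Then wait until no chunk is reported with only a single copy before stopping
# the original chunkserver.
```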

As I went about splitting the raid0, I also needed to repartition each of the 4 drives. That's standard practice, and I had no problems with it. I failed one drive of the raid5 OS array, killed the raid0, repartitioned that 1 drive, created a raid1 with 3 missing drives, did a pvmove, then added one mount point into mfshdd.cfg (what was on sdd). Moosefs started to move data onto this drive.
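
The per-drive conversion was roughly the following; the device, array and volume group names below are placeholders and the partitioning details are omitted:

```bash
# Placeholder device/array/VG names; a rough sketch of converting one drive (sdd).
mdadm /dev/md0 --fail /dev/sdd1 --remove /dev/sdd1   # drop sdd from the RAID5 OS array
mdadm --stop /dev/md1                                # stop the old RAID0 data array
# ... repartition sdd: a small OS partition plus a large data partition ...
mdadm --create /dev/md2 --level=1 --raid-devices=4 \
      /dev/sdd1 missing missing missing              # degraded RAID1; the other drives join later
pvcreate /dev/md2
vgextend vg_os /dev/md2
pvmove /dev/md0 /dev/md2                             # migrate the OS LVM extents off the RAID5
mkfs.xfs /dev/sdd2                                   # filesystem type is arbitrary here
mount /dev/sdd2 /mnt/mfs/sdd
echo "/mnt/mfs/sdd" >> /etc/mfshdd.cfg               # hand the data partition to the chunkserver
mfschunkserver reload                                # or restart, so it picks up the new directory
```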

I then did the same procedure on the remaining 3 drives: repartition, add to the raid1, add all 3 mount points to mfshdd.cfg. I then restarted the chunkserver on this original machine. At this point, I lost data. I didn't know what the cause was at the time. mfsfilerepair did not fully help; in some cases it erased the block, in others it found a different version. This installation primarily runs a backuppc installation, which meant that we had a corrupted backup set.

I then proceeded to do the same to production, not knowing the full scope of what was to come.

Of course, we had data loss there as well. And, as luck would have it, the machines that ended up with corrupted filesystems (when a 64M chunk suddenly becomes all zeroes, bad things happen) required a restore from the backup system, but that was corrupted too, which prevented proper reads. Much hair pulling at this point.

During all this, my coworker thought that the problem might have stemmed from me manually copying and splitting those chunk folders around. Not knowing exactly, I took his suggestion to remove them from the other nodes (we are working on production at this point). Moosefs is rebalancing; some chunks have a single copy, most have 2, some have 3. I restart the chunkservers. More data loss.

More chunks are at 0 copies. I restart the primary chunkserver (the one I was trying to do work on). More data loss.

At this point, I begin not to trust moosefs at all. I investigate some of these chunks, and notice that they are truncated, not full size like they should be. This is also based on reading syslog. So, I think to myself, why would a chunk be truncated? And then it hits me. Maybe when a chunkserver is restarted while it is replicating and a chunk is coming in, the full data is not received from the network, and therefore not written to disk. Or, the full data has not yet been written to the remote node before the process is stopped.
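
For what it's worth, something like the command below (placeholder mount points) is enough to spot the suspiciously short chunk files; chunks belonging to small files are legitimately short, so size alone only flags candidates to inspect:

```bash
# List the smallest chunk files under the 00..FF sub-folders of each chunk directory.
# A full chunk is roughly 64 MiB of data plus a small header; chunks belonging to
# small files are legitimately shorter, so treat this as a list of candidates only.
find /mnt/mfs/*/??/ -name 'chunk_*.mfs' -printf '%s\t%p\n' | sort -n | head -n 20
```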

I attempted to read the lizardfs code (hoping I could see something that works better), and it appears that the worker responsible for copying the data does not have any incoming signals to tell it to abort the transfer (signals being a general term; I understand that pthreads are in use). And I don't see how such a restart during a replication would be communicated back to the master, to have it abort the replication transaction on all nodes involved.

If this "restart during replication" actually causes data loss, then that is a huge issue. Because under normal situations, disks will need to be marked for removal, and it's entirely possible that other unrelated nodes might experience an outage, or other maintenance could be occuring. Even if the goal was increased, I don't think that would help.

eigood commented 9 years ago

I plan on attempting to reproduce this problem, but it's been a matter of finding the time. Still recovering from the sting of this issue.

onlyjob commented 9 years ago

Could it be related to #252? Besides, your description of the problem is very long and involves actions that normally should not be done (e.g. rsync-ing).

cloudweavers commented 9 years ago

I would say first of all that nowadays there is such a big difference between MooseFS and LizardFS that it is extremely difficult to say whether this applies to Lizard at all, especially after the substantial rewrite that ended a few months ago. In general, the fact that there is no "in-between" signal to the worker is not relevant, because the write either completes with a correct CRC result or it does not, and in that case another chunkserver gets elected to perform the write. If the write completes (with the version updated and all that) then there is no data loss by definition, and if the write does not complete there is no version update at all, so no data loss either.

What is unclear in the description of the steps is how it's possible for your cluster to end up with different versions at all. You mentioned that you used mfsfilerepair, and that it erased blocks; but that means you had lost chunks as well, which means that either the rsync did not complete, or that it wrote something clearly inconsistent. Maybe an in-place rsync that failed? Then you mentioned further filesystem damage, which means those chunks should have been marked as missing, and at that point any mfsfilerepair would clearly destroy everything.

All in all, I would not vouch for MooseFS (I switched to Lizard several months ago), but I would say that in our experience with 50+ clusters, some subjected to nearly pathological torture (sometimes by the users themselves, sometimes by natural disasters), we have basically never lost a single byte. When we presented our storage system at a London expo we even felt sure enough to let visitors pull the power cable on random nodes to show that it kept working properly.

eigood commented 9 years ago

Except that shutting down a chunkserver, copying the entirety of the filesystem to a new location, then starting up the chunkserver, should work, right?

eigood commented 9 years ago

We have planned on switching to lizardfs, but it always takes time. Maybe now is that time. The reason I asked here, instead of on moosefs, is because I like the response times I get from this upstream; you guys rock.

cloudweavers commented 9 years ago

Yes, we have done it several times (especially during complex fault compensation it can be an effective approach).

blink69 commented 9 years ago

@eigood Just to be sure: you have this problem on Moose, not on Lizard? If yes, please switch to Lizard and check it again :)

eigood commented 9 years ago

We are taking steps in that direction. New hardware. This problem will be one that we test out first.


4Dolio commented 9 years ago

I have had success doing what you describe, in our large production environment. But I do not recommend doing this if you do not understand how the chunk server service keeps track of its chunks. It is critical to understand that the chunk service does not recognize the addition or removal of "side-loaded" chunk files until it is restarted and re-scans its disks. Also be aware that the chunk server disks keep chunks in sub-folders named {00..FF}, and that the last 2 hex digits of a chunk's ID (the first hexadecimal field in its file name) determine which folder it is stored in. I always took care to make sure that files stayed inside the proper sub-folder.
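
For example, with a layout like the one sketched below (placeholder mount points), copying whole XX sub-folders rather than cherry-picking files keeps every chunk in the folder it belongs in:

```bash
# Placeholder mount points; the chunk tree looks like this:
#   /mnt/mfs/disk1/00/chunk_00000000000ABC00_00000001.mfs
#   /mnt/mfs/disk1/01/chunk_00000000000DEF01_00000002.mfs
#   ...
#   /mnt/mfs/disk1/FF/...
# Copying with trailing slashes preserves the 00..FF structure as-is:
rsync -a /mnt/mfs/disk1/ /mnt/mfs/disk2/
```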

I believe not restarting the chunk server service (to re-scan the disks) is where you went wrong with your process. Your source servers did not report their lost chunks, and your target servers did not report their gained chunks, because you did not restart the service so it could re-scan the disks. I believe that had you restarted your target chunk server service before you removed the files, your lost chunks would have returned.

We have a 24+21 disk JBOD with 2 controlling chunk servers. I have moved chunks between various file systems on those disks by hand numerous times. I could move ~24TB in ~12 hours with rsync, while natural re-balancing would take many weeks. Keep in mind that this is always done within the scope of a single chunk server; doing so between chunk servers would interfere with the goal values. You can do it between chunk servers, but you must be aware of the resulting changes in goals and copies. It is also technically possible to run 2 chunk server services on a single server by changing the default port on which the chunk server communicates (if, for example, you wanted one server to store 2 copies during such a manual replication process). Anyway...

My method for moving chunk data around on a single chunk server is roughly:

1. rsync from the source to the target disks (you can do this live).
2. Stop, or mark for removal, the source disks; restart/HUP the chunk server service to make it re-scan.
3. rsync again to pick up the smaller number of changes made during the original transfer.
4. Remove the source disks; restart/HUP the chunk server service.
5. Once you verify you have not lost anything, remove the source disk permanently, reformat, whatever, and repeat to move the chunks back.

Keep in mind that I do this with at least a goal of 2, just to be sure not to lose anything.
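
A sketch of those steps with placeholder paths (this assumes the "*" prefix in mfshdd.cfg marks a disk for removal and that a reload/HUP makes the service re-scan its disks; double-check both, and the config path, against your version):

```bash
SRC=/mnt/mfs/disk1      # disk being evacuated
TGT=/mnt/mfs/disk2      # disk on the same chunk server receiving the chunks

# 1. First pass while the chunk server is live
rsync -a "$SRC"/ "$TGT"/

# 2. Mark the source for removal and make the service re-scan its disks
sed -i "s|^$SRC\$|*$SRC|" /etc/mfshdd.cfg
echo "$TGT" >> /etc/mfshdd.cfg      # only if the target directory is not already listed
mfschunkserver reload               # or restart / SIGHUP

# 3. Second pass to pick up chunks written during the first copy
rsync -a "$SRC"/ "$TGT"/

# 4. Drop the source entry and re-scan again
sed -i "\|^\*$SRC\$|d" /etc/mfshdd.cfg
mfschunkserver reload

# 5. Only after verifying nothing is missing or under-goal, wipe or reformat the source disk
```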