jborg / attic

Deduplicating backup program

Corrupted my Repo #47

Closed: dragon2611 closed this issue 10 years ago

dragon2611 commented 10 years ago

I seem to have done something to corrupt both of my attic repos. It probably didn't help that at some point I was an idiot and deleted .cache

I tried upgrading from the version installed by pip to the latest version from Git and running attic check, but the result is the same:

attic: Error: /data/backup.attic is not a valid repository

I have another repo that does the same thing.

The backup.attic repo may have been in use at the time I had to hard-reboot the system after running into a BTRFS SMP bug in the Debian 3.2.0-4 kernel (I've since built the latest stable kernel from source and have yet to see a recurrence).

Anyway, have I completely screwed the repo, or am I just being an idiot and overlooking something simple?

jborg commented 10 years ago

The cache directory should automatically be recreated if missing, so it shouldn't matter that you deleted it.

What does "ls -l /data/backup.attic" output?

"attic: Error: /data/backup.attic is not a valid repository" indicates that Attic does not think that /data/backup.attic look like a repository at all. This check is currently implemented by making sure the repository path contains a file called "config" with at least the following contents:

```
[repository]
version = 1
id = xxxxxxxxxxx
```

So in your case I guess this file is either corrupted or missing. If so, you can try to replace it with one from a newly created repository. The "id" must be unique, so don't copy it from an existing repository.
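For example, something along these lines (an untested sketch; the paths are just the ones from this thread, and "attic init" will generate a fresh unique id):

```
# Keep a safety copy of the damaged repository first.
cp -a /data/backup.attic /data/backup.attic.bak

# Create a throwaway repository just to obtain a valid "config" file.
attic init /tmp/fresh.attic

# Replace the damaged/empty config with the freshly generated one.
cp /tmp/fresh.attic/config /data/backup.attic/config

# See if the repository is recognized again.
attic check /data/backup.attic
```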

After that you can give "attic check" another try. If the repository is recognized but some errors are found, you might try "attic check --repair", but please make a copy of the repository first if possible.

But if btrfs managed to corrupt the "config" file chances are that other more important files have also been corrupted...

dragon2611 commented 10 years ago

Ahh, I see what the problem is: the config file is present in both repos, but it's empty.

I'll try your suggestion on my smaller repo (iworks2.attic); I made a copy of it last night.

dragon2611 commented 10 years ago

```
iworks2.attic# attic check -v --progress --repair /data/iworks2.attic/
attic: Warning: 'check --repair' is an experimental feature that might result in data loss.
Type "Yes I am sure" if you understand this and want to continue.
Do you want to continue? Yes I am sure
Starting repository check...
Adding commit tag to segment 53812
Repository check complete, no problems found.
Starting archive consistency check...
Rebuilding missing manifest, this might take some time...
```

After some time (I wasn't watching it particularly closely) my SSH session dropped out and the box no longer responded on the network.

The only thing I can do is force a reboot by cycling its power outlet; not sure if this is a BTRFS bug or not. Annoyingly, the IPMI card is installed but doesn't have the associated NIC, so its IP KVM option doesn't work.

The box is a Supermicro X7SBI board with an E6320 CPU and 8GB of RAM, running Debian.

I have another machine with sufficient storage to hold this repo, but it runs Ubuntu. Sadly it's not running any form of RAID, but it will at least let me copy the data over and see if the check command can complete without killing the box itself.

It wouldn't be world-ending if I lost the data in the archives, as it's older backups that I would previously have destroyed by now; I was seeing if I could archive them using attic instead (mainly for the space savings from deduplication). The really important data is backed up to an offsite location using a different method anyway.

Edit: I thought the Ubuntu VM wasn't using BTRFS, but it turns out it is. Still, as it's a different kernel running on different hardware, I'll give it a go there anyway.

jborg commented 10 years ago

BTRFS itself shouldn't be an issue; that's what I use on my main laptop and it hasn't given me any problems. But that's just a plain single-drive filesystem with no RAID.

It's a bit strange and worrisome that the config file was damaged, since it's only written when the repository is initially created. After that it's only read (and used to acquire read/write flock() locks).

All data stored in the repository is appended to segment files in the repo/data/x/ folders. Each segment file is about 5MB in size. Both "Adding commit tag to segment 53812" and "Rebuilding missing manifest, this might take some time..." indicate that at least the most recent segment file is missing or corrupted.
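If you want to sanity-check that layout, something like this should do (illustrative only; I'm assuming the repo path from this thread, and with 10,000 segments per directory, segment 53812 would sit under data/5/):

```
ls /data/iworks2.attic/data/           # numbered subdirectories: 0, 1, 2, ...
ls /data/iworks2.attic/data/5 | tail   # the most recent ~5MB segment files
du -sh /data/iworks2.attic/data        # total size of all segments
```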

"Rebuilding missing manifest, this might take some time..." means the the main manifest (super block) of all archives is missing so Attic will read through the entire repository trying to find as many archive headers as possible. But if you didn't see any "Found archive X" it didn't find any archives before the server died.

Overall it appears that your filesystem is in pretty bad shape. So chances are that you'll trigger the same kernel bug by just copying the repository to your Ubuntu VM.

dragon2611 commented 10 years ago

Not sure what's triggering it; copying large amounts of data to/from the volume seems fine.

I've kicked off another scrub of the FS since it crashed; the one I ran prior to the crash didn't find any problems.

I'm wondering if I've got a hardware issue on that box. There wasn't anything useful in the kernel logs; I'll check the IPMI logs etc. to make sure it's not something like a thermal issue caused by the high CPU/IO load.

jborg commented 10 years ago

Attic does use a fair bit of RAM, but unless your repository is larger than 2TB, 8GB of RAM should be plenty. And hopefully the OOM handler should take care of that rather than kill the entire server.
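As a very rough back-of-the-envelope (assuming average chunks somewhere around 64kB, so take the numbers with a grain of salt): a 2TB repository would be on the order of 30 million chunks, and at a few dozen bytes per chunk index entry that's only a couple of GB of index.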

dragon2611 commented 10 years ago

I've run that system out of memory before (when playing with ZFS on the same hardware), so I doubt it's an OOM issue. I'll have a look in a moment and see if the copy completed. Also, since the other machine is a VM, I can access its console, so if it does cause a kernel panic on the second machine it might be easier to see what's dying.

dragon2611 commented 10 years ago

The scrub finished on the filesystem and didn't find any errors.

The repository check I kicked off earlier on the VM is still running.

dragon2611 commented 10 years ago

Well, I'm not sure why neither of my systems seems to like attic. I'm going to remove BTRFS from the equation and try moving the repo to a ZFS-based filesystem instead.

I'm not going to enable ZFS deduplication, though; I learnt the hard way that it's generally a bad idea unless you have a lot of RAM. Then again, if I can get attic working I won't need it anyway ;-)

jborg commented 10 years ago

OK, so what happened? Did you get a kernel panic, or did Attic crash?

If we think this has something to do with Attic, and not just the OS being unstable, I really need some more details...

dragon2611 commented 10 years ago

On the Debian machine with the stock kernel, one of the CPU cores would get stuck in iowait and performance would tank to the point where the system was almost unusable. With 3.13.4, I think it was, the system would just fall over, but I wasn't able to get anything useful from the logs to determine why.

On the Ubuntu machine, disk I/O dropped as low as 1MB/s (possibly the same BTRFS SMP bug). I gave up waiting for the check after 30+ hours of no progress.

On the Debian machine the BTRFS volume was a mirror across two drives, mounted with compression forced on; the Ubuntu machine was a single drive, but I think it was also mounted with forced compression.

I don't think it's an Attic bug; another deduplicating archiver I tried (DDAR) seemed to produce similar results.

I'm working on changing the underlying FS the repo is stored on, then I'll retry.

jborg commented 10 years ago

FYI, I've just pushed a fix for a possible infinite loop during "check --repair" while repairing a corrupted segment.

If the last line of output from "check --repair" said "attempting to recover some/filename", you might have hit this bug. It shows up as 100% CPU usage with no disk access.
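If you suspect a run has wedged like that, standard tools are enough to check; nothing attic-specific here:

```
# CPU pegged near 100% for the attic process...
top -p "$(pgrep -d, -f 'attic check')"

# ...while the disks sit essentially idle would match this bug.
iostat -x 1
```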

dragon2611 commented 10 years ago

Managed to crash the box again. At least I think it's crashing; it's hard to tell, but it completely falls off the network.

This time I was accessing the repo via NFS, so I'm not sure if it's related to that.

Now that I've removed BTRFS from the system I'm going to try once more, but with the data local again.

So, to recap, I've tried the following:

- Repo on local BTRFS filesystem: crash
- Repo on remote ext4 filesystem mounted via NFS: crash
- Repo on local ZFS filesystem (no dedupe): not tested yet
- Repo on local BTRFS filesystem on an Ubuntu VM on another box: doesn't crash, but read speed plummets to 1MB/s, possibly the same SMP bug seen with BTRFS on the Debian 7 kernel (3.2.0-4-amd64)

I'm not even that worried about the data; I just want to know what's knocking the thing over. I might either have to see if I can get console redirection working on the serial port so I can see what happens when it "dies", or get someone to attach a KVM.

Annoyingly, the IPMI card can do KVM, but only when using its dedicated NIC, which I don't have; it's using the other onboard Ethernet port at the moment.

I didn't notice anything in the IPMI logs to suggest overheating.

Just as a note, this "small" repo is 266GB, so each time I run the repair it takes a while; hence I'm not watching it the whole time to see when it crashes.

Out of interest, is there any chance the problems could be down to the default size of the segments? 266GB at 5MB per segment = a lot of small files.

Edit:

The box became unresponsive momentarily just now and I thought for a moment it was going to fall over. It looks like attic is eating all the RAM and then some.

[screenshot: memory usage, with attic consuming nearly all available RAM]

jborg commented 10 years ago

On 2014-02-25 22:48, dragon2611 wrote:

> Managed to crash the box again. At least I think it's crashing; it's hard to tell, but it completely falls off the network.

It might be worth running memtest86 on the server; leaving it on for a few hours might find and/or trigger something...

> Out of interest, is there any chance the problems could be down to the default size of the segments? 266GB at 5MB per segment = a lot of small files.

I don't think so. Attic makes sure there are at most 10,000 files in each $REPO/data/X/ directory.
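For scale: 266GB at ~5MB per segment is roughly 54,000 segment files, which lines up with the "segment 53812" in your check output, and with the 10,000-files-per-directory cap they're spread over only about six data/X/ directories. So the sheer number of segment files shouldn't be remarkable.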

/ Jonas

dragon2611 commented 10 years ago

Tried it on a VM (with only 2GB of RAM, mind) and Attic ran out of memory and was killed.

I suspect that, due to the size of the repo, successfully repairing it would need a machine with a lot of RAM, which unfortunately I don't have at the moment.

[screenshot: attic process killed after running out of memory]

Edit:

Tried upgrading the VM to 4GB and ran into the same problem. I can't increase it further, as I don't have sufficient spare resources.

jborg commented 10 years ago

On 2014-02-27 10:37, dragon2611 wrote:

> Tried it on a VM (with only 2GB of RAM, mind) and Attic ran out of memory and was killed.

That's weird. 2GB (and even 1GB) should be enough to repair a repository of that size.

What's even more strange is that at the "Rebuilding missing manifest" stage there shouldn't even be any more memory being allocated.

The only thing I can think of right now is that, while rebuilding the manifest, some really unfortunate non-msgpack data was fed into msgpack, causing it to allocate a huge amount of RAM and trigger the OOM killer.

I'll try to figure out a more robust way of rebuilding the manifest.

I'll let you know when I have something for you to test, if you're still interested.

/ Jonas

dragon2611 commented 10 years ago

It could well be whatever damaged the repo in the first place. Whilst it would be nice to repair it, it's not world-ending if I can't.

Is there anything I can do from my side to generate some debug data that might make it easier to pin down what's going on?

jborg commented 10 years ago

I've just pushed a new change that should make the manifest rebuild code more robust and hopefully also fix the out of memory issue you are seeing.

dragon2611 commented 10 years ago

Updated and running now.

It will probably take a while due to the size of the repository.

dragon2611 commented 10 years ago

Sorry, hit the wrong button.

dragon2611 commented 10 years ago

Just to let you know the outcome: it looks like the repository is indeed corrupted, as the repair reported several missing file chunks, so whatever happened to it really screwed it up.

However, the good news is that your changes seem to have fixed the problem I was having where attic would eat all of the available memory and die.

jborg commented 10 years ago

Just to be clear: the repair was successful and the repository is now in working order, but some files are either missing or partially filled with zeros, as reported by the repair process?

If so, I think we can conclude that the repository was initially damaged by a btrfs-related crash. Thanks for providing an excellent test case for "attic check --repair"; your feedback helped me find and fix some bugs.

Thanks!

dragon2611 commented 10 years ago

Since the repo contained archive files that would have been unusable given the damage, I didn't bother trying to restore anything further; however, as far as I can tell the repair process did work properly.

Attic is looking like a very promising piece of software, and once I fix the underlying problem with my filesystem/hardware I will probably use it to back up virtual machine images.