grke / burp

burp - backup and restore program
http://burp.grke.net

Checksum mismatch in block #432

Closed githubcdr closed 8 years ago

githubcdr commented 8 years ago

Hi,

First of all, thanks for burp. It's a great tool and we would like to use it in production. In a test setup, however, we have seen "Checksum mismatch" warnings appear at seemingly random times. Restoring the affected files gives corrupt content, and sometimes it looks as if files overlap in the restore.

The setup runs on various systems, mostly RHES 6 and 7; the burp server runs on Arch Linux. We run burp 2.0.38 with protocol 2.

# burp -av
2016-05-30 14:51:08: burp[17066] WARNING: Checksum mismatch in block for f:/home/triin/tmp/burp/vss_strip:0000/0000/0013/07A3
2016-05-30 14:51:08: burp[17066] WARNING: Checksum mismatch in block for f:/home/triin/tmp/burp/vss_strip:0000/0000/0013/07A4

After deleting everything in /usr/local/var/burp/spool/, the backup and verify succeed. Manually deleting a backup via the client does not solve the issue, and neither does running a new backup job.

This is a strange issue. At first I thought the backup password had changed during a backup, but I couldn't reproduce that. Perhaps it's related to the deduplication process or to differing versions of librsync? Or to adding new clients? I really have no clue where to look further.

Has anyone else run into this issue? How can I debug it further?

Thanks!

grke commented 8 years ago

Hello,

Protocol 2 is not ready for production yet.

Nobody else has reported this. However, I have seen something similar on my test server, and I am trying to figure out what the problem is.

Protocol 2 doesn't use librsync, so it's nothing to do with that.

githubcdr commented 8 years ago

Hi,

Thanks for the reply. I'm willing to beta test this, so no rush. The impact is really big, however: the only way to resolve the issue was to remove all backups from the server.

I back up 10 servers and will keep an eye on this issue; for now everything is running normally.

grke commented 8 years ago

Hello,

Yes, I am aware that it is really big.

I spent all of today debugging and fixing two problems that I could reproduce. I have pushed the results to master now. I don't know if they fix everything. I don't know if they are the same problem as what you have. I am putting the fix onto my test server to see what happens.

With the latest master, you will need to delete all your protocol 2 backups again. That is the nature of these bugs, and a reason why protocol 2 isn't ready for production.

githubcdr commented 8 years ago

Hi,

Really appreciated! I will test and let you know.

githubcdr commented 8 years ago

Hi @grke,

After testing some more, I discovered that this issue seems to appear when adding a new client. It has not appeared on clients that were already running.

Two things:

  1. The verify operation for a NEW backup on a NEW client shows checksum errors for files that also exist on other burp clients (common files like /root/.zprezto and /etc/udev/hwdb.bin).
  2. Verify reports failures for files that can be restored correctly. Perhaps this is related to xattrs or ACLs; I don't know.

Removing the backup via the client and starting over does not help.

# burp -av
2016-06-01 10:58:18: burp[744] WARNING: Checksum mismatch in block for f:/root/.zprezto/modules/syntax-highlighting/external/images/preview.png:0000/0000/00A1/04D0

# burp -a r -d /tmp/restore -r /root

# diff /tmp/restore/root/.zprezto/modules/syntax-highlighting/external/images/preview.png /root/.zprezto/modules/syntax-highlighting/external/images/preview.png && echo muchwin

muchwin

I don't mind debugging this issue, here to help :)

grke commented 8 years ago

Is this information from after you put the latest master on the server and all clients and deleted the entire storage area? Or are you still using the storage area from before?

grke commented 8 years ago

Ah, never mind. I think I am getting the same thing on my server.

grke commented 8 years ago

I think it is nothing to do with acls or xattrs. More likely that I have messed up the locking of data files and one client ends up overwriting the other. This should be simple enough to prove and test for, once I get a spare moment.
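
The failure mode suspected here, where two clients end up writing to the same data file, can be illustrated with a minimal sketch. This is not burp's code: the `claim_data_file` helper and the file name are hypothetical, and the sketch uses atomic exclusive creation (`O_CREAT | O_EXCL`) rather than whatever locking scheme burp actually implements. It only shows that a second writer should get a refusal instead of silently overwriting.

```python
import os
import tempfile

def claim_data_file(directory, name):
    """Atomically create `name` in `directory` (hypothetical helper).

    Returns an open file descriptor, or None if another writer
    already claimed the same file.
    """
    path = os.path.join(directory, name)
    try:
        # O_EXCL makes creation fail if the file already exists,
        # so two writers can never both own the same data file.
        return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return None

def demo():
    """Simulate two clients competing for the same data file name."""
    with tempfile.TemporaryDirectory() as d:
        first = claim_data_file(d, "0145")   # client A wins
        second = claim_data_file(d, "0145")  # client B is refused
        if first is not None:
            os.write(first, b"blocks from client A")
            os.close(first)
        return first is not None, second is None
```

In this sketch the refused writer would have to pick a different data file name instead of reusing the claimed one.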

githubcdr commented 8 years ago

Confirmed: latest master on the server and latest builds on the clients.

It seems that this is more likely to happen on new clients, since running clients never had this issue.

grke commented 8 years ago

I have just confirmed that it is what I thought it is.

I just have to fix it now. Should be easy. :)

grke commented 8 years ago

Something else that I need to do is to make the server verify the checksums as it gets them from the client. Otherwise, a malicious, or buggy client can mess up your data files.
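
A minimal sketch of that kind of server-side check is shown below. It is an illustration only: the thread does not say which hash burp uses for block checksums, so `hashlib.md5` is a stand-in, and `accept_block` and `store_blocks` are hypothetical names.

```python
import hashlib

def accept_block(data: bytes, claimed_hex: str) -> bool:
    """Return True only if the server's own digest matches the client's claim.

    MD5 is an assumption made for this sketch.
    """
    return hashlib.md5(data).hexdigest() == claimed_hex.lower()

def store_blocks(blocks):
    """Store (data, claimed_hex) pairs, aborting at the first mismatch."""
    stored = []
    for data, claimed in blocks:
        if not accept_block(data, claimed):
            # A buggy or malicious client must not be able to poison
            # data that other clients deduplicate against.
            raise ValueError("checksum mismatch, refusing block")
        stored.append(data)
    return stored
```

Rejecting the block before it reaches shared storage matters more with deduplication, because one bad block can otherwise be referenced by backups from several clients.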

grke commented 8 years ago

I have just pushed a fix and test for the locking problem. My server looks much more happy. If you already upgraded your clients, you only need to upgrade your server. And wipe out your storage directory again, too.

githubcdr commented 8 years ago

Testing right now...

githubcdr commented 8 years ago

Hi @grke,

Really strange issue on a Raspberry Pi 3 running Arch Linux: I am trying to back up /usr/bin and /usr/sbin.

Start time: 2016-06-02 11:06:51
  End time: 2016-06-02 11:07:51
Time taken: 01:00
                         New   Changed Unchanged   Deleted     Total |  Scanned
                   ------------------------------------------------------------
             Files:        0         0         0         0         0 |     1147
         Meta data:        0         0         0         0         0 |        4
       Directories:        0         0         0         0         0 |        4
        Hard links:        0         0         0         0         0 |       32
        Soft links:        0         0         0         0         0 |      188
            Blocks:    25674         0     52511         0     78185 |        0
       Grand total:    25674         0     52511         0     78185 |     1375
                   ------------------------------------------------------------

             Messages:             0
             Warnings:             0

      Bytes estimated:     173443273 (165.41 MB)
      Bytes in backup:             0
       Bytes received:             0
           Bytes sent:     173427550 (165.39 MB)
--------------------------------------------------------------------------------
2016-06-02 11:07:51: burp[8610] End backup
2016-06-02 11:07:51: burp[8610] backup finished ok

Validation, however, fails:

lfBfBfB2016-06-02 11:07:59: burp[8625] WARNING: Checksum mismatch in block for f:/usr/bin/[:0000/0000/0145/0003
BB2016-06-02 11:07:59: burp[8625] WARNING: Checksum mismatch in block for f:/usr/bin/[:0000/0000/0145/0006
BfBBfBBBBBBLf2016-06-02 11:07:59: burp[8625] WARNING: Checksum mismatch in block for f:/usr/bin/addftinfo:0000/0000/0145/0010
BBfBfB2016-06-02 11:07:59: burp[8625] WARNING: Checksum mismatch in block for f:/usr/bin/addpart:0000/0000/0145/0015
BfBBBBBfBBBBBBBBBBBBBBBBBBBBBBBBf2016-06-02 11:07:59: burp[8625] WARNING: Checksum mismatch in block for f:/usr/bin/agetty:0000/0000/0145/0034
B2016-06-02 11:07:59: burp[8625] WARNING: Checksum mismatch in block for f:/usr/bin/agetty:0000/0000/0145/0036
2016-06-02 11:07:59: burp[8625] WARNING: Checksum mismatch in block for f:/usr/bin/agetty:0000/0000/0145/0037
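
When comparing runs like this, it can help to count mismatches per file. The sketch below is a rough aid, not a burp tool: the pattern is inferred only from the warning lines quoted in this thread, and `mismatches_per_file` is a hypothetical helper. It tolerates the progress characters that precede some timestamps.

```python
import re
from collections import Counter

# Format assumed from the log lines in this thread:
#   ... WARNING: Checksum mismatch in block for f:<path>:<4 hex groups>
PATTERN = re.compile(
    r"WARNING: Checksum mismatch in block for f:(?P<path>.+):"
    r"(?P<block>[0-9A-Fa-f]{4}(?:/[0-9A-Fa-f]{4}){3})\s*$"
)

def mismatches_per_file(lines):
    """Count checksum-mismatch warnings for each file path."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[m.group("path")] += 1
    return counts
```

The output of `burp -av` could be saved to a file and passed in line by line, for example `mismatches_per_file(open("verify.log"))`, where `verify.log` is a placeholder name.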

This is very easy to reproduce: remove all data from the server and run the backup again, and it fails every time. I made sure both client and server run the latest git version.

This is the only client out of 10 that fails verify, I have tested many times.

Can you reproduce this?

Edit: another Raspberry Pi client fails verify on almost all files.

grke commented 8 years ago

My test server is a raspberry pi, and one of its clients is itself.

After the last few patches, it can make backups and validates fine.

Are you 100% sure that you used an updated binary on the client to make the backup?

My next patch tonight will help a bit, because it will make the server quit a backup if the client sends it blocks with non-matching checksums.

grke commented 8 years ago

I think I am seeing the same problem as you - an i386 machine backing up to my pi server. The data is no longer corrupted, which is a step forward.

The problem is that the i386 appears to be generating different checksums to the pi, so when the server verifies, it doesn't match what the client gave it.

It is probably some endian-ness, or signed-ness problem somewhere in the works.

githubcdr commented 8 years ago

At least good to know that intel x86 clients are working now as expected :)

grke commented 8 years ago

I am not sure yet whether it is the i386s or the pis that are working correctly!

githubcdr commented 8 years ago

A restore does restore the file correctly, so it's purely the verify operation in burp that fails atm.

grke commented 8 years ago

I made a push today, and I think it solves all the problems that I know about. To benefit, you need to upgrade servers and clients, and delete any existing protocol2 backups. The last problem to fix was a signed-ness problem in the calculation of rabin fingerprints. You would only see it backing up from a pi to intel, or vice versa.
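
To illustrate how a signed-ness difference can change a fingerprint, here is a minimal sketch. It uses a simple polynomial hash, not burp's Rabin implementation, and it assumes one common cause of such bugs: plain `char` in C is signed on typical x86 builds and unsigned on typical ARM Linux builds. The thread does not confirm that this was the exact cause. `MASK`, `PRIME` and `fingerprint` are names invented for the sketch.

```python
MASK = 0xFFFFFFFFFFFFFFFF  # emulate 64-bit unsigned wraparound
PRIME = 3                  # arbitrary multiplier for the sketch

def fingerprint(data: bytes, signed_bytes: bool) -> int:
    """Polynomial hash over `data`, reading bytes as signed or unsigned."""
    fp = 0
    for b in data:
        # With signed bytes, values >= 0x80 become negative,
        # as they would for a signed char in C.
        v = b - 256 if (signed_bytes and b >= 0x80) else b
        fp = (fp * PRIME + v) & MASK
    return fp

def platforms_agree(data: bytes) -> bool:
    """True if both byte interpretations give the same fingerprint."""
    return fingerprint(data, True) == fingerprint(data, False)
```

Under these assumptions, pure ASCII input such as `b"hello"` gives the same result either way, while any input containing a byte of 0x80 or above, such as `b"\x7f\x80\xff"`, diverges. That would fit a problem that only shows up when two different architectures are involved.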

Also, the server side now checks incoming blocks for rabin fingerprints and checksums, which is long overdue. This means that the server will not save blocks where it disagrees with the fingerprints/checksums that a client gave it.

I will leave this issue open until you can confirm that it is working OK.

githubcdr commented 8 years ago

Sounds really good, thanks. I will let you know. Are we prod ready after this? :)

githubcdr commented 8 years ago

All looks good on raspberry and other clients, will test some more this weekend

grke commented 8 years ago

Hello,

I still say not production ready. As you noticed, these bugs required changing algorithms, which meant that previous backups needed to be wiped out. I want to keep the flexibility of being able to do that for a while longer. Also, I only just started using it myself with more than one client. I would like to be able to run it for an extended period of time without finding problems.

I am closing this issue now. Just open new ones for anything else you find. Thanks for the reports and hints.