mekberg / boar

Automatically exported from code.google.com/p/boar

verify should complete even if errors are found (akin to behavior of chkdsk) #65

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Running a full verify on a repo can be a veerrrry time consuming command. If 
the entire session tree has been verified, and then the blob checksumming of a 
large (1M+) file repo is underway, after churning away for hours and hours, if 
a blob is found to be invalid verify prints an assert error and terminates.

This means that should the user choose to repair the issue by manually 
replacing the offending blob, once that one issue is addressed, the ENTIRE 
verify process needs to be restarted in order to find out if there are any 
*other* issues. 

Forcing the user to restart an extremely expensive operation from the beginning 
and rehash an enormous bloblist, perhaps to find out that the 600,000th blob is 
corrupt, and then redo the process to find out that the 600,001st blob is also 
corrupt but everything else is fine, is extremely tedious (and imho 
unnecessary).

I realize the encouraged workflow for boar is to maintain clones and in the 
event of corruption swap in a clone or rsync from a known good clone. But in 
the case of a new (large) import (100K+ files) that boar believes completed 
successfully, but in fact did not (as such no "good" clone exists) sometimes 
it's much simpler/more practical to swap in a good copy of the bad blob than 
having to redo a large and otherwise successful import.

The user always has the option to terminate or pause the verify process with 
the appropriate keyboard interrupt, so the verify command shouldn't make this 
decision for them. In the same way that disk/file system integrity scans run to 
completion and then output a summary of found errors, verify should behave 
likewise. It can display errors inline as it finds them, but it should present 
a "summary" of issues at the end, and run to completion unless interrupted by 
the user.
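
A minimal sketch of that behavior, not boar's actual code: `verify_all`, `check` and `format_summary` are hypothetical names. Each item is checked in turn, failures are printed inline and collected, Ctrl-C stops the run early, and a summary is produced at the end.

```python
def verify_all(items, check):
    """Run check(item) on every item, collecting failures instead of aborting.

    check is a hypothetical callable that returns None when the item is
    fine, or a short error message when it is not.
    """
    failures = []
    checked = 0
    interrupted = False
    try:
        for item in items:
            error = check(item)
            checked += 1
            if error is not None:
                # Report inline, but keep going.
                print("verify error: %s: %s" % (item, error))
                failures.append((item, error))
    except KeyboardInterrupt:
        # The user decided to stop; still report what was found so far.
        interrupted = True
    return checked, failures, interrupted


def format_summary(checked, failures, interrupted):
    """Build the end-of-run summary as a single string."""
    lines = ["checked %d item(s), %d error(s)" % (checked, len(failures))]
    if interrupted:
        lines.append("verify was interrupted; remaining items were not checked")
    for item, error in failures:
        lines.append("  %s: %s" % (item, error))
    return "\n".join(lines)
```

The only design point is that a failed check becomes data in a list rather than an assertion, so the loop can carry on to the next blob.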

I'll open another issue for this, but given the output of verify only presents 
the hex digest of bad blobs (not the session in which they occur), and to 
account for the case where a user has an md5 hash and wishes to know if they 
have any copies of this file in their repo (but don't know what session it 
might be in), find should accept a hex digest with the session parameter being 
optional (if provided, the session can restrict which bloblists are recursed, 
otherwise, find defaults to searching thru all bloblists).
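
Roughly what such a lookup could look like, as a sketch only. It assumes the bloblists have already been loaded into a dict mapping session name to a list of entries, and that each entry carries `md5sum` and `filename` keys; both the data shape and the name `find_by_digest` are assumptions for illustration, not boar's API.

```python
def find_by_digest(bloblists, digest, session=None):
    """Return (session, filename) pairs whose blob matches the given md5.

    bloblists: dict of session name -> list of entry dicts (assumed shape).
    session:   optional session name; when given, only that bloblist is
               searched, otherwise all of them are.
    """
    digest = digest.lower()
    matches = []
    for name in sorted(bloblists):
        if session is not None and name != session:
            continue
        for entry in bloblists[name]:
            if entry["md5sum"] == digest:
                matches.append((name, entry["filename"]))
    return matches
```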

Original issue reported on code.google.com by cryptob...@gmail.com on 2 Mar 2012 at 1:32

GoogleCodeExporter commented 9 years ago
As verification is a lengthy procedure, one might argue that most users would 
like to know as soon as possible whether the repo is healthy or not. Continuing 
the scan after an error is found might delay the answer for many hours. 

Getting a corrupted repo really is a sign that something is deeply wrong with 
your system. It should be a rare occurrence, and when it happens, the user 
should not be led in any way to believe that it is "fixable". True, if some 
blobs are missing they might be replaced from a workdir. But if some other 
data, such as bloblists, are gone, they are gone forever. 

If a user is skilled enough to start manually repairing a repo, I'd think that 
same user is also competent to create a script that checks each blob manually. 
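
Such a script could be as small as the sketch below. It rests on one loud assumption: that every file under the given directory is a blob stored under a filename equal to its own md5 hex digest. The function names are made up for illustration, and the script can only spot corrupt blobs, not missing ones, since it never reads the bloblists.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_blobs(blobs_dir):
    """Return the names of all files whose content does not hash to their name.

    Assumes (not verified against boar) that a blob's filename is its md5.
    """
    bad = []
    for dirpath, dirnames, filenames in os.walk(blobs_dir):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            if md5_of_file(os.path.join(dirpath, name)) != name:
                bad.append(name)
    return bad
```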

I know that this feature would be useful to you in your current situation, but 
I don't think it is appropriate to add complexity to this very essential part 
of the code unless there is a strong need for it.

Original comment by ekb...@gmail.com on 2 Mar 2012 at 4:52

GoogleCodeExporter commented 9 years ago
Mats, I agree that a user should not be led to believe something like a missing 
bloblist or other important session file is in any way "fixable". That said, I 
opened this issue due to the fact that I think there *is* an important 
difference between a corrupt blob vs a damaged sessions directory.

I agree a user will want to know if there is a problem as soon as possible, but 
I don't think blobs should be treated the same as the sessions folder in that 
respect. If the sessions folder doesn't verify, having boar throw an error from 
a failed assertion is perfectly reasonable because of the chained nature of the 
sessions folder.

Blobs on the other hand are much more atomic. While of course you wouldn't want 
to have any blob corruption and it is definitely an indication something has 
gone wrong, there are plausible reasons why a blob might go "missing" or have 
an issue that does not apply to the json files in the sessions folder.

I tend to think of these things from an end user perspective given likely use 
cases or scenarios. In this case, I would propose the situation where a virus 
scanner gets its hands on the boar repo and pulls out let's say 3 blobs that it 
believes are infected.

The verify command should not treat all errors as equal. And given how 
extremely lengthy a full verification command can be, I think it should be 
structured in a way that makes it more atomic and with efficiency in mind. To 
this end I think the verification procedure should do something along these 
lines with the verification proceeding in order of increasingly time consuming 
stages so as to alert a user of damage as early in the process as possible:

1) Verify the integrity of the sessions folder by validating bloblist/session 
files against the values in the session.md5 file

Any corruption during this phase indicates critical damage, and unless a user 
knows they manually edited the session/bloblist files, they should assume the 
session in question has irrecoverably "lost data" and be encouraged to restore 
from a healthy clone or realize they are voiding their warranty if they attempt 
to manually repair things ;)

2) After the json files are verified, the next step is to enumerate the 
bloblists and verify that the blobs pointed to actually exist.

The order of blobs should be deterministic such that all other parameters being 
equal blob ### is always the same (the idea being a verify operation can be 
aborted and resumed from a given offset, and given how extremely time consuming 
a full blob verification is, this can save A LOT of time).

Enumerating the bloblists in order of appearance but processed by session would 
be a good way to do it, I think. Grouping the output and processing blobs by 
session (the sessions ordered by first appearance in the repo) makes much more 
sense imo.

2b) Prior to doing a full rehashing of files, a much faster integrity check on 
the blobs would be to make sure the size value of the blob matches what is 
indicated in the bloblist. This check can be performed at the same time the 
existence of blobs is being checked. If the size of the blob is wrong, 
something is wrong.

3) As mentioned above, rehashing of blobs should proceed in a deterministic 
order such that it would be easy to implement an offset parameter to the verify 
command. Additionally, it should be possible to specify a session and verify 
will only perform the relevant checks on the given session.
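
To make steps 2, 2b and 3 concrete, here is a sketch of the cheap pass (existence and size only, no rehashing) over a deterministic blob order, with an offset and an optional session filter. Everything here is an assumption for illustration: the `md5sum`/`size` entry keys, the `path_for` callback that maps a digest to a blob path, and all function names.

```python
import os


def ordered_blobs(sessions, only_session=None):
    """List (session, entry) pairs in a stable order, each blob once.

    sessions: list of (session_name, entries), already ordered by first
    appearance in the repo (assumed input shape).
    """
    seen = set()
    ordered = []
    for name, entries in sessions:
        if only_session is not None and name != only_session:
            continue
        for entry in entries:
            if entry["md5sum"] not in seen:
                seen.add(entry["md5sum"])
                ordered.append((name, entry))
    return ordered


def quick_check(entry, blob_path):
    """Cheap test: the blob must exist and have the recorded size."""
    if not os.path.exists(blob_path):
        return "missing"
    if os.path.getsize(blob_path) != entry["size"]:
        return "size mismatch"
    return None


def quick_verify(sessions, path_for, offset=0, only_session=None):
    """Run quick_check from blob number `offset` onward; return all problems."""
    problems = []
    blobs = ordered_blobs(sessions, only_session)
    for index, (name, entry) in enumerate(blobs):
        if index < offset:
            continue  # already verified in an earlier, aborted run
        error = quick_check(entry, path_for(entry["md5sum"]))
        if error is not None:
            problems.append((index, name, entry["md5sum"], error))
    return problems
```

Because the order depends only on the inputs, blob number N is the same on every run with the same parameters, so an aborted run can be resumed by passing the last index seen as `offset`. A full rehash could reuse the same enumeration.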

These types of parameters would be highly useful in situations where an import 
seems to have completed successfully, but may have been interrupted and the 
user wants to "be sure" everything went smoothly (but it could make the 
difference between verifying 1M files or 10K files). This is especially true of 
the blob verification as it is by far the most costly. Being able to constrain 
the session (and possibly revision) you want to verify blobs for makes a 
difference on the order of hours or even days, and more importantly when you're 
dealing with such long running processes it makes the difference between 
running the command or never using it period.

And to me that's the crux of the matter. Corruption in any backup system or VCS 
is a big deal and being able to detect such corruption as early as possible is 
important. Once a repo gets to a certain size, if your only option is to verify 
the *entire* repo or not verify it at all, the time involved is enough to make 
people avoid the command altogether until they are *sure* something is wrong. 
Being able to run constrained verify commands makes it practical to use it to 
"double check" that things are ok or that that last checkin was fine, vs 
something you only run when things have already gone wrong and you're trying to 
figure out why.

With regard to the issue of a user finding out something is wrong vs allowing 
verify to complete, I don't think it's an either/or situation. There is no 
reason to print out individual ### of ### progress lines imo. I think the 
progress output should be more like the output of import: *progress* is tracked 
using updating status lines, and only important warnings or informational 
messages are printed to separate lines. Then, instead of using something like 
assert bad_blob or what have you, you do an if bad_blob: print "blob failed 
verification:", blob.
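
As a rough sketch of that output style (the names `run_with_status` and `is_bad` are hypothetical, and this is not how boar's import actually draws its progress): the counter is rewritten in place with a carriage return, while each failure gets a permanent line of its own.

```python
import sys


def run_with_status(items, is_bad, out=sys.stdout):
    """Show progress on one updating line; print failures on separate lines.

    is_bad is a hypothetical predicate that returns True for a failed blob.
    """
    total = len(items)
    bad = []
    for count, item in enumerate(items, 1):
        if is_bad(item):
            # Blank out the status line, then write a message that stays.
            out.write("\r" + " " * 40 + "\r")
            out.write("blob failed verification: %s\n" % item)
            bad.append(item)
        # "\r" returns to the start of the line so the counter is overwritten.
        out.write("\rverifying blob %d of %d" % (count, total))
        out.flush()
    out.write("\n")
    return bad
```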

This goes back to my contention that a missing blob is not a reason to throw 
your boar repo out the window. Especially if your repo is massive. It goes back 
to *why* a repo might become corrupt.

1) Disk corruption
2) An import operation that didn't complete
3) An antivirus like software blocking or removing a blob

Case one can affect any part of the repo and the verify command as it is will 
catch this (and the solution is having good backups/clones)

Case two can *much* more efficiently be caught if verify accepted 
session/revision parameters such that you could verify the last/most recent 
imports, assuming you have verified your repo up to a point in time. As it 
stands, there is no good option to actually check if you think something went 
wrong with a recent import; you are forced to verify *everything*.

Case three is such that the boar session (the most critical component of the 
boar backend) is perfectly intact. Missing blobs in no way prevent boar 
from functioning, they just mean isolated data loss. It is easy to drop 
missing blobs in place, as long as the file is recoverable or can be copied 
over from a healthy boar clone.

Given that boar repos are designed to manage binary files, they can become 
HUGE. If a repo is hundreds of gigs, deleting an entire repo and cloning over a 
healthy repo (rather than dropping in the missing blob in place, or rsyncing 
it) is one of those things that makes a difference on the order of hours, and 
if you are using any kind of flash media, saves a lot of write cycles.

I'll wrap up my essay, but I think while adding arguments/options to a command 
does arguably introduce complexity, it needs to be balanced against usability. 
Repo corruption is serious, and the verify command is the tool boar provides to 
detect it. There are only a handful of likely reasons for repo corruption, and 
when you are dealing with commands that require a very significant time to run, 
optimizing them for common use cases is time/complexity well worth it.

And obviously, I've collected all my thoughts on the verify mechanism here, but 
I don't intend this as an all or nothing recommendation. If you find anything 
here you agree with in whole or in part and feel is worth implementing, great :)

/cb

Original comment by cryptob...@gmail.com on 4 Mar 2012 at 12:22

GoogleCodeExporter commented 9 years ago
Okay, you are raising some valid points here. For a huge (HUGE) repo, you might 
not want to run a complete verify. Many users would want to run a verify during 
the night to avoid affecting performance during work hours. And if your repo 
takes a few days or more for a complete verify, that would be a problem. Also, 
in such a case you would prefer to attempt a repair before throwing the old 
repo out. Some kind of partial verify is desirable, as well as a way to list 
erroneous/missing blobs.

Original comment by ekb...@gmail.com on 4 Mar 2012 at 10:03