As verification is a lengthy procedure, one might argue that most users would
like to know as soon as possible whether the repo is healthy or not. Continuing
the scan after an error is found might delay the answer for many hours.
Getting a corrupted repo really is a sign that something is deeply wrong with
your system. It should be a rare occurrence, and when it happens, the user
should not be led in any way to believe that it is "fixable". True, if some
blobs are missing they might be replaced from a workdir. But if some other
data, such as bloblists, are gone, they are gone forever.
If a user is skilled enough to start manually repairing a repo, I'd think that
same user is also competent to create a script that checks each blob manually.
I know that this feature would be useful to you in your current situation, but
I don't think it is appropriate to add complexity to this very essential part
of the code unless there is a strong need for it.
Original comment by ekb...@gmail.com
on 2 Mar 2012 at 4:52
Mats, I agree that a user should not be led to believe something like a missing
bloblist or other important session file is in any way "fixable". That said, I
opened this issue because I think there *is* an important difference between a
corrupt blob and a damaged sessions directory.
I agree a user will want to know if there is a problem as soon as possible, but
I don't think blobs should be treated the same as the sessions folder in that
respect. If the sessions folder doesn't verify, having boar throw an error from
a failed assertion is perfectly reasonable because of the chained nature of the
sessions folder.
Blobs, on the other hand, are much more atomic. While of course you wouldn't want
any blob corruption, and it is definitely an indication that something has gone
wrong, there are plausible reasons why a blob might go "missing" or have an issue,
reasons that do not apply to the json files in the sessions folder.
I tend to think of these things from an end-user perspective given likely use
cases or scenarios. In this case, consider the situation where a virus scanner
gets its hands on the boar repo and pulls out, let's say, 3 blobs that it
believes are infected.
The verify command should not treat all errors as equal. Given how extremely
lengthy a full verification can be, I think it should be structured in a more
atomic way and with efficiency in mind. To this end, the verification procedure
should do something along these lines, proceeding in order of increasingly
time-consuming stages so as to alert the user of damage as early in the process
as possible:
1) Verify the integrity of the sessions folder by validating the bloblist/session
files against the values in the session.md5 file.
Any corruption during this phase indicates critical damage. Unless a user knows
they manually edited the session/bloblist files, they should assume the session
in question has irrecoverably "lost data" and should be encouraged to restore
from a healthy clone, or accept that they are voiding their warranty if they
attempt manual repairs ;)
2) After the json files are verified, the next step is to enumerate the
bloblists and verify that the blobs they point to actually exist.
The order of blobs should be deterministic, such that, all other parameters being
equal, blob ### is always the same (the idea being that a verify operation can be
aborted and resumed from a given offset; given how extremely time-consuming a
full blob verification is, this can save a lot of time).
Enumerating the bloblists in order of appearance but processing them by session
would be a good way to do it, I think. Grouping the output and processing blobs by
session (the sessions ordered by first appearance in the repo) makes much more
sense imo.
2b) Prior to doing a full rehashing of files, a much faster integrity check on
the blobs would be to make sure the size of each blob on disk matches the size
recorded in the bloblist. This check can be performed at the same time the
existence of blobs is being checked. If the size of a blob is wrong, something
is wrong.
3) As mentioned above, rehashing of blobs should proceed in a deterministic
order so that it would be easy to implement an offset parameter for the verify
command. Additionally, it should be possible to specify a session so that verify
only performs the relevant checks on that session. (A rough sketch of this
staged approach follows below.)
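To make this concrete, here is a rough sketch of the staged approach I have in
mind. It is purely illustrative and written against an assumed repo layout (a
"sessions" directory holding bloblist.json/session files plus a session.md5
checksum file, and a "blobs" directory storing each blob under its md5 name);
none of the names, keys or functions are boar's actual internals.

    # Hypothetical sketch only, not boar's actual code.
    import os, json, hashlib

    def md5sum(path, chunk=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def verify_repo(repo, only_session=None, offset=0):
        sessions_dir = os.path.join(repo, "sessions")
        blobs_dir = os.path.join(repo, "blobs")

        # Stage 1: cheap check of the sessions folder against session.md5.
        # Any mismatch here is critical damage, so abort immediately.
        for session in sorted(os.listdir(sessions_dir)):
            if only_session and session != only_session:
                continue
            sdir = os.path.join(sessions_dir, session)
            with open(os.path.join(sdir, "session.md5")) as f:
                for line in f:
                    expected, name = line.split()
                    actual = md5sum(os.path.join(sdir, name))
                    assert actual == expected, "CRITICAL: %s/%s is corrupt" % (session, name)

        # Stage 2 / 2b: enumerate bloblists in a deterministic order and check
        # that each referenced blob exists and has the recorded size.
        wanted = []  # (md5, size) pairs in deterministic order
        for session in sorted(os.listdir(sessions_dir)):
            if only_session and session != only_session:
                continue
            with open(os.path.join(sessions_dir, session, "bloblist.json")) as f:
                for entry in json.load(f):
                    wanted.append((entry["md5sum"], entry["size"]))
        for md5, size in wanted:
            path = os.path.join(blobs_dir, md5)
            if not os.path.exists(path):
                print("missing blob:", md5)
            elif os.path.getsize(path) != size:
                print("size mismatch for blob:", md5)

        # Stage 3: the expensive part. Rehash every blob, in the same
        # deterministic order, skipping the first `offset` blobs so an
        # aborted run can be resumed.
        for i, (md5, _size) in enumerate(wanted):
            if i < offset:
                continue
            path = os.path.join(blobs_dir, md5)
            if os.path.exists(path) and md5sum(path) != md5:
                print("blob failed verification:", md5)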
These types of parameters would be highly useful in situations where an import
seems to have completed successfully but may have been interrupted, and the user
wants to "be sure" everything went smoothly; they could make the difference
between verifying 1M files or 10K files. This is especially true of the blob
verification, as it is by far the most costly. Being able to constrain the
session (and possibly revision) you want to verify blobs for makes a difference
on the order of hours or even days, and more importantly, when you're dealing
with such long-running processes, it makes the difference between running the
command or never using it, period.
And to me that's the crux of the matter. Corruption in any backup system or VCS
is a big deal, and being able to detect such corruption as early as possible is
important. Once a repo gets to a certain size, if your only option is to verify
the *entire* repo or not verify it at all, the time involved is enough to make
people avoid the command altogether until they are *sure* something is wrong.
Being able to run constrained verify commands makes it practical to use verify to
"double check" that things are ok or that the last checkin was fine, versus
something you only run when things have already gone wrong and you're trying to
figure out why.
With regard to the issue of a user finding out something is wrong vs. allowing
verify to complete, I don't think it's an either-or situation. There is no
reason to print out individual ### of ### progress lines imo. If the progress
output were more like the output of import, with *progress* tracked using
updating status lines and only important warnings or informational messages
printed on separate lines, then instead of an assertion that aborts on a bad
blob you could simply do: if bad_blob: print "blob failed verification:", blob.
This goes back to my contention that a missing blob is not a reason to throw
your boar repo out the window. Especially if your repo is massive. It goes back
to *why* a repo might become corrupt.
1) Disk corruption
2) An import operation that didn't complete
3) An antivirus like software blocking or removing a blob
Case one can affect any part of the repo, and the verify command as it stands
will catch this (the solution being good backups/clones).
Case two could be caught *much* more efficiently if verify accepted
session/revision parameters, so that you could verify the last/most recent
imports, assuming you have already verified your repo up to a point in time. As
it stands, there is no good option to check whether something went wrong with a
recent import; you are forced to verify *everything*.
Case three leaves the boar sessions folder (the most critical component of the
boar backend) perfectly intact. Missing blobs in no way prevent boar from
functioning; they just mean isolated data loss. It is easy to drop missing blobs
back in place, as long as the file is recoverable from a workdir or can be
copied over from a healthy boar clone.
Given that boar repos are designed to manage binary files, they can become
HUGE. If a repo is hundreds of gigs, deleting the entire repo and cloning over a
healthy one (rather than dropping the missing blob back in place, or rsyncing
it) is one of those things that makes a difference on the order of hours; and if
you are using any kind of flash media, avoiding the full re-clone also saves a
lot of write cycles.
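As a rough illustration of "dropping in the missing blob in place" (again, an
assumed blob layout and an invented function, not something boar provides today):

    # Illustrative sketch, not an existing boar feature: copy one missing blob
    # over from a healthy clone instead of re-cloning hundreds of gigs. Assumes
    # blobs are stored under their md5 names in a "blobs" directory; the copy is
    # re-hashed to make sure the replacement really matches.
    import os, shutil, hashlib

    def restore_blob(damaged_repo, healthy_clone, blob_md5):
        src = os.path.join(healthy_clone, "blobs", blob_md5)
        dst = os.path.join(damaged_repo, "blobs", blob_md5)
        shutil.copyfile(src, dst)
        h = hashlib.md5()
        with open(dst, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        assert h.hexdigest() == blob_md5, "copied blob does not match its checksum"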
I'll wrap up my essay, but while adding arguments/options to a command does
arguably introduce complexity, that needs to be balanced against usability.
Repo corruption is serious, and the verify command is the tool boar provides to
detect it. There are only a handful of likely reasons for repo corruption, and
when you are dealing with commands that take a very significant time to run,
optimizing them for the common use cases is time/complexity well worth it.
And obviously, I've collected all my thoughts on the verify mechanism here, but
I don't intend this as an all or nothing recommendation. If you find anything
here you agree with in whole or in part and feel is worth implementing, great :)
/cb
Original comment by cryptob...@gmail.com
on 4 Mar 2012 at 12:22
Okay, you are raising some valid points here. For a huge (HUGE) repo, you might
not want to run a complete verify. Many users would want to run a verify during
the night to avoid affecting performance during work hours. And if your repo
takes a few days or more for a complete verify, that would be a problem. Also,
in such a case you would prefer to attempt a repair before throwing the old
repo out. Some kind of partial verify is desirable, as well as a way to list
erroneous/missing blobs.
Original comment by ekb...@gmail.com
on 4 Mar 2012 at 10:03
Original issue reported on code.google.com by
cryptob...@gmail.com
on 2 Mar 2012 at 1:32