kyleabeauchamp / HP35_Data

Possible Data Corruption Issues #1

Open kyleabeauchamp opened 11 years ago

kyleabeauchamp commented 11 years ago
I ran into a data integrity error when decompressing xtc-with-water.tar.bz2.
Most of the data is recovered, but as near as I can tell, 16 generations are missing from RUN1309.

Also, there may be one or two bad trajectories, which often result from Folding@Home users who have unstable machines.

badmutex commented 11 years ago

To recover the data, use bzip2recover to write the blocks to separate files, remove the bad blocks, and recombine everything.

1) separate the blocks

$ bzip2recover xtc-with-water.tar.bz

2) remove the bad blocks

# check all the blocks
$ bzip2 -dtvv rec0*.tar.bz.bz2 >block-check.log 2>&1

# find the bad blocks
$ grep -B 1 error block-check.log | grep bz2

# move the bad blocks to the 'error' directory, and separate out
# the contiguous groups of 'ok' blocks. E.g. if blocks 10 and 15
# were bad, move 1-9, 11-14, 16-20 into separate directories
$ mkdir error recovery{1,2}
$ mv rec0022{0,1}*.tar.bz* error
$ for i in `seq -w 1 00219`; do mv rec${i}xtc*.tar.bz* recovery1; done
$ mv rec*.tar.bz* recovery2
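
When more than one stretch of blocks is bad, working out the contiguous groups by hand gets tedious. The following is a minimal sketch of that bookkeeping, not part of the original recipe. `good_block_ranges` is a hypothetical helper name, and it assumes block numbers are passed as plain decimal integers without the leading zeros that bzip2recover puts in its file names.

```shell
# Hypothetical helper: print each run of good blocks as "first-last".
# Usage: good_block_ranges TOTAL BAD1 [BAD2 ...]
# Assumption: numbers are plain decimals (no leading zeros), because
# shell arithmetic would treat a leading zero as octal.
good_block_ranges() {
    total=$1
    shift
    start=1
    # Visit the bad block numbers in ascending order
    for bad in $(printf '%s\n' "$@" | sort -n); do
        # Emit the run of good blocks that ends just before this bad one
        if [ "$bad" -gt "$start" ]; then
            echo "$start-$((bad - 1))"
        fi
        start=$((bad + 1))
    done
    # Emit the trailing run after the last bad block, if there is one
    if [ "$start" -le "$total" ]; then
        echo "$start-$total"
    fi
}
```

For the example in the comment above, `good_block_ranges 20 10 15` prints 1-9, 11-14 and 16-20 on separate lines; each range then gets its own recovery directory.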

3) recombine the blocks

# Combine all the contiguous blocks
$ bzip2 -dc recovery1/* >recovery1.tar
$ bzip2 -dc recovery2/* >recovery2.tar

4) adjust headers using find_tar_headers.pl

# the problem now is that tar can extract recovery1.tar (although it
# ends with an 'unexpected eof' message), but it cannot extract
# recovery2.tar, because that file starts mid-archive and the offset
# of its first header is not known.
# Use 'find_tar_headers.pl' to find the header (22716 in this case)
$ ../find_tar_headers.pl recovery2.tar | head -1
recovery2.tar:22716:xtc-with-water/RUN1309/CLONE0/frame94.xtc:311414

# Now use 'tail' to start at the offset
$ tail -c +22716 recovery2.tar >recovery2_fixed.tar
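
If find_tar_headers.pl is not at hand, the offset can be approximated with standard tools. This is a rough sketch under stated assumptions, not a replacement for the script. It relies on every ustar-style tar header carrying the magic string "ustar" at byte offset 257 of its 512-byte header block, and on a grep that supports -a, -b and -o (GNU grep does). `first_tar_header_pos` is a hypothetical helper name. It can be fooled if the string "ustar" happens to occur inside file data, so the result should be checked with `tar tvf` before it is trusted.

```shell
# Hypothetical helper: print the 1-based byte position of the first
# complete tar header in a file, suitable for 'tail -c +N'.
# Usage: first_tar_header_pos FILE
first_tar_header_pos() {
    # 0-based byte offsets of every occurrence of the "ustar" magic
    offsets=$(LC_ALL=C grep -abo 'ustar' "$1" | cut -d: -f1)
    for off in $offsets; do
        # The magic sits 257 bytes into its header, so a match before
        # byte 257 belongs to a header whose start was cut off
        if [ "$off" -ge 257 ]; then
            # header start (0-based) is off - 257; add 1 for tail's 1-based count
            echo $((off - 256))
            return 0
        fi
    done
    return 1
}
```

The printed value can be passed straight to tail, for example `tail -c +"$(first_tar_header_pos recovery2.tar)" recovery2.tar >recovery2_fixed.tar`.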

Now the files can be extracted from the tar archives:

$ tar xvf recovery1.tar
$ tar xvf recovery2_fixed.tar

kyleabeauchamp commented 11 years ago

Thanks Badi. I'll also try to re-upload this dataset directly from the FAH data, which will have the added benefit of giving users longer trajectories.