glandium / vmfs-tools

http://glandium.org/projects/vmfs-tools/
GNU General Public License v2.0

Read corruption on large files (vmdk) #8

Open ocgltd opened 11 years ago

ocgltd commented 11 years ago

I've discovered that when reading large files from a vmfs volume, vmfs-tools does not consistently present the same data to the reading application (i.e. corruption). For example, after copying a large vmdk file to an external disk, I ran sha1sum on the source and destination 3 times. As you can see below, the source (vmfs) does not return the same sum on every run:

Compare 1 failed on file [/mnt/vmfs/dvr/dvr-flat.vmdk]. MD5/SHA source=c8809289e7e48549c9594400a66e1b987947c326, destination=15a78ae7140b3ab82009a35bc64f32bc32a60274
Compare 2 failed on file [/mnt/vmfs/dvr/dvr-flat.vmdk]. MD5/SHA source=c8809289e7e48549c9594400a66e1b987947c326, destination=15a78ae7140b3ab82009a35bc64f32bc32a60274
Compare 3 failed on file [/mnt/vmfs/dvr/dvr-flat.vmdk]. MD5/SHA source=7d1d7ce34910758ae75545da8b0decbafdcb2b02, destination=15a78ae7140b3ab82009a35bc64f32bc32a60274

I ran the sha1sum under valgrind using debugvmfs but found only one error (not sure it's related):

valgrind --leak-check=full --show-reachable=yes /usr/local/sbin/debugvmfs /dev/sda3 cat /PBX2/PBX2-flat.vmdk | sha1sum -b
==4945== Memcheck, a memory error detector
==4945== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==4945== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==4945== Command: /usr/local/sbin/debugvmfs /dev/sda3 cat /PBX2/PBX2-flat.vmdk
==4945==
==4945== Warning: noted but unhandled ioctl 0x5382 with no size/direction hints
==4945== This could cause spurious value errors to appear.
==4945== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==4945== Conditional jump or move depends on uninitialised value(s)
==4945== at 0x40A303: vmfs_vol_open (vmfs_volume.c:223)
==4945== by 0x407A9F: vmfs_fs_open (vmfs_fs.c:203)
==4945== by 0x40295D: main (debugvmfs.c:675)
==4945==

The same problem occurs on 3 different VMware ESXi 4.1 hosts (we are hoping to use vmfs-tools for our bare-metal backups). We are running vmfs-tools 0.2.5 under CentOS 6.2 x86_64.
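A whole-file sha1sum only shows that two reads differed, not where. As a rough sketch (not part of vmfs-tools), the Python below hashes the file in fixed-size chunks on each pass and reports which chunk indexes changed between passes, which could help narrow the corruption down to an offset range. The function names, the 64 MiB chunk size, and the number of passes are all assumptions chosen for illustration, and any caching between passes could mask differences, which this sketch does not address.

```python
import hashlib


def chunk_digests(path, chunk_size=64 * 1024 * 1024):
    """Return one SHA-1 hex digest per fixed-size chunk of the file."""
    digests = []
    # Default buffered mode: read(n) keeps reading until n bytes or EOF,
    # so chunk boundaries stay the same from one pass to the next.
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            digests.append(hashlib.sha1(data).hexdigest())
    return digests


def differing_chunks(a, b):
    """Return the chunk indexes where two digest lists disagree."""
    longest = max(len(a), len(b))
    return [
        i for i in range(longest)
        if i >= len(a) or i >= len(b) or a[i] != b[i]
    ]


def unstable_chunks(path, passes=3, chunk_size=64 * 1024 * 1024):
    """Read the file several times; return chunk indexes that ever changed."""
    baseline = chunk_digests(path, chunk_size)
    changed = set()
    for _ in range(passes - 1):
        changed.update(differing_chunks(baseline, chunk_digests(path, chunk_size)))
    return sorted(changed)


# Example call (the path is the one from the report above):
#   for index in unstable_chunks("/mnt/vmfs/dvr/dvr-flat.vmdk"):
#       print("chunk", index, "starts at byte", index * 64 * 1024 * 1024)
```

If the same chunk indexes show up on every run, that would point at specific offsets in the file; if they move around, the problem is more likely in the read path itself.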

glandium commented 11 years ago

Is /dev/sda3 sata, scsi, iscsi, ... ? How big is the vmfs volume, and how big is dvr-flat.vmdk ?

ocgltd commented 11 years ago

Here's info on the file:
-rw------- 1 root root 68719476736 Aug 7 00:54 dvr-flat.vmdk
but this has failed on 32GB vmdk files too. The volume is 1.5TB. On this system there are 2 mirrored SATA drives sitting behind an LSI RAID controller. It's worth noting that I have only seen this error on vmdk files (so size may be related).

ocgltd commented 11 years ago

Any update on this issue? Still using vmfs-tools for backup and still experiencing the error above....

ocgltd commented 11 years ago

Is anyone else doing a checksum/md5 to verify integrity? I can't believe I'm the only one experiencing this...

moryb41 commented 9 years ago

You are definitely not alone. My 1.9 TB file resides on a 3.0 TB NetApp SAN LUN presented to a CentOS server via iSCSI. I first encountered problems trying to rsync the VM files to another NAS device:

# rsync -av --progress --bwlimit=20000 /mnt/vmfs1/SRV-DCSPLUNK/ /mnt/qnap/SRV-DCSPLUNK
sending incremental file list
SRV-DCSPLUNK_3-flat.vmdk
1979120929792 100%   18.88MB/s   27:46:03 (xfer#1, to-check=8/22)
rsync: read errors mapping "/mnt/vmfs1/SRV-DCSPLUNK/SRV-DCSPLUNK_3-flat.vmdk": Input/output error (5)
WARNING: SRV-DCSPLUNK_3-flat.vmdk failed verification -- update discarded (will try again).
SRV-DCSPLUNK_3-flat.vmdk
1979120929792 100%   18.89MB/s   27:45:17 (xfer#2, to-check=8/22)
rsync: read errors mapping "/mnt/vmfs1/SRV-DCSPLUNK/SRV-DCSPLUNK_3-flat.vmdk": Input/output error (5)
ERROR: SRV-DCSPLUNK_3-flat.vmdk failed verification -- update discarded.

sent 3958725043956 bytes  received 52 bytes  19799812.66 bytes/sec
total size is 2062038396264  speedup is 0.52
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
[root@centos ~]# 

Other useful information:

netapp> lun show /vol/myvol/qt/lun
        /vol/myvol/qt/lun      3t (3298534883328) (r/w, online, mapped)

# parted -l
Model: NETAPP LUN (scsi)
Disk /dev/sdb: 3299GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  3299GB  3299GB

# df -h /mnt/vmfs1
Filesystem      Size  Used Avail Use% Mounted on
/dev/fuse       3.0T  2.0T  1.1T  64% /mnt/vmfs1

# ls -lh SRV-DCSPLUNK_3-flat.vmdk
-rw------- 1 root root 1.8T Apr 10 10:03 SRV-DCSPLUNK_3-flat.vmdk

# file SRV-DCSPLUNK_3-flat.vmdk
SRV-DCSPLUNK_3-flat.vmdk: ERROR: cannot read `SRV-DCSPLUNK_3-flat.vmdk' (Input/output error)

# md5sum SRV-DCSPLUNK_3-flat.vmdk
md5sum: SRV-DCSPLUNK_3-flat.vmdk: Input/output error

This tool is a great asset when you need it, but large file support is becoming the norm everywhere it seems.
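For the Input/output error case, it may help to know the byte offset at which reads start failing. The following is a minimal sketch in Python, assuming a Unix system where `os.pread` is available; the function name and the 1 MiB chunk size are illustrative choices, not anything provided by vmfs-tools.

```python
import os


def first_read_error(path, chunk_size=1024 * 1024):
    """Read the file sequentially.

    Returns (offset, errno) for the first read that raises OSError,
    or None if the whole file could be read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError as exc:
                # For example errno 5 (EIO), as in the rsync output above.
                return offset, exc.errno
            if not data:
                return None  # reached end of file without an error
            offset += len(data)
    finally:
        os.close(fd)


# Example call (the path is the one from the report above):
#   print(first_read_error("/mnt/vmfs1/SRV-DCSPLUNK/SRV-DCSPLUNK_3-flat.vmdk"))
```

Comparing the reported offset across files or volumes might show whether the failures cluster around a particular size boundary.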

jugleni commented 7 years ago

Friend, I have the same problem. Was this ever solved? Thank you.