moby / hyperkit

A toolkit for embedding hypervisor capabilities in your application
BSD 2-Clause "Simplified" License
3.61k stars 328 forks source link

Possibly corrupt qcow2 image #221

Open 198d opened 6 years ago

198d commented 6 years ago

I'm evaluating Hyperkit as a platform for managing a development VM. I've built the hyperkit binary with the qcow2 support and have run into an issue with a possibly corrupt disk image. I was installing NixOS on the disk when I believe the nixos-install binary ran out of memory. I forcefully killed the hyperkit process, started it again with more memory and was going to try the install again. Unfortunately, now I can't get the VM to interact with the disk correctly: commands like fsck and mount just hang. Here's the output I received the last time I tried to run fsck after booting up with the install CD (disk attached to Hyperkit with ahci-hd):

# fsck /dev/sda1
fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
NIXOS: recovering journal
hyperkit: [INFO] Allocator: 32765 used; 0 junk; 0 erased; 0 available; 0 copies; 0 roots; 0 Copying; 0 Copied; 0 Flushed; 0 Referenced; max_cluster = 32767
                                                                                                                                                           hyperkit: [INFO] Allocator: file contains cluster 0 .. 32767 will enlarge file to 0 .. 33279
              hyperkit: [INFO] resize: adding available clusters (Node ((x 32768) (y 33279) (l Empty) (r Empty) (h 1) (cardinal 512)))
                                                                                                                                      hyperkit: [ERROR] write sector = 4263936 length = 131072: I/O deadline exceeded
                                                                                                                                                                                                                     hyperkit: [INFO] (((description "Anonymous client 1784")
                                       (locks (((description "cluster 1") (mode Read) (released false)))))
                                                                                                            ((description "write sector = 4263936 length = 131072")
                                                                                                                                                                       (locks
                                                                                                                                                                                   (((description "cluster 3154") (mode Read) (released false))
              ((description "cluster 3") (mode Read) (released false))))))
                                                                          [   87.045565] ata7.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[   87.046829] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.047719] ata7.00: cmd 61/00:00:00:17:41/01:00:00:00:00/40 tag 0 ncq dma 131072 out
[   87.047719]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.049771] ata7.00: status: { DRDY }
[   87.050278] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.050883] ata7.00: cmd 61/00:08:00:18:41/01:00:00:00:00/40 tag 1 ncq dma 131072 out
[   87.050883]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.052611] ata7.00: status: { DRDY }
[   87.052994] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.053608] ata7.00: cmd 61/00:10:00:19:41/01:00:00:00:00/40 tag 2 ncq dma 131072 out
[   87.053608]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.055271] ata7.00: status: { DRDY }
[   87.055657] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.056167] ata7.00: cmd 61/00:18:00:1a:41/01:00:00:00:00/40 tag 3 ncq dma 131072 out
[   87.056167]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.057549] ata7.00: status: { DRDY }
[   87.057897] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.058416] ata7.00: cmd 61/00:20:00:1b:41/01:00:00:00:00/40 tag 4 ncq dma 131072 out
[   87.058416]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.059683] ata7.00: status: { DRDY }
[   87.060054] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.060546] ata7.00: cmd 61/00:28:00:1c:41/01:00:00:00:00/40 tag 5 ncq dma 131072 out
[   87.060546]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.061984] ata7.00: status: { DRDY }
[   87.062370] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.062801] ata7.00: cmd 61/00:30:00:1d:41/01:00:00:00:00/40 tag 6 ncq dma 131072 out
[   87.062801]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.064205] ata7.00: status: { DRDY }
[   87.064516] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.064937] ata7.00: cmd 61/00:38:00:1e:41/01:00:00:00:00/40 tag 7 ncq dma 131072 out
[   87.064937]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.066147] ata7.00: status: { DRDY }
[   87.066450] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.066871] ata7.00: cmd 61/00:40:00:1f:41/01:00:00:00:00/40 tag 8 ncq dma 131072 out
[   87.066871]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.068103] ata7.00: status: { DRDY }
[   87.068424] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.068861] ata7.00: cmd 61/20:48:00:20:41/00:00:00:00:00/40 tag 9 ncq dma 16384 out
[   87.068861]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.070141] ata7.00: status: { DRDY }
[   87.070446] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.070866] ata7.00: cmd 61/08:50:00:08:04/00:00:03:00:00/40 tag 10 ncq dma 4096 out
[   87.070866]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.072064] ata7.00: status: { DRDY }
[   87.072363] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.072782] ata7.00: cmd 61/10:58:00:08:80/00:00:03:00:00/40 tag 11 ncq dma 8192 out
[   87.072782]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.073996] ata7.00: status: { DRDY }
[   87.074316] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.074737] ata7.00: cmd 61/08:60:80:08:80/00:00:03:00:00/40 tag 12 ncq dma 4096 out
[   87.074737]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.075933] ata7.00: status: { DRDY }
[   87.076231] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.076670] ata7.00: cmd 61/08:68:00:09:80/00:00:03:00:00/40 tag 13 ncq dma 4096 out
[   87.076670]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.077919] ata7.00: status: { DRDY }
[   87.078220] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.078657] ata7.00: cmd 61/10:70:00:09:81/00:00:03:00:00/40 tag 14 ncq dma 8192 out
[   87.078657]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.079859] ata7.00: status: { DRDY }
[   87.080178] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.080641] ata7.00: cmd 61/10:78:00:08:00/00:00:05:00:00/40 tag 15 ncq dma 8192 out
[   87.080641]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.081959] ata7.00: status: { DRDY }
[   87.082325] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.082789] ata7.00: cmd 61/08:80:80:08:00/00:00:05:00:00/40 tag 16 ncq dma 4096 out
[   87.082789]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.084021] ata7.00: status: { DRDY }
[   87.084363] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.084813] ata7.00: cmd 61/00:88:00:09:00/01:00:05:00:00/40 tag 17 ncq dma 131072 out
[   87.084813]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.086208] ata7.00: status: { DRDY }
[   87.086536] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.086985] ata7.00: cmd 61/00:90:00:0a:00/01:00:05:00:00/40 tag 18 ncq dma 131072 out
[   87.086985]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.088313] ata7.00: status: { DRDY }
[   87.088631] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.089080] ata7.00: cmd 61/00:98:00:0b:00/01:00:05:00:00/40 tag 19 ncq dma 131072 out
[   87.089080]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.090400] ata7.00: status: { DRDY }
[   87.090718] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.091167] ata7.00: cmd 61/00:a0:00:0c:00/01:00:05:00:00/40 tag 20 ncq dma 131072 out
[   87.091167]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.092472] ata7.00: status: { DRDY }
[   87.092793] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.093286] ata7.00: cmd 61/00:a8:00:0d:00/01:00:05:00:00/40 tag 21 ncq dma 131072 out
[   87.093286]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.095026] ata7.00: status: { DRDY }
[   87.095364] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.095796] ata7.00: cmd 61/00:b0:00:0e:00/01:00:05:00:00/40 tag 22 ncq dma 131072 out
[   87.095796]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.097039] ata7.00: status: { DRDY }
[   87.097364] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.097791] ata7.00: cmd 61/00:b8:00:0f:00/01:00:05:00:00/40 tag 23 ncq dma 131072 out
[   87.097791]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.099019] ata7.00: status: { DRDY }
[   87.099353] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.099814] ata7.00: cmd 61/00:c0:00:10:41/01:00:00:00:00/40 tag 24 ncq dma 131072 out
[   87.099814]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.104163] ata7.00: status: { DRDY }
[   87.104474] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.104925] ata7.00: cmd 61/00:c8:00:11:41/01:00:00:00:00/40 tag 25 ncq dma 131072 out
[   87.104925]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.106229] ata7.00: status: { DRDY }
[   87.106536] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.106987] ata7.00: cmd 61/00:d0:00:12:41/01:00:00:00:00/40 tag 26 ncq dma 131072 out
[   87.106987]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.108441] ata7.00: status: { DRDY }
[   87.108746] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.109236] ata7.00: cmd 61/00:d8:00:13:41/01:00:00:00:00/40 tag 27 ncq dma 131072 out
[   87.109236]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.110582] ata7.00: status: { DRDY }
[   87.110940] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.111459] ata7.00: cmd 61/00:e0:00:14:41/01:00:00:00:00/40 tag 28 ncq dma 131072 out
[   87.111459]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.112774] ata7.00: status: { DRDY }
[   87.113162] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.113651] ata7.00: cmd 61/00:e8:00:15:41/01:00:00:00:00/40 tag 29 ncq dma 131072 out
[   87.113651]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.115094] ata7.00: status: { DRDY }
[   87.115399] ata7.00: failed command: WRITE FPDMA QUEUED
[   87.115823] ata7.00: cmd 61/00:f0:00:16:41/01:00:00:00:00/40 tag 30 ncq dma 131072 out
[   87.115823]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   87.117039] ata7.00: status: { DRDY }

qemu-img check and qcow-tool check both say everything is OK. I've also ran qcow-tool repair on the image. Here's the output from qcow-tool info:

((version Three) (backing_file_offset 0) (backing_file_size 0)
 (cluster_bits 16) (size 53687091200) (crypt_method None) (l1_size 100)
 (l1_table_offset ((bytes 196608) (is_mutable false) (is_compressed false)))
 (refcount_table_offset
  ((bytes 65536) (is_mutable false) (is_compressed false)))
 (refcount_table_clusters 1) (nb_snapshots 0) (snapshots_offset 0)
 (additional
  (((dirty false) (corrupt false) (lazy_refcounts false)
    (autoclear_features 0) (refcount_order 4))))
 (extensions
  ((Feature_name_table
    (((ty Incompatible) (bit 0) (feature Dirty))
     ((ty Incompatible) (bit 1) (feature Corrupt))
     ((ty Compatible) (bit 0) (feature Lazy_refcounts)))))))

Any advice or guidance is much appreciated.

198d commented 6 years ago

So, after trying again with a new disk image (created with qemu-img create) and watching the hyperkit process while running nixos-install, I noticed that the install hung when the new disk image got to roughly the same size (in terms of number of "clusters") as the other. I went ahead and used qcow-tool create to create a new disk image and everything worked. What is going on there?

djs55 commented 6 years ago

hyperkit and qcow-tool share the same implementation of qcow2 and so are 100% compatible with each other.

Since the majority of our testing focuses on qcow2 files created by qcow-tool there may still be incompatibilities with images created by qemu-img. So for hyperkit I recommend sticking to qcow-tool.

Is your reproduction case easily scriptable? If so, could you file an issue against https://github.com/mirage/ocaml-qcow? Thanks!

cfergeau commented 5 years ago

I've been having this issue too. No automated way to reproduce, but I can reproduce easily starting with the images at http://download.cirros-cloud.net/0.4.0/. I ran the first steps on a linux box though.

$ qemu-img create -f qcow2 cirros.qcow2 3G 
$ virt-resize --expand /dev/sda1 cirros-0.4.0-x86_64-disk.img cirros.qcow2

Start this image with hyperkit:

$ /Users/$USER/bin/hyperkit   -A -m 1G  -s 0:0,hostbridge -s 31,lpc \
-s 2:0,virtio-blk,file://$BASEDIR/cirros.qcow2,format=qcow \
-l com1,autopty=/Users/$USER/.hyperkit/crc/tty \
-f kexec,$BASEDIR/cirros-0.4.0-x86_64-kernel,$BASEDIR/cirros-0.4.0-x86_64-initramfs,"LABEL=cirros-rootfs ro console=tty0 console=ttyS0,115200n8"  

Connect to its serial console:

sudo screen /dev/ttys001 115200

Login to the VM, and then run:

$ dd if=/dev/zero of=/test

The freeze described here should happen shortly after.

If in the very first step, we create the qcow2 image with lazy refcounts enabled (which qcow-tool create enables by default) , then I no longer get the freeze:

$ qemu-img create -f qcow2 -o lazy_refcounts=on cirros.qcow2 3G
cfergeau commented 5 years ago

Forgot to mention, I was initially reproducing the bug on a totally different linux VM, so I suspect it can be reproduced by triggering a big enough growth of a qcow2 file without lazy refcounts