The issue is caused by virtual disks returning I/O errors, which can have a variety of causes. To get to the bottom of these, the volumedriver's log needs to be inspected. As a starting point, check for errors logged by the FuseInterface component (I assume that's what you're using); from there it's usually a good next step to zoom in on the earlier messages from the thread that logged the error.
Log format: <timestamp> - <hostname> - <pid>/<thread_id> - <process_name>/<component_name> - <sequence_number> - <severity> - <message>
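To follow that trail mechanically, one can pull out every message logged by a thread that reported a FuseInterface error. Here's a minimal sketch in Python, assuming a plain log file in the format above (the severity values error/info are taken from the excerpts further down; the log path is passed on the command line):

```python
import re
import sys

# One regex per the format above; <timestamp> itself contains spaces
# (date, time, microseconds, UTC offset), so it is matched lazily.
LINE_RE = re.compile(
    r"(?P<ts>.+?) - (?P<host>\S+) - (?P<pid>\d+)/(?P<tid>\S+) - "
    r"(?P<proc>[^/]+)/(?P<comp>\S+) - (?P<seq>\S+) - (?P<level>\S+) - (?P<msg>.*)"
)

def entries(path):
    """Yield parsed log records, skipping lines that don't match."""
    with open(path, errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if m:
                yield m.groupdict()

def thread_trail(path, component="FuseInterface"):
    """All messages from every thread that logged an error for `component`."""
    recs = list(entries(path))
    tids = {r["tid"] for r in recs
            if r["comp"] == component and r["level"] == "error"}
    return [r for r in recs if r["tid"] in tids]

if __name__ == "__main__":
    for r in thread_trail(sys.argv[1]):
        print(r["ts"], r["tid"], r["comp"], r["level"], r["msg"])
```

The lazy timestamp match also tolerates a syslog prefix (as in the lines below), so it can be run directly against either the volumedriver's own log or a syslog capture.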
Hi @redlicha
I did find some messages on two nodes (physical hosts). Let me pick out a few:
Jun 10 14:07:04 N05 volumedriver_fs.sh: 2019-06-10 14:07:04 331775 +0800 - N05 - 2822/0x00007f088e659700 - volumedriverfs/FuseInterface - 0000000008070b9c - error - convert_exceptions: "/EKP-TRAN.raw": caught fungi::IOException: Remote operation failed Read, code 0
Jun 11 03:22:02 N05 volumedriver_fs.sh: 2019-06-11 03:22:02 908780 +0800 - N05 - 2822/0x00007efe99376700 - volumedriverfs/FuseInterface - 0000000008116c3e - error - convert_exceptions: "/MDM-PRD1.raw": caught fungi::IOException: request to remote node timed out , code 0
Jun 24 14:20:52 N06 volumedriver_fs.sh: 2019-06-24 14:20:52 604788 +0800 - N06 - 2763/0x00007fd268185700 - volumedriverfs/Volume - 000000000541ca12 - info - resize: aaa865ac-195f-4f17-beee-10620d3df4bb: Resizing volume from 13107200 to 0 clusters
Jun 24 14:20:52 N06 volumedriver_fs.sh: 2019-06-24 14:20:52 604838 +0800 - N06 - 2763/0x00007fd268185700 - volumedriverfs/Volume - 000000000541ca13 - error - resize: aaa865ac-195f-4f17-beee-10620d3df4bb: Cannot shrink volume from 13107200 to 0 clusters
Jun 24 14:20:52 N06 volumedriver_fs.sh: 2019-06-24 14:20:52 604948 +0800 - N06 - 2763/0x00007fd268185700 - volumedriverfs/FuseInterface - 000000000541ca14 - error - convert_exceptions: "/web01onlinepub-01.raw": caught fungi::IOException: Cannot shrink volume aaa865ac-195f-4f17-beee-10620d3df4bb, code 0
There were a bunch of VMs reporting I/O errors, but I cannot find related messages in the syslog on the other nodes, perhaps because they occurred two months ago and have since been rotated out. At that time there was an accident on the cluster and the vpool service was down on half of the nodes. Most of the VMs that had errors just went back to normal after being rebooted.
Hi @yongshengma,
The first two errors above are from redirected operations, i.e. for volumes actually hosted on node N but accessed on node M:
Jun 10 14:07:04 N05 volumedriver_fs.sh: 2019-06-10 14:07:04 331775 +0800 - N05 - 2822/0x00007f088e659700 - volumedriverfs/FuseInterface - 0000000008070b9c - error - convert_exceptions: "/EKP-TRAN.raw": caught fungi::IOException: Remote operation failed Read, code 0
This one is a read error returned by node N; further details should be in the volumedriver's log on N.
Jun 11 03:22:02 N05 volumedriver_fs.sh: 2019-06-11 03:22:02 908780 +0800 - N05 - 2822/0x00007efe99376700 - volumedriverfs/FuseInterface - 0000000008116c3e - error - convert_exceptions: "/MDM-PRD1.raw": caught fungi::IOException: request to remote node timed out , code 0
This one is a redirected request timing out. N might have been down / temporarily unavailable / ... ?
Jun 24 14:20:52 N06 volumedriver_fs.sh: 2019-06-24 14:20:52 604948 +0800 - N06 - 2763/0x00007fd268185700 - volumedriverfs/FuseInterface - 000000000541ca14 - error - convert_exceptions: "/web01onlinepub-01.raw": caught fungi::IOException: Cannot shrink volume aaa865ac-195f-4f17-beee-10620d3df4bb, code 0
This one should be unrelated to the I/O errors within VMs; it's a failed attempt to resize a volume to a size smaller than its current one, which is not supported.
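Since volumes are exposed as plain files through the FUSE interface, such a resize is typically triggered by a truncate on the .raw file. A minimal sketch of how that surfaces to a caller (the mount path is hypothetical and the exact errno is an assumption):

```python
import errno
import os

VOL = "/mnt/vpool/web01onlinepub-01.raw"  # hypothetical vpool mount path

# Growing a volume by truncating its file to a larger size is supported.
os.truncate(VOL, 100 * 1024**3)

# Shrinking is not: the volumedriver raises the IOException seen above,
# which FuseInterface converts into an errno for the caller.
try:
    os.truncate(VOL, 0)
except OSError as e:
    print("shrink refused:", errno.errorcode.get(e.errno, "?"), e)
```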
Thanks!
Hello,
I have about 80 VMs on 6 physical servers (storage nodes). I often run into the same issue on some VMs. Here are a couple of screenshots from their consoles:
If the error is "Shutting down filesystem", it usually means the VM is dying and I can't log in to it any more. The devices are defined in the XML as:
The services on these VMs are often interrupted by this issue, as I have to destroy and restart the VMs when I find they have stopped working for some time.
Best regards, Yongsheng