lanconnected / EnhanceIO

EnhanceIO Open Source for Linux
Other
101 stars 31 forks source link

edit cache mode,CPU stuck #18

Closed jackyding2679 closed 7 years ago

jackyding2679 commented 7 years ago

When i edit cache mode in our production environment like this: eio_cli edit -c enhanceio1 -m ro

CPU stucks like this: Jun 26 15:56:42 storage27 kernel: BUG: soft lockup - CPU#3 stuck for 22s! [eio_cli:29929] Jun 26 15:56:42 storage27 kernel: Modules linked in: iptable_filter xt_multiport ip_tables enhanceio_rand(OF) enhanceio_lru(OF) enhanceio_fifo(OF) enhanceio(OF) 8021q garp stp mrp llc bonding nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul iTCO_wdt crc32_pclmul crc32c_intel iTCO_vendor_support ghash_clmulni_intel aesni_intel wmi lrw i2c_i801 gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ioatdma lpc_ich mfd_core ses enclosure mei_me mei shpchp ipmi_si ipmi_msghandler xfs libcrc32c sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt isci drm_kms_helper ttm libsas ixgbe drm ahci scsi_transport_sas libahci igb libata mdio ptp pps_core megaraid_sas i2c_algo_bit i2c_core dca Jun 26 15:56:42 storage27 kernel: CPU: 3 PID: 29929 Comm: eio_cli Tainted: GF O-------------- 3.10.0-229.el7.x86_64 #1 Jun 26 15:56:42 storage27 kernel: Hardware name: Inspur SA5212H2/SA5212H2, BIOS 2.0.4 03/25/2014 Jun 26 15:56:42 storage27 kernel: task: ffff880ee8d5b8e0 ti: ffff880917994000 task.ti: ffff880917994000 Jun 26 15:56:42 storage27 kernel: RIP: 0010:[] [] eio_finish_nrdirty+0xd8/0x1a0 [enhanceio] Jun 26 15:56:42 storage27 kernel: RSP: 0018:ffff880917997e18 EFLAGS: 00000246 Jun 26 15:56:42 storage27 kernel: RAX: 0000000000000200 RBX: 000000000000ac0e RCX: 0000000000009eda Jun 26 15:56:42 storage27 kernel: RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246 Jun 26 15:56:42 storage27 kernel: RBP: ffff880917997e28 R08: 0000000000000246 R09: 00000000eb1c33c8 Jun 26 15:56:42 storage27 kernel: R10: 000000000150e700 R11: 0000000000000007 R12: 0000000000000246 Jun 26 15:56:42 storage27 kernel: R13: 00000000eb1c33c8 R14: 000000000150e700 R15: 0000000000000007 Jun 26 15:56:42 storage27 kernel: FS: 00007f52aa343740(0000) GS:ffff88085fc60000(0000) knlGS:0000000000000000 Jun 26 15:56:42 storage27 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jun 26 15:56:42 storage27 kernel: CR2: 0000000015cb3000 CR3: 0000000922584000 CR4: 00000000001407e0 Jun 26 15:56:42 storage27 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jun 26 15:56:42 storage27 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jun 26 15:56:42 storage27 kernel: Stack: Jun 26 15:56:42 storage27 kernel: 0000000000000002 ffff880854080124 ffff880917997e78 ffffffffa03df81b Jun 26 15:56:42 storage27 kernel: ffff880079b48da0 0000000000d396d0 0000000500000000 0000000000d396d0 Jun 26 15:56:42 storage27 kernel: ffffc90016af6000 0000000000d396d0 0000000000d396d0 0000000000000000 Jun 26 15:56:42 storage27 kernel: Call Trace: Jun 26 15:56:42 storage27 kernel: [] eio_cache_edit+0x3ab/0x650 [enhanceio] Jun 26 15:56:42 storage27 kernel: [] eio_ioctl+0x265/0x310 [enhanceio] Jun 26 15:56:42 storage27 kernel: [] do_vfs_ioctl+0x2e5/0x4c0 Jun 26 15:56:42 storage27 kernel: [] ? SYSC_newfstat+0x34/0x60 Jun 26 15:56:42 storage27 kernel: [] SyS_ioctl+0xa1/0xc0 Jun 26 15:56:42 storage27 kernel: [] system_call_fastpath+0x16/0x1b

Jun 26 16:03:02 storage27 kernel: BUG: soft lockup - CPU#3 stuck for 22s! [eio_cli:29929] Jun 26 16:03:02 storage27 kernel: Modules linked in: iptable_filter xt_multiport ip_tables enhanceio_rand(OF) enhanceio_lru(OF) enhanceio_fifo(OF) enhanceio(OF) 8021q garp stp mrp llc bonding nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul iTCO_wdt crc32_pclmul crc32c_intel iTCO_vendor_support ghash_clmulni_intel aesni_intel wmi lrw i2c_i801 gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ioatdma lpc_ich mfd_core ses enclosure mei_me mei shpchp ipmi_si ipmi_msghandler xfs libcrc32c sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt isci drm_kms_helper ttm libsas ixgbe drm ahci scsi_transport_sas libahci igb libata mdio ptp pps_core megaraid_sas i2c_algo_bit i2c_core dca Jun 26 16:03:02 storage27 kernel: CPU: 3 PID: 29929 Comm: eio_cli Tainted: GF O-------------- 3.10.0-229.el7.x86_64 #1 Jun 26 16:03:02 storage27 kernel: Hardware name: Inspur SA5212H2/SA5212H2, BIOS 2.0.4 03/25/2014 Jun 26 16:03:02 storage27 kernel: task: ffff880ee8d5b8e0 ti: ffff880917994000 task.ti: ffff880917994000 Jun 26 16:03:02 storage27 kernel: RIP: 0010:[] [] _raw_spin_unlock_irqrestore+0x1b/0x40 Jun 26 16:03:02 storage27 kernel: RSP: 0018:ffff880917997de0 EFLAGS: 00000246 Jun 26 16:03:02 storage27 kernel: RAX: 0000000000000246 RBX: 00000000000151ac RCX: 0000000000005c5a Jun 26 16:03:02 storage27 kernel: RDX: 000000005c5a5c5a RSI: 0000000000000246 RDI: 0000000000000246 Jun 26 16:03:02 storage27 kernel: RBP: ffff880917997de8 R08: 0000000000000246 R09: 00000000eb1c33c8 Jun 26 16:03:02 storage27 kernel: R10: 000000000150e700 R11: 0000000000000007 R12: 00000000005ef860 Jun 26 16:03:02 storage27 kernel: R13: 00000000005ef860 R14: 000000000150e600 R15: ffff8808541f1bf0 Jun 26 16:03:02 storage27 kernel: FS: 00007f52aa343740(0000) GS:ffff88085fc60000(0000) knlGS:0000000000000000 Jun 26 16:03:02 storage27 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jun 26 16:03:02 storage27 kernel: CR2: 0000000015cb3000 CR3: 0000000922584000 CR4: 00000000001407e0 Jun 26 16:03:02 storage27 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jun 26 16:03:02 storage27 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jun 26 16:03:02 storage27 kernel: Stack: Jun 26 16:03:02 storage27 kernel: ffff880854080000 ffff880917997e08 ffffffffa03d9ef9 ffff880854080000 Jun 26 16:03:02 storage27 kernel: ffffffffa03f30a0 ffff880917997e28 ffffffffa03de1f0 0000000000000002 Jun 26 16:03:02 storage27 kernel: ffff880854080124 ffff880917997e78 ffffffffa03df81b ffff880079b48da0 Jun 26 16:03:02 storage27 kernel: Call Trace: Jun 26 16:03:02 storage27 kernel: [] eio_clean_all+0xb9/0xe0 [enhanceio] Jun 26 16:03:02 storage27 kernel: [] eio_finish_nrdirty+0xe0/0x1a0 [enhanceio] Jun 26 16:03:02 storage27 kernel: [] eio_cache_edit+0x3ab/0x650 [enhanceio] Jun 26 16:03:02 storage27 kernel: [] eio_ioctl+0x265/0x310 [enhanceio] Jun 26 16:03:02 storage27 kernel: [] do_vfs_ioctl+0x2e5/0x4c0 Jun 26 16:03:02 storage27 kernel: [] ? SYSC_newfstat+0x34/0x60 Jun 26 16:03:02 storage27 kernel: [] SyS_ioctl+0xa1/0xc0 Jun 26 16:03:02 storage27 kernel: [] system_call_fastpath+0x16/0x1b

Jun 26 16:01:38 storage27 kernel: BUG: soft lockup - CPU#3 stuck for 22s! [eio_cli:29929] Jun 26 16:01:38 storage27 kernel: Modules linked in: iptable_filter xt_multiport ip_tables enhanceio_rand(OF) enhanceio_lru(OF) enhanceio_fifo(OF) enhanceio(OF) 8021q garp stp mrp llc bonding nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul iTCO_wdt crc32_pclmul crc32c_intel iTCO_vendor_support ghash_clmulni_intel aesni_intel wmi lrw i2c_i801 gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ioatdma lpc_ich mfd_core ses enclosure mei_me mei shpchp ipmi_si ipmi_msghandler xfs libcrc32c sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt isci drm_kms_helper ttm libsas ixgbe drm ahci scsi_transport_sas libahci igb libata mdio ptp pps_core megaraid_sas i2c_algo_bit i2c_core dca Jun 26 16:01:38 storage27 kernel: CPU: 3 PID: 29929 Comm: eio_cli Tainted: GF O-------------- 3.10.0-229.el7.x86_64 #1 Jun 26 16:01:38 storage27 kernel: Hardware name: Inspur SA5212H2/SA5212H2, BIOS 2.0.4 03/25/2014 Jun 26 16:01:38 storage27 kernel: task: ffff880ee8d5b8e0 ti: ffff880917994000 task.ti: ffff880917994000 Jun 26 16:01:38 storage27 kernel: RIP: 0010:[] [] _raw_spin_lock_irqsave+0x9/0x60 Jun 26 16:01:38 storage27 kernel: RSP: 0018:ffff880917997de8 EFLAGS: 00000246 Jun 26 16:01:38 storage27 kernel: RAX: 00000000000151ad RBX: 00000000000151ac RCX: 0000000000000008 Jun 26 16:01:38 storage27 kernel: RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff880854080124 Jun 26 16:01:38 storage27 kernel: RBP: ffff880917997de8 R08: 0000000000000246 R09: 00000000eb1c33c8 Jun 26 16:01:38 storage27 kernel: R10: 000000000150e700 R11: 0000000000000007 R12: 00000000005ef860 Jun 26 16:01:38 storage27 kernel: R13: 00000000005ef860 R14: 000000000150e600 R15: ffff8808541f1bf0 Jun 26 16:01:38 storage27 kernel: FS: 00007f52aa343740(0000) GS:ffff88085fc60000(0000) knlGS:0000000000000000 Jun 26 16:01:38 storage27 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jun 26 16:01:38 storage27 kernel: CR2: 0000000015cb3000 CR3: 0000000922584000 CR4: 00000000001407e0 Jun 26 16:01:38 storage27 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jun 26 16:01:38 storage27 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jun 26 16:01:38 storage27 kernel: Stack: Jun 26 16:01:38 storage27 kernel: ffff880917997e08 ffffffffa03d9ee7 ffff880854080000 ffffffffa03f30a0 Jun 26 16:01:38 storage27 kernel: ffff880917997e28 ffffffffa03de1f0 0000000000000002 ffff880854080124 Jun 26 16:01:38 storage27 kernel: ffff880917997e78 ffffffffa03df81b ffff880079b48da0 0000000000d396d0 Jun 26 16:01:38 storage27 kernel: Call Trace: Jun 26 16:01:38 storage27 kernel: [] eio_clean_all+0xa7/0xe0 [enhanceio] Jun 26 16:01:38 storage27 kernel: [] eio_finish_nrdirty+0xe0/0x1a0 [enhanceio] Jun 26 16:01:38 storage27 kernel: [] eio_cache_edit+0x3ab/0x650 [enhanceio] Jun 26 16:01:38 storage27 kernel: [] eio_ioctl+0x265/0x310 [enhanceio] Jun 26 16:01:38 storage27 kernel: [] do_vfs_ioctl+0x2e5/0x4c0 Jun 26 16:01:38 storage27 kernel: [] ? SYSC_newfstat+0x34/0x60 Jun 26 16:01:38 storage27 kernel: [] SyS_ioctl+0xa1/0xc0 Jun 26 16:01:38 storage27 kernel: [] system_call_fastpath+0x16/0x1b

Before and during the editing,there were some error for the SATA disk:

Jun 26 13:17:36 storage27 kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jun 26 13:17:36 storage27 kernel: sd 0:2:1:0: [sdb] CDB: Jun 26 13:17:36 storage27 kernel: Write(16): 8a 00 00 00 00 00 02 73 72 a8 00 00 00 08 00 00 Jun 26 13:17:36 storage27 kernel: end_request: I/O error, dev sdb, sector 41120424 Jun 26 13:17:36 storage27 kernel: sd 0:2:1:0: [sdb]
Jun 26 13:17:36 storage27 kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jun 26 13:17:36 storage27 kernel: sd 0:2:1:0: [sdb] CDB: Jun 26 13:17:36 storage27 kernel: Write(16): 8a 00 00 00 00 00 02 73 72 90 00 00 00 08 00 00 Jun 26 13:17:36 storage27 kernel: end_request: I/O error, dev sdb, sector 41120400

Jun 26 15:56:16 storage27 kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jun 26 15:56:16 storage27 kernel: sd 0:2:1:0: [sdb] CDB: Jun 26 15:56:16 storage27 kernel: Write(16): 8a 00 00 00 00 00 00 95 02 c8 00 00 00 08 00 00 Jun 26 15:56:16 storage27 kernel: end_request: I/O error, dev sdb, sector 9765576 Jun 26 15:56:16 storage27 kernel: sd 0:2:1:0: [sdb]
Jun 26 15:56:16 storage27 kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Jun 26 15:56:16 storage27 kernel: sd 0:2:1:0: [sdb] CDB: Jun 26 15:56:16 storage27 kernel: Write(16): 8a 00 00 00 00 00 00 95 00 68 00 00 00 08 00 00 Jun 26 15:56:16 storage27 kernel: end_request: I/O error, dev sdb, sector 9764968 Jun 26 15:56:42 storage27 kernel: BUG: soft lockup - CPU#3 stuck for 22s! [eio_cli:29929]

Maybe the I/O error on the SATA disk cause the CPU stuck,any ideas?

lanconnected commented 7 years ago

If affected drive (sdb) was used as cache HDD of cache SSD , i/o errors on it would definitely cause cpu lock-ups while editing the cache.

rezass commented 7 years ago

@lanconnected If HDD has I/O errors during editing cache, i can cause CPU lockup? or If SSD has I/O errors.

lanconnected commented 7 years ago

Both situations (I/O errors on HDD or I/O errors on SSD) would cause cpu lockup while editing the cache. The reason is that eio_clean_all is called under a spinlock. CPU will be released eventually when I/O error is propagated to upper layers. AFAIK, only editing wb mode is affected. This is not a bug, but "overlocking" in EIO is a bit annoying sometimes.

jackyding2679 commented 7 years ago

messages.zip

I've upload the whole messages.

I read the code ,I don't think that eio_clean_all or eio_clean_set is called under any spinlock,but eio_clean_set is called under a semaphore. In function eio_clean_all, when all cachesets are flushed,it get a spinlock to change sysctl_active.do_clean:

     spin_lock_irqsave(&dmc->cache_spin_lock, flags);
dmc->sysctl_active.do_clean &= ~EIO_CLEAN_START;
spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);

and this goes very fast.

lanconnected commented 7 years ago

You are right about locks in eio_clean, sorry, didn't look carefully enough - shame on me :) But looking at your messages i'am surprised that cache works at all - your sdb seems to be in really bad condition. I doubt that EIO could really be blamed for cpu stucks when used on disk with virtually no healthy sector at all - you have > 4000 reported I/O errors on it not counting dropped ones.

jackyding2679 commented 7 years ago

At first,i think it is the bad condition of sdb that cause the CPU stucks. Then i pull out and plug in the SATA disk,and check the health condition of the disk,it shows OK, alse i cann't reproduce this issue. Maybe there were some physical connection problems with this disk at the time when the issue occured.