lanconnected / EnhanceIO

EnhanceIO Open Source for Linux

Apparent deadlock in writeback mode #40

Closed clipcarl closed 6 years ago

clipcarl commented 6 years ago

Hi,

I'm testing out EnhanceIO in my lab, but I'm running into what appears to be a deadlock in the writeback locking code; I see EIO processes stuck in down_write() and down_read(). The issue appears to occur somewhat randomly, but I believe only after writeback has started and while data is being read from the cached device. In my testing the deadlock always occurs eventually, within a few hours to a day or two. Details follow. Please let me know what else I can do to help troubleshoot this.

Thanks, Carl
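
One convenient way to gather the per-process evidence below in a single pass (assuming root and a kernel built with CONFIG_STACKTRACE, so that /proc/<pid>/stack exists) is a small loop over every task stuck in uninterruptible sleep:

# Dump the kernel stack of every task in uninterruptible sleep (D state).
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "== PID $pid ($(cat /proc/$pid/comm)) =="
    cat /proc/$pid/stack
done
# Alternatively, "echo w > /proc/sysrq-trigger" logs all blocked tasks to dmesg.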

Here is the system state at one deadlock:

[root@tantor data]# uname -a
Linux tantor 4.14.33.cet #1 SMP Thu Apr 12 01:46:41 UTC 2018 x86_64 GNU/Linux

[root@tantor data]# free -h
              total        used        free      shared  buff/cache   available
Mem:           165G        1.7G        143G        360M         19G        161G
Swap:            0B          0B          0B

[root@tantor data]# df -h
Filesystem                  Size  Used Avail Use% Mounted on
tmpfs                       800M  361M  440M  46% /
overlay                     800M  361M  440M  46% /etc
overlay                     800M  361M  440M  46% /var/lib
overlay                     800M  361M  440M  46% /root
overlay                     800M  361M  440M  46% /data
overlay                     800M  361M  440M  46% /mnt
overlay                     800M  361M  440M  46% /usr/local
tmpfs                        17G  436K   17G   1% /run
dev                          10M     0   10M   0% /dev
shm                          83G     0   83G   0% /dev/shm
cgroup_root                  10M     0   10M   0% /sys/fs/cgroup
/dev/sdj1                   380M  2.4M  353M   1% /var/mnt/data
tmpfs                        83G  4.0K   83G   1% /tmp
/dev/mapper/tantor-vmware1   16T  344G   16T   3% /data/vmware1

[root@tantor data]# pwd
/proc/enhanceio/data

[root@tantor data]# cat config
src_name   /dev/md/r10
ssd_name   /dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_BTHV51930A69400NGN-part2
src_size   54685646848
ssd_size   88780544
set_size          256
block_size       4096
mode                1
eviction            2
num_sets       346799
num_blocks   88780544
metadata        small
state        normal
flags      0x00000000

[root@tantor data]# cat errors
disk_read_errors       0
disk_write_errors      0
ssd_read_errors        0
ssd_write_errors       0
memory_alloc_errors    0
no_cache_dev           0
no_source_dev          0

[root@tantor data]# cat io_hist
512              1
4096       1323758
8192        341080
12288       198223
16384       223120
20480       128309
24576       118333
28672       108896
32768       109278
36864        95388
40960        93576
45056        96451
49152       106226
53248        89696
57344        86880
61440        84751
65536       221431

[root@tantor data]# cat stats
reads                        1523238529
writes                        121480088
read_hits                      22576641
read_hit_pct                          1
write_hits                     71195384
write_hit_pct                        58
dirty_write_hits                4732437
dirty_write_hit_pct                   3
cached_blocks                   9643087
rd_replace                       391320
wr_replace                       437070
noroom                            52260
cleanings                             0
md_write_dirty                 10207060
md_write_clean                        0
md_ssd_writes                    313640
do_clean                              0
nr_blocks                      88780544
nr_dirty                        1745079
nr_sets                          346799
clean_index                           0
uncached_reads                  4459936
uncached_writes                    3884
uncached_map_size                     0
uncached_map_uncacheable              0
disk_reads                   1500660872
disk_writes                    69662360
ssd_reads                      90370881
ssd_writes                    155635815
ssd_readfills                  35450928
ssd_readfill_unplugs             227564
readdisk                        4459936
writedisk                       4459936
readcache                       2834380
readfill                        4431366
writecache                     19373963
readcount                       4570648
writecount                      1735523
kb_reads                      761619264
kb_writes                      60740044
rdtime_ms                      39336574
wrtime_ms                       1690530
unaligned_ios                         0

[root@tantor data]# grep "" /proc/sys/dev/enhanceio/data/*
/proc/sys/dev/enhanceio/data/autoclean_threshold:128
/proc/sys/dev/enhanceio/data/cache_wronly:0
/proc/sys/dev/enhanceio/data/control:0
/proc/sys/dev/enhanceio/data/dirty_high_threshold:30
/proc/sys/dev/enhanceio/data/dirty_low_threshold:10
/proc/sys/dev/enhanceio/data/dirty_set_high_threshold:100
/proc/sys/dev/enhanceio/data/dirty_set_low_threshold:30
/proc/sys/dev/enhanceio/data/do_clean:0
/proc/sys/dev/enhanceio/data/mem_limit_pct:75
/proc/sys/dev/enhanceio/data/time_based_clean_interval:0
/proc/sys/dev/enhanceio/data/zero_stats:0

[root@tantor ~]# ps aux |grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2787  0.0  0.0      0     0 ?        D    May11   0:02 [eio_clean_threa]
root      3019  0.0  0.0      0     0 ?        D    May11   0:15 [xfsaild/dm-4]
root      3502  0.0  0.0      0     0 ?        D    02:15   0:00 [kworker/u49:1]
root      3514 15.3  0.0      0     0 ?        D    02:18   4:10 [kworker/u48:2]
root      3529  0.0  0.0      0     0 ?        D    02:28   0:00 [kworker/14:2]
root      3538  0.0  0.0      0     0 ?        D    02:33   0:00 [kworker/14:0]
root      8553  0.0  0.0   3360   584 pts/3    S+   02:45   0:00 grep D

[root@tantor ~]# cat /proc/2787/stack
[<ffffffff946b3b03>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffc02a48e9>] eio_clean_set+0xe9/0xa60 [enhanceio]
[<ffffffffc02a5511>] eio_clean_thread_proc+0x1f1/0x3a0 [enhanceio]
[<ffffffff9407d362>] kthread+0xf2/0x130
[<ffffffff9480017f>] ret_from_fork+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

[root@tantor ~]# cat /proc/3019/stack
[<ffffffffc09111f4>] _xfs_log_force+0x144/0x280 [xfs]
[<ffffffffc091d80a>] xfsaild+0x16a/0x5b0 [xfs]
[<ffffffff9407d362>] kthread+0xf2/0x130
[<ffffffff9480017f>] ret_from_fork+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

[root@tantor ~]# cat /proc/3502/stack
[<ffffffff946b3ad4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc02a380c>] eio_map+0x7cc/0x15a0 [enhanceio]
[<ffffffffc02a939d>] eio_make_request_fn+0x33d/0x480 [enhanceio]
[<ffffffff942c4f49>] generic_make_request+0x119/0x2a0
[<ffffffff942c5137>] submit_bio+0x67/0x140
[<ffffffffc08e881b>] xfs_submit_ioend+0x7b/0x190 [xfs]
[<ffffffffc08e8998>] xfs_vm_writepages+0x68/0x90 [xfs]
[<ffffffff941199b2>] do_writepages+0x12/0x60
[<ffffffff941afaea>] __writeback_single_inode+0x2a/0x160
[<ffffffff941b00ba>] writeback_sb_inodes+0x20a/0x4c0
[<ffffffff941b03f2>] __writeback_inodes_wb+0x82/0xb0
[<ffffffff941b06e7>] wb_writeback+0x1f7/0x210
[<ffffffff941b0cbd>] wb_workfn+0x15d/0x260
[<ffffffff94077a01>] process_one_work+0x141/0x2e0
[<ffffffff94077be8>] worker_thread+0x48/0x3a0
[<ffffffff9407d362>] kthread+0xf2/0x130
[<ffffffff9480017f>] ret_from_fork+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

[root@tantor ~]# cat /proc/3514/stack
[<ffffffff946b3ad4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc02a380c>] eio_map+0x7cc/0x15a0 [enhanceio]
[<ffffffffc02a939d>] eio_make_request_fn+0x33d/0x480 [enhanceio]
[<ffffffff942c4f49>] generic_make_request+0x119/0x2a0
[<ffffffffc04fbcce>] process_cell+0x24e/0x4e0 [dm_thin_pool]
[<ffffffffc04fb2ce>] do_worker+0x2ae/0x980 [dm_thin_pool]
[<ffffffff94077a01>] process_one_work+0x141/0x2e0
[<ffffffff94077be8>] worker_thread+0x48/0x3a0
[<ffffffff9407d362>] kthread+0xf2/0x130
[<ffffffff9480017f>] ret_from_fork+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

[root@tantor ~]# cat /proc/3529/stack
[<ffffffffc09103c4>] xlog_state_get_iclog_space+0x104/0x2d0 [xfs]
[<ffffffffc09109db>] xlog_write+0x18b/0x770 [xfs]
[<ffffffffc0912586>] xlog_cil_push+0x2a6/0x450 [xfs]
[<ffffffff94077a01>] process_one_work+0x141/0x2e0
[<ffffffff94077d6c>] worker_thread+0x1cc/0x3a0
[<ffffffff9407d362>] kthread+0xf2/0x130
[<ffffffff9480017f>] ret_from_fork+0x1f/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

clipcarl commented 6 years ago

I forgot to mention that the more read activity there is, the faster the deadlock seems to occur.

I also forgot to mention that when the deadlock occurs the cached device becomes unusable: processes that read from or write to it hang. The rest of the system, however, remains fine and usable. There is nothing in the kernel log when the deadlock occurs (except for the eventual hung-task notifications).

clipcarl commented 6 years ago

[root@tantor ~]# dmesg | grep -i enhanceio
[   22.242118] enhanceio: Slow (clean) shutdown detected
[   22.242120] enhanceio: Only clean blocks exist in cache
[   22.242127] enhanceio_lru: eio_lru_instance_init: created new instance of LRU
[   22.242129] enhanceio: Setting replacement policy to lru (2)
[   22.242132] enhanceio: Allocate 346799KB (4B per) mem for 88780544-entry cache (capacity:348153MB, associativity:256, block size:4096 bytes)
[   29.550118] enhanceio: Cache metadata loaded from disk with 6 valid 0 dirty blocks
[   29.550121] enhanceio: Setting mode to write back
[   29.558057] enhanceio_lru: Initialized 346799 sets in LRU
[   31.751099] enhanceio: eio_handle_ssd_message: SSD_ADD event called for ACTIVE cache "data", ignoring!!!

[root@tantor data]# lsmod
Module                  Size  Used by
dm_snapshot            45056  0
nfsd                  327680  35
auth_rpcgss            61440  1 nfsd
oid_registry           16384  1 auth_rpcgss
lockd                  90112  1 nfsd
grace                  16384  2 nfsd,lockd
sunrpc                278528  65 auth_rpcgss,nfsd,lockd
ip6t_REJECT            16384  1
nf_reject_ipv6         16384  1 ip6t_REJECT
nf_log_ipv6            16384  5
xt_hl                  16384  22
ip6t_rt                16384  3
nf_conntrack_ipv6      16384  8
nf_defrag_ipv6         32768  1 nf_conntrack_ipv6
ipt_REJECT             16384  1
nf_reject_ipv4         16384  1 ipt_REJECT
nf_log_ipv4            16384  5
nf_log_common          16384  2 nf_log_ipv6,nf_log_ipv4
xt_LOG                 16384  10
xt_limit               16384  13
xt_tcpudp              16384  35
xt_addrtype            16384  4
nf_conntrack_ipv4      16384  8
nf_defrag_ipv4         16384  1 nf_conntrack_ipv4
xt_conntrack           16384  16
ip6table_filter        16384  1
ip6_tables             24576  1 ip6table_filter
ipv6                  483328  87 nf_conntrack_ipv6,nf_reject_ipv6,nf_defrag_ipv6
nf_conntrack_netbios_ns    16384  0
nf_conntrack_broadcast    16384  1 nf_conntrack_netbios_ns
nf_nat_ftp             16384  0
nf_nat                 32768  1 nf_nat_ftp
nf_conntrack_ftp       16384  1 nf_nat_ftp
nf_conntrack          139264  8 nf_conntrack_ipv6,nf_conntrack_ftp,nf_conntrack_ipv4,nf_conntrack_broadcast,nf_nat_ftp,nf_conntrack_netbios_ns,xt_conntrack,nf_nat
iptable_filter         16384  1
ip_tables              24576  1 iptable_filter
x_tables               40960  13 xt_LOG,ipt_REJECT,ip_tables,iptable_filter,xt_tcpudp,xt_limit,ip6t_REJECT,ip6table_filter,xt_addrtype,ip6t_rt,xt_conntrack,ip6_tables,xt_hl
xfs                   856064  1
dm_thin_pool           77824  4
dm_bio_prison          20480  1 dm_thin_pool
dm_persistent_data     81920  1 dm_thin_pool
dm_bufio               32768  2 dm_persistent_data,dm_snapshot
raid10                 57344  1
kvdo                  561152  0
uds                   290816  1 kvdo
iscsi_scst            229376  3
scst_vdisk            180224  0
scst_user              98304  0
scst_tape              16384  0
scst_raid              16384  0
scst_processor         16384  0
scst_modisk            16384  0
scst_disk              20480  0
scst_changer           16384  0
scst_cdrom             16384  0
scst                 1032192  10 scst_changer,scst_processor,scst_modisk,scst_cdrom,scst_vdisk,scst_disk,iscsi_scst,scst_raid,scst_tape,scst_user
dlm                   188416  1 scst
enhanceio_rand         16384  0
enhanceio_fifo         16384  0
enhanceio_lru          16384  1
enhanceio             184320  3 enhanceio_lru,enhanceio_rand,enhanceio_fifo
dm_writeboost          49152  0
dm_mod                135168  21 kvdo,dm_bufio,dm_thin_pool,dm_writeboost,dm_snapshot
dax                    20480  1 dm_mod
libcrc32c              16384  7 scst_vdisk,nf_conntrack,iscsi_scst,xfs,dm_persistent_data,dm_writeboost,nf_nat
zfs                  3637248  0
zunicode              331776  1 zfs
zavl                   16384  1 zfs
icp                   278528  1 zfs
zcommon                73728  1 zfs
znvpair                90112  2 zcommon,zfs
spl                   114688  4 znvpair,zcommon,zfs,icp
sr_mod                 28672  0
cdrom                  45056  1 sr_mod
pata_acpi              16384  0
crc32_pclmul           16384  0
pcbc                   16384  0
aesni_intel           188416  0
aes_x86_64             20480  1 aesni_intel
crypto_simd            16384  1 aesni_intel
glue_helper            16384  1 aesni_intel
ghash_clmulni_intel    16384  0
ata_generic            16384  0
cryptd                 24576  3 crypto_simd,ghash_clmulni_intel,aesni_intel
kvm_intel             217088  0
kvm                   458752  1 kvm_intel
iTCO_wdt               16384  0
iTCO_vendor_support    16384  1 iTCO_wdt
irqbypass              16384  1 kvm
intel_cstate           16384  0
ata_piix               36864  0
coretemp               16384  0
crct10dif_pclmul       16384  0
crc32c_intel           24576  1
serio_raw              16384  0
libata                237568  3 ata_piix,ata_generic,pata_acpi
ixgbe                 294912  0
input_leds             16384  0
e1000e                217088  0
i2c_i801               24576  0
igb                   196608  0
mpt3sas               253952  16
lpc_ich                28672  0
mdio                   16384  1 ixgbe
i2c_algo_bit           16384  1 igb
ptp                    20480  3 ixgbe,igb,e1000e
ioatdma                49152  0
uas                    28672  0
pps_core               16384  1 ptp
i5500_temp             16384  0
dca                    16384  3 ioatdma,ixgbe,igb
raid_class             16384  1 mpt3sas
hwmon                  20480  4 ixgbe,igb,i5500_temp,coretemp
button                 16384  0
acpi_cpufreq           16384  1

lanconnected commented 6 years ago

Hi! Thanks for reporting. Does your system remain in the locked state forever? Can you check whether it is possible to access the SSD/HDD directly while eio is locked (e.g., with dd)?
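
For example, something along these lines (reusing the device paths from your config output above; iflag=direct bypasses the page cache so the reads actually hit the devices) would show whether the underlying drives still respond:

# Read a little data directly from the source (HDD) and the cache (SSD) device.
dd if=/dev/md/r10 of=/dev/null bs=1M count=16 iflag=direct
dd if=/dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_BTHV51930A69400NGN-part2 \
   of=/dev/null bs=1M count=16 iflag=direct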

clipcarl commented 6 years ago

Hi @lanconnected. I will try to set up a test server this week to retest and reproduce this, and hopefully I'll have an answer to your question soon.

Are there people using EnhanceIO on Linux 4.14 systems? If so, I wonder what's different between my test setup and theirs...

clipcarl commented 6 years ago

Quick question: when setting up a new cache, what tells the driver that a new cache needs to be initialized? The create and setup eio_cli subcommands essentially seem to call the same ioctls, so I assume the driver determines whether it needs to initialize a new cache by checking whether the superblock is valid. If that's the case, I assume the proper way to tell the driver to create a new cache is to zero the first block of the cache device? That's the sort of thing that should be in the documentation (but doesn't seem to be). I have tried zeroing the entire cache device to be sure, but that didn't make a difference (still deadlocked).
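
For reference, by "zero the first block" I mean something along the lines of the dd below; the 8 MiB count is just a generous guess at how much metadata sits at the front of the cache device, and of course this is only safe with the cache deleted:

# Wipe the start of the cache device (where the superblock presumably lives).
# Destructive: run only with the cache removed. 8 MiB is an arbitrary, generous guess.
dd if=/dev/zero \
   of=/dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_BTHV51930A69400NGN-part2 \
   bs=1M count=8 oflag=direct conv=fsync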

lanconnected commented 6 years ago

That's what the "persistent" parameter is for. It is set in eio_cli and eventually passed to eio_cache_create() in eio_conf.c. It is not necessary to wipe the SSD before cache creation.

lanconnected commented 6 years ago

Closing this issue due to lack of feedback.

saba69 commented 4 years ago

Hi, I am facing the same problem using the latest version on kernel 4.14.34. The details are reported in issue #54.

Thanks

saba69 commented 4 years ago

> Hi! Thanks for reporting. Does your system remain in the locked state forever? Can you check whether it is possible to access the SSD/HDD directly while eio is locked (e.g., with dd)?

When eio is locked, IO can be issued directly to both the source and cache devices.
