markfasheh / ocfs2-tools

GNU General Public License v2.0
20 stars 35 forks source link

defrag hangs on arm64, ppc64el #42

Closed panlinux closed 5 years ago

panlinux commented 5 years ago

Hi,

with version 1.8.6, a new tool was made available: defragfs.ocfs2. The debian package added a test for it, and ubuntu runs that test as well. Ubuntu runs it on more architectures, though, and while it passed in less than 1s on amd64, it hangs on arm64 and ppc64el, consuming 99% CPU. No visible I/O is happening (checked with iostat 2). and strace shows just the version being printed to stdout (ls -la and date are my own debugging, and I changed debian's test a bit to copy /usr/bin instead of /bin, which is a symlink in Ubuntu):

=== defragfs ===
+ cp -a /usr/bin /mnt
+ ls -la /mnt/
total 46
drwxr-xr-x  4 root root   824 Aug 21 17:53 .
drwxr-xr-x 20 root root  4096 Aug 21 17:52 ..
drwxr-xr-x  2 root root 34816 Aug 21 17:51 bin
drwxr-xr-x  2 root root   824 Aug 21 17:52 lost+found
+ date
Wed Aug 21 17:53:11 UTC 2019
+ defragfs.ocfs2 -v /mnt
defragfs.ocfs2 1.8.6
<stuck here>

strace:

2129  execve("/usr/sbin/defragfs.ocfs2", ["defragfs.ocfs2", "-v", "/mnt"], 0xfffffa721f38 /* 14 vars */) = 0
2129  brk(NULL)                         = 0xaaaaef690000
2129  faccessat(AT_FDCWD, "/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
2129  openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
2129  fstat(3, {st_mode=S_IFREG|0644, st_size=34705, ...}) = 0
2129  mmap(NULL, 34705, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffff95391000
2129  close(3)                          = 0
2129  openat(AT_FDCWD, "/lib/aarch64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
2129  read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0\360C\2\0\0\0\0\0"..., 832) = 832
2129  fstat(3, {st_mode=S_IFREG|0755, st_size=1446640, ...}) = 0
2129  mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff9538f000
2129  mmap(NULL, 1514856, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffff951fb000
2129  mprotect(0xffff95354000, 65536, PROT_NONE) = 0
2129  mmap(0xffff95364000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x159000) = 0xffff95364000
2129  mmap(0xffff9536a000, 11624, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffff9536a000
2129  close(3)                          = 0
2129  mprotect(0xffff95364000, 12288, PROT_READ) = 0
2129  mprotect(0xaaaac5884000, 4096, PROT_READ) = 0
2129  mprotect(0xffff9539c000, 4096, PROT_READ) = 0
2129  munmap(0xffff95391000, 34705)     = 0
2129  rt_sigaction(SIGTERM, {sa_handler=0xaaaac5872b70, sa_mask=[TERM], sa_flags=SA_RESTART}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
2129  rt_sigaction(SIGINT, {sa_handler=0xaaaac5872b70, sa_mask=[INT], sa_flags=SA_RESTART}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
2129  fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(0x88, 0), ...}) = 0
2129  brk(NULL)                         = 0xaaaaef690000
2129  brk(0xaaaaef6b1000)               = 0xaaaaef6b1000
2129  write(1, "defragfs.ocfs2 1.8.6\n", 21) = 21
<stuck here>

A few seconds of iostat 2:

$ iostat 2
Linux 5.2.0-13-generic (eoan-arm64)     08/21/19    _aarch64_   (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.97    0.38    1.61    0.03    0.00   24.01

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
loop0             0.04         0.27         0.00        799          0
loop1             0.62         1.09         0.00       3189          0
loop2            69.33        46.78        66.65     137482     195861
vda               7.44       119.87       601.05     352272    1766317

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         100.00    0.00    0.00    0.00    0.00    0.00

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
loop0             0.00         0.00         0.00          0          0
loop1             0.00         0.00         0.00          0          0
loop2             1.00         0.25         0.25          0          0
vda               0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         100.00    0.00    0.00    0.00    0.00    0.00

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
loop0             0.00         0.00         0.00          0          0
loop1             0.00         0.00         0.00          0          0
loop2             1.00         0.25         0.25          0          0
vda               1.00         0.00        12.00          0         24

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         100.00    0.00    0.00    0.00    0.00    0.00

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
loop0             0.00         0.00         0.00          0          0
loop1             0.00         0.00         0.00          0          0
loop2             1.00         0.25         0.25          0          0
vda               0.00         0.00         0.00          0          0

Kernel is 5.2.0-13-generic

This is the debian commit which added the test,and you can get the full test file from there as well: https://salsa.debian.org/ha-team/ocfs2-tools/commit/4db396488fe4c058bbc2e3b81529595aee3855ef

panlinux commented 5 years ago

(Updated description changing i386 to ppc64el, I had looked at the wrong test before)

rafaeldtinoco commented 5 years ago

Although the process does not seem to be doing anything (stack wise) based on ptrace() info, and the stack does not change at all:

(c)inaddy@ocfs2arm64:~$ sudo cat /proc/3947/stack [<0>] __switch_to+0xb8/0x1a8 [<0>] (null) [<0>] 0xffffffffffffffff

It is consuming 100% USER time CPU. Which means that, if it had switched to scheduler, no user time would be consumed UNLESS there would be changes in the user stack frames. I'm getting kernel debug symbols so I can see exactly where the CPU is stuck at, but this seems like a kernel issue in this first look.

josephhz commented 5 years ago

Is it possible that it has something to do with endian format? ocfs2 tools don't handle well in big endian architecture.

rafaeldtinoco commented 5 years ago
inaddy@ocfs2defrag:~$ sudo cat /proc/1452/stack
[<0>] 0x0

inaddy@ocfs2defrag:~$ while true; do sudo cat /proc/1452/stack; done > bleh.txt
inaddy@ocfs2defrag:~$ cat bleh.txt | sort -u
[<0>] 0x0

Stack does not help me at all (not being updated in execution path from the user<-> kernel context switch). Still, process is consuming100% of user time:

%Cpu3  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1452 root      20   0    1936    428    360 R 100.0   0.0   6:27.67 defragfs.ocfs2   

Initial check on time spend scheduling to discard kernel issue like I suspected before:

inaddy@ocfs2defrag:~$ sudo perf sched timehist -p 1452

           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
     111.555973 [0003]  defragfs.ocfs2[1452]                0.000      0.000      0.000
     112.123934 [0003]  defragfs.ocfs2[1452]                0.029      0.000    567.930
     113.571854 [0003]  defragfs.ocfs2[1452]                0.018      0.000   1447.901
     115.139790 [0003]  defragfs.ocfs2[1452]                0.027      0.000   1567.908
     115.555743 [0003]  defragfs.ocfs2[1452]                0.030      0.000    415.921
     116.123707 [0003]  defragfs.ocfs2[1452]                0.027      0.000    567.936
     116.867670 [0003]  defragfs.ocfs2[1452]                0.019      0.000    743.943
     117.571630 [0003]  defragfs.ocfs2[1452]                0.025      0.000    703.934
     118.243597 [0003]  defragfs.ocfs2[1452]                0.026      0.000    671.940
     118.443583 [0003]  defragfs.ocfs2[1452]                0.028      0.000    199.957
     119.587530 [0003]  defragfs.ocfs2[1452]                0.017      0.000   1143.929
     120.123500 [0003]  defragfs.ocfs2[1452]                0.023      0.000    535.945
     121.571458 [0003]  defragfs.ocfs2[1452]                0.016      0.000   1447.941
     123.587358 [0003]  defragfs.ocfs2[1452]                0.027      0.000   2015.871
     124.123365 [0003]  defragfs.ocfs2[1452]                0.036      0.000    535.970
     125.123284 [0003]  defragfs.ocfs2[1452]                0.020      0.000    999.899
     125.603259 [0003]  defragfs.ocfs2[1452]                0.032      0.000    479.942
     126.883212 [0003]  defragfs.ocfs2[1452]                0.029      0.000   1279.924
     127.587179 [0003]  defragfs.ocfs2[1452]                0.028      0.000    703.938
     128.123153 [0003]  defragfs.ocfs2[1452]                0.027      0.000    535.946 

It spends almost no time in waiting for CPU AND absolutely no time in scheduling (as its the only task currently really running), so its something in userland indeed.... debugging it:

(gdb)

#0  0x0000ffffbf618b10 in _getopt_internal_r 
    (argc=3, argv=0xfffffffff868, optstring=0xaaaaaaaacfc0 "gvclh",
    longopts=0x0, longind=0x0, long_only=0, d=d@entry=0xffffbf6cad88 <getopt_data>, 
    posixly_correct=<optimized out>)
    at getopt.c:508

#1  0x0000ffffbf618f0c in _getopt_internal
    (argc=<optimized out>, argv=<optimized out>, optstring=<optimized out>,
    longopts=<optimized out>, longind=<optimized out>, long_only=<optimized out>, 
    posixly_correct=<optimized out>)
    at getopt.c:711

#2  0x0000aaaaaaaab43c in parse_opt
    (_mode_flag=<synthetic pointer>, argv=0xfffffffff868, argc=3) at main.c:585

#3  main (argc=3, argv=0xfffffffff868) at main.c:653

after some time:

(gdb) bt

#0  0x0000ffffbf618950 in _getopt_internal_r 
    (argc=3, argv=0xfffffffff868, optstring=0xaaaaaaaacfc0 "gvclh",
    longopts=0x0, longind=0x0, long_only=0, d=d@entry=0xffffbf6cad88 <getopt_data>, 
    posixly_correct=<optimized out>)
    at getopt.c:500

#1  0x0000ffffbf618f0c in _getopt_internal 
    (argc=<optimized out>, argv=<optimized out>, optstring=<optimized out>,
    longopts=<optimized out>, longind=<optimized out>, long_only=<optimized out>, 
    posixly_correct=<optimized out>)
    at getopt.c:711

#2  0x0000aaaaaaaab43c in parse_opt
    (_mode_flag=<synthetic pointer>, argv=0xfffffffff868, argc=3) at main.c:585

#3  main (argc=3, argv=0xfffffffff868) at main.c:653

after some more time:

(gdb) bt

#0  0x0000ffffbf618cec in _getopt_internal_r 
    (argc=3, argv=0xfffffffff868, optstring=0xaaaaaaaacfc0 "gvclh",
    longopts=0x0, longind=0x0, long_only=0, d=d@entry=0xffffbf6cad88 <getopt_data>,   
    posixly_correct=<optimized out>)
    at getopt.c:512

#1  0x0000ffffbf618f0c in _getopt_internal
    (argc=<optimized out>, argv=<optimized out>, optstring=<optimized out>,
    longopts=<optimized out>, longind=<optimized out>, long_only=<optimized out>, 
    posixly_correct=<optimized out>)
    at getopt.c:711

#2  0x0000aaaaaaaab43c in parse_opt 
    (_mode_flag=<synthetic pointer>, argv=0xfffffffff868, argc=3) at main.c:585

#3  main (argc=3, argv=0xfffffffff868) at main.c:653

Checking glibc getopt.c code (around lines 500 <-> 512) I can see we are stuck in this loop:

  if (d->__nextchar == NULL || *d->__nextchar == '\0')
    {
      /* Advance to the next ARGV-element.  */
      ...
    }

And I'll try to discover why tomorrow (and report upstream).

rafaeldtinoco commented 5 years ago

Alright, this was easier than an issue in glibc... arm64 does not handle (char) overflow, when getting the ret value from getopt(). An overflow to (char) num will be seen as 255 in arm64, instead of -1, like the getopt() from main expects, making the getopt() logic to loop forever.

I'll suggest a small patch shortly.