iovisor / bcc

BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
Apache License 2.0
20.63k stars 3.89k forks source link

biosnoop reports unrealistic latency #1077

Open bobrik opened 7 years ago

bobrik commented 7 years ago

Something funny is happening in biosnoop and biolatency when asked to exclude kernel queuing.

I issue 10 x 512 byte writes each followed by fsync directly to /dev/sdb that is a spinning disk and here's what I see:

$ sudo /usr/share/bcc/tools/biosnoop | fgrep --line-buffered sdb
0.000000000    paranoid       17553  sdb     R  0         4096       0.23
0.000329000    paranoid       17553  sdb     W  0         4096       0.20
4.338511000    paranoid       17553  sdb     W  0         4096       0.27
5.361153000    paranoid       17553  sdb     W  0         4096       0.26
6.383771000    paranoid       17553  sdb     W  0         4096       0.26
7.406433000    paranoid       17553  sdb     W  0         4096       0.27
8.429084000    paranoid       17553  sdb     W  0         4096       0.26
9.451788000    paranoid       17553  sdb     W  0         4096       0.26
10.474409000   paranoid       17553  sdb     W  0         4096       0.26
$ sudo /usr/share/bcc/tools/biolatency -TD
Tracing block device I/O... Hit Ctrl-C to end.
^C
22:27:42

disk = 'sdb'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |
       128 -> 255        : 63       |****************************************|
       256 -> 511        : 12       |*******                                 |
$ sudo /usr/share/bcc/tools/biolatency -TDQ
Tracing block device I/O... Hit Ctrl-C to end.
^C
22:29:34

disk = 'sdb'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |
       128 -> 255        : 64       |****************************************|
       256 -> 511        : 9        |*****                                   |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 10       |******                                  |

Note that here I see my exact 10 requests, but I don't see that without queuing.

ivan@36s8:~$ sudo go run /tmp/paranoid.go -file=/dev/sdb -size=512 -count=10
2017/03/29 22:28:04 Synced in 27453 μs
2017/03/29 22:28:05 Synced in 22506 μs
2017/03/29 22:28:06 Synced in 22503 μs
2017/03/29 22:28:07 Synced in 22401 μs
2017/03/29 22:28:08 Synced in 22482 μs
2017/03/29 22:28:09 Synced in 22615 μs
2017/03/29 22:28:10 Synced in 22485 μs
2017/03/29 22:28:11 Synced in 22513 μs
2017/03/29 22:28:12 Synced in 22597 μs
2017/03/29 22:28:13 Synced in 22533 μs

If I write to a filesystem, latency numbers from biosnoop look more realistic:

235.470488000  paranoid       19260  sdc     W  3921085456 4096      21.05
236.471146000  paranoid       19260  sdc     W  18344     4096       0.36
236.513255000  paranoid       19260  sdc     W  3921085464 4096      21.11
237.513787000  paranoid       19260  sdc     W  18344     4096       0.26
237.555953000  paranoid       19260  sdc     W  3921085472 4096      21.11

But then again, my app now sees 42ms latency because it has to write two sectors on disk.

This is happening on 4.4.44.

I also see similar weirdness on SSDs when reported latency is 0.09ms + 0.04ms (that's on a filesystem), but app perceives that as 1.2ms (0.09+0.04 is 0.13ms, 10x difference).

bobrik commented 7 years ago

If I do 4k writes with no pause between them:

$ sudo go run /tmp/paranoid.go -file=/dev/sdb -size=4096 -sleep=0ms
...
2017/03/30 04:29:40 Synced in 10031 μs | 10.031391ms
2017/03/30 04:29:40 Synced in 10050 μs | 10.050549ms
2017/03/30 04:29:41 Synced in 10040 μs | 10.04077ms
2017/03/30 04:29:41 Synced in 10054 μs | 10.054079ms
2017/03/30 04:29:41 Synced in 10032 μs | 10.032228ms
2017/03/30 04:29:41 Synced in 10036 μs | 10.036565ms
2017/03/30 04:29:41 Synced in 10047 μs | 10.04728ms
2017/03/30 04:29:41 Synced in 10021 μs | 10.021358ms
2017/03/30 04:29:41 Synced in 10064 μs | 10.064109ms
2017/03/30 04:29:41 Synced in 10032 μs | 10.032702ms
2017/03/30 04:29:41 Synced in 10007 μs | 10.007492ms
2017/03/30 04:29:41 Synced in 10277 μs | 10.277303ms

In biosnoop I can clearly see 10ms pause between issued writes:

67.830680000   paranoid       3676   sdb     W  6024      4096       0.66
67.840708000   paranoid       3676   sdb     W  6032      4096       0.65
67.850777000   paranoid       3676   sdb     W  6040      4096       0.65
67.860824000   paranoid       3676   sdb     W  6048      4096       0.64
67.870924000   paranoid       3676   sdb     W  6056      4096       0.65
67.880957000   paranoid       3676   sdb     W  6064      4096       0.65
67.891021000   paranoid       3676   sdb     W  6072      4096       0.67
67.901027000   paranoid       3676   sdb     W  6080      4096       0.64
67.911102000   paranoid       3676   sdb     W  6088      4096       0.65
67.921144000   paranoid       3676   sdb     W  6096      4096       0.64
67.931236000   paranoid       3676   sdb     W  6104      4096       0.65
67.941263000   paranoid       3676   sdb     W  6112      4096       0.63

Why kernel decides it's ok to have slower syncs when there are pauses between syncs is an interesting question too, but latencies from biosnoop are still not right.

Code I use:

package main

import (
    "flag"
    "log"
    "os"
    "time"
)

func main() {
    name := flag.String("file", "", "filename to use")
    size := flag.Int("size", 1024*10, "chunk size")
    sync := flag.Int("sync", 1, "sync file after that many writes, <= 0 to disable")
    sleep := flag.Duration("sleep", time.Second, "time to sleep between writes")
    count := flag.Int("count", 0, "number of iterations, <=0 for infinity")

    flag.Parse()

    if *name == "" {
        flag.PrintDefaults()
    }

    file, err := os.OpenFile(*name, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0666)
    if err != nil {
        log.Fatalf("Error opening file %q: %s", *name, err)
    }

    buf := make([]byte, *size)

    i := 0
    for {
        i++

        started := time.Now()

        verb := "Written"

        if _, err := file.Write(buf); err != nil {
            log.Fatalf("Error writing: %s", err)
        }

        if *sync > 0 && i%*sync == 0 {
            if err := file.Sync(); err != nil {
                log.Fatalf("Error syncing: %s", err)
            }

            verb = "Synced"
        }

        elapsed := time.Since(started)

        log.Printf("%s in %d μs | %v", verb, int(elapsed.Seconds()*1e6), elapsed)

        if *count > 0 && i >= *count {
            break
        }

        time.Sleep(*sleep)
    }
}
goldshtn commented 7 years ago

/cc @brendangregg

brendangregg commented 7 years ago

I issue 10 x 512 byte writes each followed by fsync directly to /dev/sdb that is a spinning disk and here's what I see:

biosnoop (0.20ms):

$ sudo /usr/share/bcc/tools/biosnoop | fgrep --line-buffered sdb
0.000000000    paranoid       17553  sdb     R  0         4096       0.23
0.000329000    paranoid       17553  sdb     W  0         4096       0.20
4.338511000    paranoid       17553  sdb     W  0         4096       0.27
5.361153000    paranoid       17553  sdb     W  0         4096       0.26
6.383771000    paranoid       17553  sdb     W  0         4096       0.26
7.406433000    paranoid       17553  sdb     W  0         4096       0.27
8.429084000    paranoid       17553  sdb     W  0         4096       0.26
9.451788000    paranoid       17553  sdb     W  0         4096       0.26
10.474409000   paranoid       17553  sdb     W  0         4096       0.26

Looks good to me.

It would be nice for it to include the flush events as well, but those kprobes don't in this case.

Edit: in case I'm not clear, I mean the biosnoop output should contain these as separate lines of output: one for the [W]rite event (0.25ms), and then the [S]ync event (10ms).

This reminds me of #826. Tracepoints cover the flush/sync events, and the bio tools should be rewritten to use them, at which point we'll just get both events for free. I haven't been in a hurry because it'll break everyone on 4.4, until they get to 4.7 (or discover the older bio tools in tools/old). So if the bio tools need minimum fixes for kprobes, we can continue doing that, and then later switch to tracepoints. But if the bio* tools become too badly broken, it might be better to bite the bullet and switch to tracepoints early...

I think I can guess why it's not seeing syncs: for syncs, blk_start_request() uses a different struct request *req, which doesn't get paired up on completion.

biolatency with queuing time (16-32ms):

Tracing block device I/O... Hit Ctrl-C to end.
^C
22:29:34

disk = 'sdb'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |
       128 -> 255        : 64       |****************************************|
       256 -> 511        : 9        |*****                                   |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 10       |******                                  |

In this case it's seeing both the writes and the syncs, and they are getting paired up correctly (same struct request on issue and completion).

biosnoop is missing the syncs. It's either fixable with tracepoints (#826) or finding the right kprobe.