iternity-dotcom / minio

High Performance, Kubernetes Native Object Storage
https://min.io/download
GNU Affero General Public License v3.0

Avoid performance peaks and drops for s3:ListObjectsV2 on nasxl #21

Open aweisser opened 2 years ago

aweisser commented 2 years ago

With commits f8c2946 and bd82a93 we passed the ContinuationToken as a ForwardTo marker to the underlying WalkDir function to speed up the s3:ListObjectsV2 operation. This fix improved the listing performance, but there are still peaks and performance drops as more and more objects are added, as you can see in the following plot.

The red line is AWS S3. You can see it scales linearly.

The blue line is the nasxl code without the ContinuationToken optimisation that was introduced in commits f8c2946 and bd82a93. It loses performance every 4500 objects.

The yellow line is nasxl with the ContinuationToken optimisation. The performance loss every 4500 objects is gone, but there are still peaks and a huge performance drop at ~16000 objects.

grafik
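For readers new to this part of the code, here is a minimal, hedged sketch of the idea behind the fix: the S3 ContinuationToken is handed down to the directory walk as a "forward to" marker, so entries that sort before the token are skipped instead of being re-read on every page. The type and field names below are simplified stand-ins, not the actual MinIO internals.

package main

import "fmt"

// walkDirOptions is a simplified stand-in for the real walk options.
type walkDirOptions struct {
    BaseDir   string
    ForwardTo string // skip ahead to this entry instead of walking from the start
}

// listWithToken translates the S3 ContinuationToken into a ForwardTo marker.
func listWithToken(baseDir, continuationToken string, walk func(walkDirOptions)) {
    opts := walkDirOptions{BaseDir: baseDir}
    if continuationToken != "" {
        opts.ForwardTo = continuationToken
    }
    walk(opts)
}

func main() {
    listWithToken("ls-test/", "42/file17", func(o walkDirOptions) {
        fmt.Printf("walking %s, forwarding to %q\n", o.BaseDir, o.ForwardTo)
    })
}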

The numbers are generated with the following test script:

#!/bin/bash

# 1. Make sure that you've created an empty test bucket (e.g. ls-test) and an mc alias (e.g. nasxl)
# 2. Call the script like this: ./measure-ls.sh nasxl/ls-test 1 200 50
# This will create a folder structure with one level (folders named 1 - 200) containing 50 objects in each folder.
# The script will generate two .csv files for PUT object and ListObjects latencies (in secs)

trap "exit" INT

TIMEFORMAT=%R

bucket=$1
min_folders=$2
max_folders=$3
files_per_folder=$4
outfile_prefix=`(echo $bucket | tr '/' '.')`
mc="./mc"

touch file

for i in $(seq $min_folders $max_folders); do
    for j in $(seq 1 $files_per_folder); do
        echo `(time $mc cp file $bucket/$i/file$j --quiet --insecure >>/dev/null) 2>&1` >> $outfile_prefix.PUT.$min_folders.$max_folders.$files_per_folder.csv
    done
    echo `(time $mc ls $bucket --recursive --insecure | wc -l) 2>&1` |  tr ' ' ';' >> $outfile_prefix.LIST.$min_folders.$max_folders.$files_per_folder.csv
done

Expected Behavior

The listing operation should scale linearly, like AWS S3. The slope, however, doesn't need to be the same as AWS S3's.

tristanexsquare commented 2 years ago

A test with an AWS EC2 t2.medium instance using EBS storage type gp2 did not produce the behaviour described above. The response time of ListObjectsV2 scaled linearly: ec2-mynasxl.ls-test.LIST.1.2000.50.csv

image

aweisser commented 2 years ago

@tristanessquare: Please try to reproduce this behavior on the provided test system (full stack).

aweisser commented 2 years ago

Prio 1 is to reproduce the behavior and identify the root cause of the performance drop at ~16k objects (issue #21). The goal is to ensure linear scaling.

Prio 2 is to further optimize the listing performance, i.e. reduce the slope of this linear graph (issue #20).

tristanexsquare commented 2 years ago

image

I could reproduce the same behaviour with MinIO version "no-trash-continuationmarker" on the test environment. To analyze the root cause of the peak I created a bucket named under with 15800 objects and another bucket named over with 16400 objects. Then I started profiling a list-objects call to each of these buckets. You can find the results in the attached files. List-objects for under took about 4 seconds; the same call for over took about 10 seconds.

profile-under.zip profile-over.zip

aweisser commented 2 years ago

@tristanessquare: Can you identify the root cause (at the Minio level) for this?

tristanexsquare commented 2 years ago

Not yet, we are working on it. I just wanted to update this issue to reflect our current state of work.

tristanexsquare commented 2 years ago

We have also looked into the nas mode and found the same behaviour: a sudden peak depending on the number of objects, and more and more peaks in the middle. Comparing nas-legacy and nas we found that the duration of a list-objects call has nearly doubled from nas to nas-legacy, but the pattern is nonetheless the same for nas, nas-legacy and nasxl.

nas-legacy image

nas image

Our suspicion was that the root cause is some kind of GlusterFS limit that is reached when reading all of the metafiles in the over bucket. To verify this suspicion we tried to access the files directly on the filesystem with find, grep and cat, but the access times scaled linearly and did not show the expected peaks.

As we could not explain the root cause of the peaks at certain numbers of objects by looking at the profiles from the previous post, we used the madmin tool to create tracing profiles. These profiles were taken while calling mc ls 3 times on each of the two nasxl buckets (over and under). The mc ls call to over took about 15 seconds and the call to under took about 5 seconds.

icas-nasxl-under-profiling.zip icas-nasxl-over-profiling.zip

gergan commented 2 years ago

We tried, unsuccessfully, to identify the cause of the slowdown with the CPU profiler, and the trace profiler gave us hardly any insight. The file access times in the over case are somewhat slower, but not by enough to explain the difference between the runs. We found an article about the shortcomings of the built-in profilers; perhaps we could use the proposed alternative to pinpoint the problem? We also found one more alternative profiler, which could be helpful.

aweisser commented 2 years ago

I found a 16k constant. It relates to the blockSize of the metacacheWriter, but it is still something that could be adjusted... grafik

Another problem could be that the output writer is initialized with a writer concurrency of 2 (also a magic number in the code) instead of GOMAXPROCS. grafik

This was changed by Klaus Post in November 2020. https://github.com/iternity-dotcom/minio/commit/0422eda6a261010e0387307c2ea84b298f9df38f
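As a hedged illustration only (the helper name below is hypothetical, not part of the MinIO code), deriving the concurrency from the number of usable CPUs instead of hard-coding 2 could look like this:

package main

import (
    "fmt"
    "runtime"
)

// writerConcurrency derives the output-writer concurrency from the number
// of usable CPUs instead of hard-coding 2, keeping 2 as a lower bound.
func writerConcurrency() int {
    n := runtime.GOMAXPROCS(0) // an argument of 0 only queries the current value
    if n < 2 {
        n = 2
    }
    return n
}

func main() {
    fmt.Println("writer concurrency:", writerConcurrency())
}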

aweisser commented 2 years ago

Another optimisation could be to replace the home-grown readDirN(string, int) and readDir(string) calls with the standard os.ReadDir(string), which is available since Go 1.16.

grafik

The implementation of os.ReadDir(string) also already returns a sorted result, so the caller doesn't need to sort... grafik

The count parameter of readDirN(string, int) should correspond to the -1 in f.ReadDir(-1). We can easily provide a func ReadDir(name string, count int) function to support this, as sketched below.
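A minimal sketch of such a wrapper (assuming Go 1.16+; this is not the existing readDirN implementation): with count <= 0 it behaves like os.ReadDir, and with a positive count it returns at most that many entries.

package main

import (
    "fmt"
    "os"
    "sort"
)

// ReadDir reads up to count entries from the directory name.
// count <= 0 means "read everything", matching f.ReadDir(-1).
func ReadDir(name string, count int) ([]os.DirEntry, error) {
    f, err := os.Open(name)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    if count <= 0 {
        count = -1
    }
    entries, err := f.ReadDir(count)
    if err != nil {
        return entries, err // with count > 0, io.EOF signals an exhausted directory
    }
    // os.ReadDir sorts by filename, f.ReadDir does not; sort here so callers
    // that rely on the order don't have to.
    sort.Slice(entries, func(i, j int) bool { return entries[i].Name() < entries[j].Name() })
    return entries, nil
}

func main() {
    entries, err := ReadDir(".", 10)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    for _, e := range entries {
        fmt.Println(e.Name())
    }
}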

tristanexsquare commented 2 years ago

We have repeated the profiling with fgprof. Still we could not find the root cause for the drop:

profile-nasxl-over-fgprof.zip profile-nasxl-under-fgprof.zip
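For reference, a minimal sketch of how fgprof is typically wired into a Go program (this follows the upstream fgprof README; it is not necessarily how it was hooked into the fork):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // also expose the standard pprof endpoints

    "github.com/felixge/fgprof"
)

func main() {
    // Expose the fgprof endpoint next to the regular pprof handlers.
    http.DefaultServeMux.Handle("/debug/fgprof", fgprof.Handler())
    go func() {
        log.Println(http.ListenAndServe(":6060", nil))
    }()

    // ... the workload to profile runs here ...
    select {}
}

A profile can then be captured and viewed with, for example, go tool pprof --http=:6061 http://localhost:6060/debug/fgprof?seconds=10.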

We have also tried to directly access the fuse mount and list/read all files. But we could not reproduce the performance drop with this.

l3support@fs-essquare-n1:/gluster/repositories/fsnoworm-s3/fsnoworm-s3-default> time find ls-test-under -type f -exec cat {} + > /dev/null

real    0m14.168s
user    0m0.128s
sys 0m1.063s
l3support@fs-essquare-n1:/gluster/repositories/fsnoworm-s3/fsnoworm-s3-default> time find ls-test-over -type f -exec cat {} + > /dev/null

real    0m17.267s
user    0m0.104s
sys 0m1.173s
tristanexsquare commented 2 years ago

Today we started to manually track down the root cause for the peak by adding more logging. The attached file shows all of our runs in chronological order.

We found out that the peak is caused by os.OpenFile(). This call is much slower for the over bucket compared to the under bucket.

iternity-profiling-protocol.txt

aweisser commented 2 years ago

If you take a look at the implementation of os.OpenFile, you'll notice that openFileNolog(3) uses a loop that retries syscall.Open(3) until it no longer returns an EINTR error code.

The EINTR error code is returned if a signal occurred while the system call was in progress.

So there are two possible reasons why os.OpenFile could be slow:

  1. The syscall.Open(3) itself is suddenly getting slow.
  2. The syscall.Open(3) suddenly throws a lot of EINTR errors, thus it's "blocked" by other signals.
aweisser commented 2 years ago

@rafikc30 Can you pls. throw some GlusterFS wisdom at us!?

When doing an s3:ListObjectsV2 on 16400 files, the os.OpenFile() calls (used implicitly) are 2-3 times slower compared to a listing of 15800 files.

gergan commented 2 years ago

So the script is still running, but we have enough data to see some jumps in the time needed for opening/reading the files. The really big fluctuations begin around 20000, after a relatively hefty slowdown.

Screenshot 2022-05-08 at 20 50 34

The script is basically a time on a find for all xl.meta files, which are then subsequently cat-ed. So the find times are also included, but these should be negligible if they behave as observed in MinIO.

aweisser commented 2 years ago

@gergan Can you pls. share the exact command, so that we can reproduce it!? Thanks.

gergan commented 2 years ago

Hi, it is a slightly modified version of the measure-ls.sh script. It can be found on the server under /home/l3support/client:

#!/bin/bash

# 1. Make sure that you've created an empty test bucket (e.g. ls-test) and an mc alias (e.g. nasxl)
# 2. Call the script like this: ./measure-ls.sh nasxl/ls-test 1 200 50 /mount/to/gluster/ls-test
# This will create a folder structure with one level (folders named 1 - 200) containing 50 objects in each folder.
# The script will generate two .csv files for PUT object and ListObjects latencies (in secs)

trap "exit" INT

TIMEFORMAT=%R

bucket=$1
min_folders=$2
max_folders=$3
files_per_folder=$4
bucket_mount=$5
outfile_prefix=`(echo $bucket | tr '/' '.')`
mc="mc"

touch file

object_count=0
for i in $(seq $min_folders $max_folders); do
    for j in $(seq 1 $files_per_folder); do
        echo `(time $mc cp file $bucket/$i/file$j --quiet --insecure >>/dev/null) 2>&1` >> $outfile_prefix.PUT.$min_folders.$max_folders.$files_per_folder.csv
    done
    ((object_count=object_count+files_per_folder))
    echo `(echo $object_count && time find $bucket_mount -type f -name 'xl.meta' -exec cat {} > /dev/null \;) 2>&1` |  tr ' ' ';' >> $outfile_prefix.LIST.$min_folders.$max_folders.$files_per_folder.csv
done
rafikc30 commented 2 years ago

@aweisser The Gluster open call is very unlikely to return EINTR on this occasion, although technically the open file system call can return EINTR when the blocked open call is interrupted by a signal handler.

aweisser commented 2 years ago

@rafikc30, @gergan , @tristanessquare: I wrote a little test routine based on a simplified version of openFileNolog(3). I would expect that the same pattern (performance drop at a certain number of objects) can be reproduced on any nasxl S3 space.

Follow these steps:

Step 1: Create an openFileNolog.go file:

package main

import (
    "fmt"
    "os"
    "syscall"
)

// Pass a file name as first argument
func main() {
    name := os.Args[1]
    eintr_cnt, _, _ := openFileNologSimplified(name, os.O_RDONLY, 0)
    fmt.Println(eintr_cnt)
}

// Simplified version of the Unix implementation of os.openFileNolog.
// It mainly lacks support for sticky bit and platforms where "close on exit" is not supported.
func openFileNologSimplified(name string, flag int, perm uint32) (int, int, error) {
    var r int
    c := 0
    for {
        var e error
        r, e = syscall.Open(name, flag|syscall.O_CLOEXEC, perm)
        if e == nil {
            break
        }
        // We have to check EINTR here, per issues 11180 and 39237.
        if e == syscall.EINTR {
            c++
            continue
        }
        return c, r, e

    }

    return c, r, nil
}

Step 2: Build it with go build openFileNolog.go

Step 3: Create a measure-openFileNolog.sh file:

#!/bin/bash

# 1. Make sure that you've created an empty test bucket (e.g. ls-test) and an mc alias (e.g. nasxl)
# 2. Call the script like this: ./measure-openFileNolog.sh nasxl/ls-test 1 200 50 /mount/to/gluster/ls-test
# This will create a folder structure with one level (folders named 1 - 200) containing 50 objects in each folder.
# The script will generate two .csv files for PUT object and ListObjects latencies (in secs)

trap "exit" INT

TIMEFORMAT=%R

bucket=$1
min_folders=$2
max_folders=$3
files_per_folder=$4
bucket_mount=$5
outfile_prefix=`(echo $bucket | tr '/' '.')`
mc="mc"

touch file

object_count=0
for i in $(seq $min_folders $max_folders); do
    for j in $(seq 1 $files_per_folder); do
        echo `(time $mc cp file $bucket/$i/file$j --quiet --insecure >>/dev/null) 2>&1` >> $outfile_prefix.PUT.$min_folders.$max_folders.$files_per_folder.csv
    done
    ((object_count=object_count+files_per_folder))
    echo `(echo $object_count && time find $bucket_mount -type f -name 'xl.meta' -exec ./openFileNolog {} \; | awk '{s+=$1} END {print s}' | tr '\n' ' ') 2>&1` |  tr ' ' ';' >> $outfile_prefix.OPEN.$min_folders.$max_folders.$files_per_folder.csv
done

Step 4: Create an empty bucket on a nasxl target: mc mb nasxl/test20000 --insecure

Step 5: Create 400 folders with 50 zero-byte files each and measure the openFileNolog function. (Make sure to replace <repo> and <space> accordingly!) ./measure-openFileNolog.sh nasxl/test20000 1 400 50 /gluster/repositories/<repo>/<space>/test20000 &

Step 6: Wait for the .csv output to be created.

tail -f nasxl.test20000.OPEN.1.400.50.csv
...
6150;0;47.509
6200;0;42.505
6250;0;43.542
6300;0;41.995

The columns are:

  1. object count
  2. Sum of EINTR errors for all <object count> calls to openFileNolog
  3. Total latency in secs
aweisser commented 2 years ago

I'm currently running the new test routine on one of my test VMs, and the openFileNolog calls also show a performance drop at 8650 objects (with the known factor of 2).

8400;0;78.794
8450;0;82.466
8500;0;77.508
8550;0;78.773
8600;0;83.723
8650;0;168.264
8700;0;222.027
8750;0;200.916

However, the EINTR error has not occurred so far, so the plain syscall.Open(name, flag|syscall.O_CLOEXEC, perm) is responsible for this behaviour. @rafikc30: fyi

aweisser commented 2 years ago

Here's a visualisation of my first test run. Contrary to what I suspected, the pattern is not that clear. The performance drops sharply at some points, but unlike the initial observation with s3:ListObjectsV2, the total duration recovers again.

grafik

I'll have to rerun the test to verify. In the first test run I also had some problems with my test system. So maybe this result is not representative.

rluetzner commented 2 years ago

I've modified the code of openFileNolog.go so that it traverses the files itself (no find necessary) and times the open operations directly. The durations are summed up and output as <num-files>;<open-duration>;<total-duration>. fyi: @aweisser

package main

import (
    "fmt"
    "io/fs"
    "os"
    "path/filepath"
    "syscall"
    "time"
)

var totalOpenTime time.Duration = time.Duration(0)
var totalFiles int = 0

// Pass a file name as first argument
func main() {
    start := time.Now()
    name := os.Args[1]
    filepath.WalkDir(name, visit)
    totalTime := time.Since(start)
    fmt.Printf("%d;%s;%s\n", totalFiles, totalOpenTime, totalTime)
}

func visit(path string, d fs.DirEntry, err error) error {
    isDir, _ := isDirectory(path)
    if !isDir {
        duration := openFile(path)
        totalOpenTime += duration
        totalFiles++
    }
    return nil
}

// isDirectory determines if a file represented
// by `path` is a directory or not
func isDirectory(path string) (bool, error) {
    fileInfo, err := os.Stat(path)
    if err != nil {
        return false, err
    }

    return fileInfo.IsDir(), err
}

func openFile(name string) time.Duration {
    duration, r, _ := openFileNologSimplified(name, os.O_RDONLY, 0)
    // Close is done here, to ensure that we measure ONLY the
    // syscall.Open() duration in openFileNologSimplified().
    // We should ensure that filehandles are closed to
    // 1. prevent memory leaks and
    // 2. because GlusterFS has a fixed upper limit of how many
    //    fds can be open.
    defer syscall.Close(r)
    return duration
}

// Simplified version of the Unix implementation of os.openFileNolog.
// It mainly lacks support for sticky bit and platforms where "close on exit" is not supported.
func openFileNologSimplified(name string, flag int, perm uint32) (time.Duration, int, error) {
    start := time.Now()
    var r int
    c := 0
    for {
        var e error
        r, e = syscall.Open(name, flag|syscall.O_CLOEXEC, perm)
        if e == nil {
            break
        }
        // We have to check EINTR here, per issues 11180 and 39237.
        if e == syscall.EINTR {
            c++
            continue
        }
        return time.Since(start), r, e

    }

    return time.Since(start), r, nil
}
aweisser commented 2 years ago

Here's my rerun on a new test cluster using v.1.12.dev and an isolated measurement of the syscall.Open(3) only. My recent test run posted above is not representative, as it also measures the duration of the find, so it's obsolete.

According to my new test run, the syscall.Open(3) is NOT causing the performance drops of the s3:ListObjectsV2 operation (triggered by an mc ls command). At least not when using v.1.12.dev.

As you can see, the pattern of the mc ls command (orange) is similar to the one we observed in the beginning: the performance drops every 4500 objects. This is because the MinIO version in v.1.12.dev is not using the "continuation marker" optimization by default.

The isolated openFileNolog command (green), however, doesn't show those persistent performance drops. The duration goes up and down and increases slightly over time, but it's not the same "stairs" pattern as we see in the different mc ls tests.

grafik

aweisser commented 2 years ago

Next steps:

rluetzner commented 2 years ago

I've ripped out the WalkDir algorithm and any related code from MinIO and put it into a small Go program. The code is available here: https://github.com/rluetzner/minio-list-debug.

rluetzner commented 2 years ago

I found an interesting Gluster setting: network.inode-lru-limit is set to 16,384 by default. When it is increased to 19,000, the jump in the curve moves noticeably: after 19,000 objects there is the same shift in listing duration as we've seen before. I retried the same with a value of 0, which stays nicely linear until the end.

These curves are listing results from the tool I built from the extracted MinIO WalkDir function.

plot

However, if I understand the Gluster code correctly, this setting cannot be turned off and has a hard limit of 32,768 (32*1024, a.k.a. DEFAULT_INODE_MEMPOOL_ENTRIES). This is confusing, because the Gluster docs specifically advise setting this value to 50,000. I've also seen articles mentioning even 200,000. My C is pretty bad, but if I understand the following code correctly, the value can only be set between 1 and DEFAULT_INODE_MEMPOOL_ENTRIES (values less than 0 are already handled elsewhere and mapped to 16,384).

// excerpt from libglusterfs/src/inode.c from GlusterFS:
if (!mem_pool_size || (mem_pool_size > DEFAULT_INODE_MEMPOOL_ENTRIES))
        mem_pool_size = DEFAULT_INODE_MEMPOOL_ENTRIES;

One could argue that the Gluster docs are outdated or the code is wrong, but the same code above can be found in GlusterFS upstream and was last modified 4 years ago. Maybe it's worth asking some Gluster people about this mismatch of information.

rluetzner commented 2 years ago

plot-large

I ran the tests longer and with more files. Anything with (more files) in the graph was recorded with

# Create 400 folders with 500 files per folder, i.e. 200,000 files.
./measure-openFileNolog.sh nasxl/test8 1 400 500 /gluster/repositories/fsnoworm-s3/fsnoworm-s3/test8/

Surprisingly, setting network.inode-lru-limit to 0 behaves differently from setting the value to 200,000. The first jump in the curve occurs at a similar point, but in almost all cases the green line stays below the orange one. With the knowledge from above I would have expected the jumps in both curves to occur at multiples of 32,768. They do jump around 65,000, but not consistently, and surprisingly not around 32,000 files.

I wanted to check if the curves converge for higher file counts, no matter what value is configured. That's why the pink line is there. The curves do converge in a way, but it looks like there are still better and worse values. Most of the time, the green curve would be what we're looking for.

Additionally, for now it looks like there is no way to get rid of these jumps entirely. I suspect that this is due to the cache implementation. When the memory pool starts hitting its limit we encounter something that looks very similar to garbage collection in other languages: execution time slows down to allow data that is no longer needed to be removed from the mem pool.


Next I want to re-run the test with a setting of 0 to see if the results are similar to the first execution. I'm still surprised that the curves for 0 and 200,000 differ from each other, and I want to check whether this might be caused by there being a larger number of files overall on the system.

If my assumption is correct, then a second run with a value of 0 should show a closer resemblance to the orange curve than to the green one.

rluetzner commented 2 years ago

plot-with-rerun

Well...this is where expectations meet reality. The black and green graphs in this plot were recorded with the same network.inode-lru-limit and look nothing alike. In fact even my assumption that the (now) green curve would be closer to the (now) blue curve is not accurate.

The green curve jumps around 32,000 instead of around 60,000, where all the other curves did.

Right now the only thing we can gather is that Gluster's default setting is not ideal for large numbers of small files.


To see if this is caused by the increasing number of files in the backend or if this is indeed arbitrary, I'll have to set up a new system and run the script multiple times for different buckets.

I also want to extract the Gluster code I posted above (and didn't fully understand) into a small C file and compile it to see what it actually does.

rluetzner commented 2 years ago

I've written a short piece of C code to test my theory. network.inode-lru-limit is indeed limited to the range 1 to 32,768; a value of 0 or anything above 32,768 falls back to the default. I've opened up a ticket with Gluster Docs (https://github.com/gluster/glusterdocs/issues/753) because they contain misleading pages that mention setting the value to 50,000 or 200,000. I hope to get this cleared up soon.

Here's the code I used to test my theory:

#include <stdio.h>

const int DEFAULT_INODE_MEMPOOL_ENTRIES  = 32*1024;

int main()
{
    int mem_pool_size = 0;
    printf("Enter a Number: ");
    scanf("%d", &mem_pool_size);
    if (!mem_pool_size || (mem_pool_size > DEFAULT_INODE_MEMPOOL_ENTRIES))
    {
        mem_pool_size = DEFAULT_INODE_MEMPOOL_ENTRIES;
        printf("Using the default size.");
    }
    else
        printf("Not using the default size.");
    return 0;
}

And for anyone that is like me and has to look up how to compile C code:

  1. Check if you have gcc installed: gcc -v
  2. Compile the code: gcc main.c -o main
  3. Run the program: ./main

Results:

| Input value | Output                      |
|-------------|-----------------------------|
| 0           | Using the default size.     |
| 1           | Not using the default size. |
| 32768       | Not using the default size. |
| 32769       | Using the default size.     |
rluetzner commented 2 years ago

I've had a very interesting discussion with Rafi about this.

inode mem pool initialization

I learned that what I saw in the code was merely where the inode mem pool is initialized. Whenever the inode-lru-limit is 0 or higher than DEFAULT_INODE_MEMPOOL_ENTRIES the mem pool is initialized with space for 32 * 1024 inodes. When the limit is reached the mem pool dynamically increases.

There is a separate function called inode_table_prune in libglusterfs/src/inode.c that ensures that the mem pool stays inside the configured limit.

The struct containing the mem pool also contains a field lru_limit, which holds the correct value. It is therefore very much possible to set values higher than 32 * 1024. I've since closed the issue in the Gluster docs, because I merely misunderstood the code.
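As a conceptual analogy only (in Go, not the actual GlusterFS C code): the initial pool size and the configured lru_limit are independent, and a prune step, much like inode_table_prune, evicts least-recently-used entries whenever the table grows beyond the limit.

package main

import (
    "container/list"
    "fmt"
)

// lruTable is a toy analogy for the GlusterFS inode table: entries are kept
// in LRU order and pruned down to lruLimit after every insert.
type lruTable struct {
    lruLimit int
    entries  *list.List // front = most recently used
}

func (t *lruTable) add(name string) {
    t.entries.PushFront(name)
    t.prune()
}

// prune plays the role of inode_table_prune: it keeps the table within the
// configured limit by evicting from the least-recently-used end.
func (t *lruTable) prune() {
    for t.lruLimit > 0 && t.entries.Len() > t.lruLimit {
        t.entries.Remove(t.entries.Back())
    }
}

func main() {
    t := &lruTable{lruLimit: 3, entries: list.New()}
    for _, n := range []string{"a", "b", "c", "d", "e"} {
        t.add(n)
    }
    fmt.Println("entries kept:", t.entries.Len()) // prints 3
}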

client-side vs. server-side caching

There is a separate inode cache on the client side (the FUSE mount). The network.inode-lru-limit setting I found only concerns the server side (brick processes) and has a lesser impact according to Rafi. His assumption was that we would observe a much higher jump in the graph around 64k (which is the default size of the client-side inode cache). This is very much visible for the pink line in the plot from this comment.

The next steps would therefore be to increase the server-side cache to values beyond 65k and adjust the client-side lru-cache accordingly. We have to keep in mind that the higher we go, the more memory we will allocate on the systems.

According to Rafi, the client-side setting is a mount option called lru-limit and can either be added via -o when mounting the Gluster volume manually, using glusterfs -s <hostname> --volfile-id <volname> --lru-limit <n> <mount-path> or simply added to /etc/fstab along with the other options we are already using.

rluetzner commented 2 years ago

The following graph was recorded with

./measure-openFileNolog.sh "nasxl/$bucket" 1 40 500 "/gluster/repositories/fsnoworm-s3/fsnoworm-s3/$bucket/"

so the points are 500 files apart (fewer measurements than before). They also stop at 20,000, so we don't see a long-term trend. I replaced the tool I built from the extracted listing algorithm with calls to mc ls --recursive --insecure --quiet $bucket. The Gluster LRU caches on the FUSE mount and the bricks have both been set to 200,000.

plot-new-minio

The result surprised me quite a bit. Even though we don't see the long-term trend, it is obvious that the official MinIO binary with an XL backend performs much better at listing files compared to our forked code. I'm also surprised how little the curves deviate from each other. Maybe this would be different with more points (i.e. fewer files per folder, but more folders to list), but I would expect the trend to be similar.

Right now I'm recording a measurement up to 200,000 files to see how the official MinIO binary performs with higher file counts.