s3gw-tech / s3gw

Container able to run on a Kubernetes cluster, providing S3-compatible endpoints to applications.
https://s3gw.tech
Apache License 2.0

Listing objects gets really slow when there's lots of objects #709

Open tserong opened 1 year ago

tserong commented 1 year ago

If I have one bucket with 100,000 objects in it and use s3cmd la to list everything, it takes about 38 seconds:

> time s3cmd -c .s3cfg-test la s3://|wc -l
100001

real    0m37.753s
user    0m19.910s
sys 0m0.762s

If I have one bucket with 400,000 objects in it and use s3cmd la to list everything, it takes about 6.5 minutes:

> time s3cmd -c .s3cfg-test la s3://|wc -l
400001

real    6m32.201s
user    1m30.180s
sys 0m2.498s

Convert that to seconds and we get 392 seconds to list 400K objects vs 38 seconds to list 100K objects: roughly 10x the time for 4x the objects, which is a rather non-linear increase.

I have no idea what's going on here, but having seen it I felt I should record it as an area for further investigation.

jhmarina commented 1 year ago

We should test whether the S3 client is slow or whether the slowdown is happening on the server side.

tserong commented 1 year ago

Well, it's not sqlite ;-)

# for 100k objects
> time sqlite3 s3gw.db 'select uuid, name from objects;'|wc -l
100000

real    0m0.035s
user    0m0.021s
sys     0m0.019s

# for 400k objects
> time sqlite3 s3gw.db 'select uuid, name from objects;'|wc -l
400000

real    0m0.113s
user    0m0.090s
sys     0m0.037s

I tested again using both s3cmd and mc, so at least I'm trying two different CLI tools. Here's what I got for 100k objects:

> time s3cmd -c .s3cfg-test la s3://|wc -l
100001

real    0m37.495s
user    0m20.488s
sys     0m0.606s

> time mc ls --recursive s3gw|wc -l
100000

real    0m23.088s
user    0m4.323s
sys     0m0.864s

...and here's what I got for 400k objects:

> time s3cmd -c .s3cfg-test la s3://|wc -l
400001

real    6m5.658s
user    1m25.723s
sys     0m2.521s

> time mc ls --recursive s3gw|wc -l
400000

real    4m56.893s
user    0m18.461s
sys     0m3.468s

Converted to seconds (and rounded), we get:

s3cmd - 100k objects: 37s; 400k objects: 367s (9.9x duration)
mc    - 100k objects: 23s; 400k objects: 297s (13x duration)

So mc is faster than s3cmd in both cases, but both mc and s3cmd see a non-linear slowdown as the number of objects increases (if it were linear WRT the number of objects, I'd expect the 400k listing to take 148s with s3cmd and 92s with mc).
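
For reference, that linear expectation is just the rounded 100k time multiplied by four, since there are 4x as many objects; a quick check of the arithmetic:

> awk 'BEGIN { printf "s3cmd: linear would be %ds (observed 367s)\nmc: linear would be %ds (observed 297s)\n", 37*4, 23*4 }'
s3cmd: linear would be 148s (observed 367s)
mc: linear would be 92s (observed 297s)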

irq0 commented 1 year ago

Well, it's not sqlite ;-)

Well, it is :)

Because (1) listing queries get slower with bucket size, and (2) pagination hurts.

(1) The standard listing query is a bit more complicated than a plain "select all objects": it has to handle a prefix search, support a continuation marker, and filter out deleted / delete-marked objects.

SELECT objects.name, versioned_objects.mtime, versioned_objects.etag, SUM(versioned_objects.size)
FROM objects INNER JOIN versioned_objects ON (objects.uuid = versioned_objects.object_id)
WHERE ((((versioned_objects.object_state = 1)) AND (objects.name > '')) AND objects.name LIKE 'ff%')
GROUP BY versioned_objects.object_id
HAVING (MAX(versioned_objects.version_type) = 0)
ORDER BY objects.name
LIMIT 1001;

Query plan:

QUERY PLAN
|--SEARCH objects USING INDEX object_bucketid_name (bucket_id=? AND name>?)
|--SEARCH versioned_objects USING INDEX vobjs_object_id_idx (object_id=?)
|--USE TEMP B-TREE FOR GROUP BY
`--USE TEMP B-TREE FOR ORDER BY

This means we always process the whole bucket starting at the pagination marker (name >). The GROUP BY removes delete-marked objects, the ORDER BY does the mandatory sorting, and the LIMIT merely caps the result set. Queries actually get faster the closer the pagination marker is to the end of the bucket.

Note: We don't prefix search in this query. Prefix searching is very common when object names are folder-esque, and it limits the result set further.
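
For anyone who wants to poke at this on their own metadata store, sqlite3's EXPLAIN QUERY PLAN shows how the listing query gets executed (same query as above, with the example 'ff%' prefix; substitute your own s3gw.db path):

> sqlite3 s3gw.db "EXPLAIN QUERY PLAN
    SELECT objects.name, versioned_objects.mtime, versioned_objects.etag, SUM(versioned_objects.size)
    FROM objects INNER JOIN versioned_objects ON (objects.uuid = versioned_objects.object_id)
    WHERE ((((versioned_objects.object_state = 1)) AND (objects.name > '')) AND objects.name LIKE 'ff%')
    GROUP BY versioned_objects.object_id
    HAVING (MAX(versioned_objects.version_type) = 0)
    ORDER BY objects.name LIMIT 1001;"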

(2) With a 100,000-object bucket I see about 200ms per listing query at the beginning, going down to just over 10ms towards the end, on my machine. With SQLite profiling enabled (--rgw-sfs-sqlite-profile 1):

# sum up the per-query durations logged as SLOW QUERY entries
$ awk '/SLOW QUERY/ { printf "%s ", $8; gsub("ms","", $8); sum+=$8; } END { printf "\n"; print "total", sum "ms" } ' last.log

195ms 192ms 192ms 188ms 189ms 184ms 184ms 186ms 184ms 189ms 187ms 176ms 176ms 175ms 169ms 180ms 167ms 174ms 164ms 166ms 159ms 160ms 159ms 154ms 156ms 151ms 152ms 151ms 143ms 145ms 142ms 140ms 139ms 137ms 136ms 131ms 129ms 128ms 126ms 125ms 121ms 122ms 119ms 125ms 115ms 112ms 112ms 108ms 107ms 107ms 103ms 110ms 100ms 98ms 91ms 97ms 91ms 92ms 91ms 88ms 86ms 83ms 84ms 79ms 78ms 77ms 75ms 74ms 72ms 69ms 68ms 66ms 63ms 62ms 60ms 57ms 56ms 55ms 52ms 52ms 49ms 46ms 45ms 43ms 40ms 40ms 40ms 35ms 34ms 32ms 29ms 27ms 25ms 23ms 22ms
total 10487ms

Op time is dominated by query time. Op latency in the logs suggests about 10ms on top of the query time.

(2) Pagination hurts, because we don't pay the query cost once, but once per 1000 objects.
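
A rough back-of-the-envelope model of why the total goes super-linear (my own sketch, assuming each page's query walks from the continuation marker to the end of the bucket, as described above): with a page size of 1000, the rows touched across all pages add up to roughly N^2 / 2000, so 4x the objects means about 16x the scan work.

> awk 'BEGIN { for (n = 100000; n <= 400000; n *= 2) printf "%d objects -> ~%d rows scanned across all list pages\n", n, n*n/2000 }'
100000 objects -> ~5000000 rows scanned across all list pages
200000 objects -> ~20000000 rows scanned across all list pages
400000 objects -> ~80000000 rows scanned across all list pages

The observed ~10-13x wall-clock increase for 4x the objects is in that ballpark once the per-request and client-side overheads, which scale linearly, are folded in.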


For the above result s3cmd took about a minute, of which RGW accounts for roughly 12s. There does, however, seem to be a delay between requests of (eyeballing the logs) around 400ms, which over ~100 requests adds up to about 40s. That may account for the s3cmd / mc difference.


In summary, listing an unstructured object key space is expensive. It gets better with prefixes and the common delimiter logic.
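
For example (bucket and prefix names here are just placeholders), a prefixed, delimited listing only walks the matching slice of the key space instead of the whole bucket:

> s3cmd -c .s3cfg-test ls s3://mybucket/2023/07/    # non-recursive: prefix + '/' delimiter
> mc ls s3gw/mybucket/2023/07/                      # same idea with mc (no --recursive)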

tserong commented 1 year ago

Well, it's not sqlite ;-)

Well, it is :)

Good to see one of us knows how this thing works ;-P Thanks for the detailed explanation!

jecluis commented 1 year ago

@irq0 what do we want to do with this? Push it further down the line? Do we think it is reasonable to address at some point, or is it to be written off as an implementation limitation?

irq0 commented 1 year ago

Suggest we push it until we have a proper target performance metric to work towards. Listing really big buckets isn't fast, but it isn't prohibitively slow either (as in, requests don't fail).

Note for the future: we could limit the search space via a rowid upper bound (<= rowid). This would be a heuristic, but according to the "spec" we are allowed to return fewer entries than the requested maximum even if more are available. This could bring down the query time at the beginning of the bucket.
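
To make that concrete, one possible reading of the idea (illustrative only; :max_rowid is a hypothetical bound captured when the listing starts, and whether SQLite can actually use it to cut the scan would need checking with EXPLAIN QUERY PLAN):

SELECT objects.name, versioned_objects.mtime, versioned_objects.etag, SUM(versioned_objects.size)
FROM objects INNER JOIN versioned_objects ON (objects.uuid = versioned_objects.object_id)
WHERE versioned_objects.object_state = 1
  AND objects.name > ''              -- continuation marker, as before
  AND objects.rowid <= :max_rowid    -- heuristic cap on the search space
GROUP BY versioned_objects.object_id
HAVING (MAX(versioned_objects.version_type) = 0)
ORDER BY objects.name
LIMIT 1001;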

jecluis commented 1 year ago

Cool, pushing it then.