tserong opened this issue 1 year ago
We should test whether the S3 command is slow or whether it is happening on the server side.
Well, it's not sqlite ;-)
# for 100k objects
> time sqlite3 s3gw.db 'select uuid, name from objects;'|wc -l
100000
real 0m0.035s
user 0m0.021s
sys 0m0.019s
# for 400k objects
> time sqlite3 s3gw.db 'select uuid, name from objects;'|wc -l
400000
real 0m0.113s
user 0m0.090s
sys 0m0.037s
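For anyone wanting to reproduce the raw-SQLite side of this on a scratch database, here is a minimal sketch. The schema is a guess from the column names that appear later in this thread (uuid, bucket_id, name), not s3gw's actual schema, and 'bucket1' / the key format are made up:

CREATE TABLE objects (uuid TEXT PRIMARY KEY, bucket_id TEXT, name TEXT);
CREATE INDEX object_bucketid_name ON objects (bucket_id, name);
-- Populate 100k synthetic rows with a recursive CTE:
WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i + 1 FROM n WHERE i < 100000)
INSERT INTO objects (uuid, bucket_id, name)
SELECT hex(randomblob(16)), 'bucket1', printf('obj-%08d', i) FROM n;

After that, the same time sqlite3 invocation as above applies.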
I tested again using both s3cmd and mc, so at least I'm trying two different CLI tools. Here's what I got for 100k objects:
> time s3cmd -c .s3cfg-test la s3://|wc -l
100001
real 0m37.495s
user 0m20.488s
sys 0m0.606s
> time mc ls --recursive s3gw|wc -l
100000
real 0m23.088s
user 0m4.323s
sys 0m0.864s
...and here's what I got for 400k objects:
> time s3cmd -c .s3cfg-test la s3://|wc -l
400001
real 6m5.658s
user 1m25.723s
sys 0m2.521s
> time mc ls --recursive s3gw|wc -l
400000
real 4m56.893s
user 0m18.461s
sys 0m3.468s
Converted to seconds (and rounded), we get:
s3cmd
- 100k objects: 37s; 400k objects: 366s (9.9x duration)
mc
- 100k objects: 23s; 400k objects: 297s (13x duration)
So mc is faster than s3cmd in both cases, but both mc and s3cmd see a non-linear slowdown as the number of objects increases (if it were linear with respect to the number of objects, I'd expect the 400k listing to take 148s with s3cmd and 92s with mc).
Well, it's not sqlite ;-)
Well, it is :)
Because: (1) listing queries get slower with bucket size, and (2) pagination hurts.
(1) The standard listing query is a bit more complicated than "select all objects": it has to prefix-search, support a continuation marker, and filter out deleted / delete-marked objects.
SELECT objects.name, versioned_objects.mtime, versioned_objects.etag,
       SUM(versioned_objects.size)
FROM objects
INNER JOIN versioned_objects ON (objects.uuid = versioned_objects.object_id)
WHERE versioned_objects.object_state = 1
  AND objects.name > ''
  AND objects.name LIKE 'ff%'
GROUP BY versioned_objects.object_id
HAVING MAX(versioned_objects.version_type) = 0
ORDER BY objects.name
LIMIT 1001;
Query plan:
QUERY PLAN
|--SEARCH objects USING INDEX object_bucketid_name (bucket_id=? AND name>?)
|--SEARCH versioned_objects USING INDEX vobjs_object_id_idx (object_id=?)
|--USE TEMP B-TREE FOR GROUP BY
`--USE TEMP B-TREE FOR ORDER BY
This means we always process the whole bucket starting at the pagination marker (name >). The GROUP BY removes delete-marked objects, the ORDER BY does the mandatory sorting, and the LIMIT merely caps the result set. Queries actually get faster the closer the pagination marker is to the end: each page scans from its marker to the end of the bucket, so summed over all pages the total listing cost grows roughly quadratically with bucket size, which fits the non-linear slowdown measured above.
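As a concrete illustration, here is what a later page's query would look like with the continuation marker filled in (the marker value is made up; everything else mirrors the captured query above):

-- The marker is the last key returned by the previous page; the index seek
-- starts there, but everything from the marker to the end of the bucket is
-- still grouped and sorted before LIMIT applies, so early pages are slowest.
SELECT objects.name, versioned_objects.mtime, versioned_objects.etag,
       SUM(versioned_objects.size)
FROM objects
INNER JOIN versioned_objects ON (objects.uuid = versioned_objects.object_id)
WHERE versioned_objects.object_state = 1
  AND objects.name > 'obj-00001000'  -- continuation marker from the previous page
GROUP BY versioned_objects.object_id
HAVING MAX(versioned_objects.version_type) = 0
ORDER BY objects.name
LIMIT 1001;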
Note: the benchmark listing above doesn't use a prefix search. Prefix searches are very common with folder-esque object names and limit the result set further.
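For example, a prefix listing would look something like this (the 'photos/2023/' prefix is made up; as in the captured query above, the LIKE narrows what survives the scan):

-- Hypothetical prefix listing: with folder-esque names, most requests carry a
-- prefix, so far fewer rows make it through than in a whole-bucket listing.
SELECT objects.name
FROM objects
WHERE objects.bucket_id = ?
  AND objects.name > ''                   -- continuation marker, as before
  AND objects.name LIKE 'photos/2023/%'   -- prefix filter
ORDER BY objects.name
LIMIT 1001;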
(2) With a 100000 object bucket I see 200ms per page at the beginning, going down to just above 10ms by the end, on my machine. With SQLite profiling enabled (--rgw-sfs-sqlite-profile 1):
$ awk '/SLOW QUERY/ { printf "%s ", $8; gsub("ms","", $8); sum+=$8; } END { printf "\n"; print "total", sum "ms" } ' last.log
195ms 192ms 192ms 188ms 189ms 184ms 184ms 186ms 184ms 189ms 187ms 176ms 176ms 175ms 169ms 180ms 167ms 174ms 164ms 166ms 159ms 160ms 159ms 154ms 156ms 151ms 152ms 151ms 143ms 145ms 142ms 140ms 139ms 137ms 136ms 131ms 129ms 128ms 126ms 125ms 121ms 122ms 119ms 125ms 115ms 112ms 112ms 108ms 107ms 107ms 103ms 110ms 100ms 98ms 91ms 97ms 91ms 92ms 91ms 88ms 86ms 83ms 84ms 79ms 78ms 77ms 75ms 74ms 72ms 69ms 68ms 66ms 63ms 62ms 60ms 57ms 56ms 55ms 52ms 52ms 49ms 46ms 45ms 43ms 40ms 40ms 40ms 35ms 34ms 32ms 29ms 27ms 25ms 23ms 22ms
total 10487ms
Op time is dominated by query time. Op latency in the logs suggests about 10ms on top of the query time.
(2) Pagination hurts because we don't pay for the query once, but once per 1000 objects: listing the 100k bucket takes about 100 paginated requests, whose per-page query times above sum to ~10.5s.
For the above result s3cmd took about a minute. RGW accounts for roughly 12s of that. There however seems to be a delay between requests of (eyeballing from logs) about 400ms, which over ~100 requests adds up to ~40s. That may account for the s3cmd / mc difference.
In summary, listing an unstructured object key space is expensive. It gets better using prefixes and the common-delimiter logic.
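For completeness, here is a sketch of what the common-delimiter rollup means in SQL terms (prefix 'photos/' and delimiter '/' are made up; this is an illustration, not s3gw's actual implementation). Keys that share the next path segment collapse into a single CommonPrefixes entry, so a deep "folder" costs one result row instead of thousands:

SELECT DISTINCT
  substr(name, 1, length('photos/') + instr(substr(name, length('photos/') + 1), '/'))
    AS common_prefix
FROM objects
WHERE bucket_id = ?
  AND name LIKE 'photos/%'
  -- keep only keys with a further '/' after the prefix; the rest are
  -- returned as ordinary Contents entries instead
  AND instr(substr(name, length('photos/') + 1), '/') > 0;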
Well, it's not sqlite ;-)
Well, it is :)
Good to see one of us knows how this thing works ;-P Thanks for the detailed explanation!
@irq0 what do we want to do with this? Push it further down the line? Do we think this is reasonable to address at some point, or is it to be written off as an implementation limitation?
Suggest we push it until we have a proper target performance metric to work towards. Listing really big buckets isn't fast, but it is not prohibitively slow either (as in, requests don't fail).
Note for the future: We could limit the search space via <= rowid. This would be a heuristic, but according to the "spec" we are allowed to return fewer than max-keys even when we could return more. This could bring down the query time at the beginning of the bucket.
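A sketch of that heuristic applied to the query from above (the cap value would come from some per-page bookkeeping; this is an illustration, not an implemented change):

-- Capping rowid bounds the search space, so an early page no longer has to
-- group and sort everything from its marker to the end of the bucket. If the
-- cap cuts a page short, we return fewer than max-keys, which the "spec"
-- permits.
SELECT objects.name, versioned_objects.mtime, versioned_objects.etag,
       SUM(versioned_objects.size)
FROM objects
INNER JOIN versioned_objects ON (objects.uuid = versioned_objects.object_id)
WHERE versioned_objects.object_state = 1
  AND objects.name > ''
  AND objects.rowid <= ?   -- heuristic search-space cap
GROUP BY versioned_objects.object_id
HAVING MAX(versioned_objects.version_type) = 0
ORDER BY objects.name
LIMIT 1001;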
Cool, pushing it then.
If I have one bucket with 100,000 objects in it, and I use s3cmd la to list everything, it takes about 38 seconds.

If I have one bucket with 400,000 objects in it, and I use s3cmd la to list everything, it takes about 6.5 minutes.

Convert that to seconds, and we get 392 seconds to list 400K objects vs 38 seconds to list 100K objects. That's a rather non-linear time increase.

I have no idea what's going on here, but having seen it, I felt I should record it as an area for further investigation.