Closed votdev closed 12 months ago
I see the query string for the command is different when running the python script and the aws cli
.
With the python script:
2023-11-24T14:46:11.359+0100 7efd419496c0 20 QUERY_STRING=versions&prefix=&delimiter=%2F&key-marker=&encoding-type=url
With aws cli
2023-11-24T14:49:28.599+0100 7efd3c13e6c0 20 QUERY_STRING=versions&prefix=&encoding-type=url
Difference is that the python tool is passing a delimiter, while the aws cli
command is not.
Adding extra info:
I can see 2 queries issued for the python script while 1 for the aws cli
command.
Adding: --delimiter /
to the aws cli
command triggers 2 queries to the backend.
It does not make any difference if the bucket is versioned or not.
When running with --delimiter "/"
all objects are duplicated as well.
$ aws s3api list-object-versions --endpoint=http://127.0.0.1:7480 --bucket "test01" --prefix "" --delimiter "/"
{
"Versions": [
{
"ETag": "\"8d1c84a7fbe1dbc559c6b5c63fa184fc-1\"",
"Size": 26,
"StorageClass": "STANDARD",
"Key": "test",
"VersionId": "null",
"IsLatest": true,
"LastModified": "2023-11-24T14:23:35.129Z",
"Owner": {
"DisplayName": "M. Tester",
"ID": "testid"
}
},
{
"ETag": "\"13e90c08734c1fb6faacf5cc62968a9d-1\"",
"Size": 205,
"StorageClass": "STANDARD",
"Key": "test1",
"VersionId": "null",
"IsLatest": true,
"LastModified": "2023-11-24T14:23:35.125Z",
"Owner": {
"DisplayName": "M. Tester",
"ID": "testid"
}
},
{
"ETag": "\"66f0fb190ead6b1e2d1e880b0ce71795-1\"",
"Size": 654,
"StorageClass": "STANDARD",
"Key": "test2",
"VersionId": "null",
"IsLatest": true,
"LastModified": "2023-11-24T14:23:35.129Z",
"Owner": {
"DisplayName": "M. Tester",
"ID": "testid"
}
},
{
"ETag": "\"8d1c84a7fbe1dbc559c6b5c63fa184fc-1\"",
"Size": 26,
"StorageClass": "STANDARD",
"Key": "test",
"VersionId": "null",
"IsLatest": true,
"LastModified": "2023-11-24T14:23:35.129Z",
"Owner": {
"DisplayName": "M. Tester",
"ID": "testid"
}
},
{
"ETag": "\"13e90c08734c1fb6faacf5cc62968a9d-1\"",
"Size": 205,
"StorageClass": "STANDARD",
"Key": "test1",
"VersionId": "null",
"IsLatest": true,
"LastModified": "2023-11-24T14:23:35.125Z",
"Owner": {
"DisplayName": "M. Tester",
"ID": "testid"
}
},
{
"ETag": "\"66f0fb190ead6b1e2d1e880b0ce71795-1\"",
"Size": 654,
"StorageClass": "STANDARD",
"Key": "test2",
"VersionId": "null",
"IsLatest": true,
"LastModified": "2023-11-24T14:23:35.129Z",
"Owner": {
"DisplayName": "M. Tester",
"ID": "testid"
}
}
],
"CommonPrefixes": [
{
"Prefix": "b/"
}
]
}
The issue is caused by the following logic:
if (!results.common_prefixes.empty()) {
std::vector<rgw_bucket_dir_entry> objects_after;
std::string query = std::prev(results.common_prefixes.end())->first;
query.append(
sfs::S3_MAX_OBJECT_NAME_BYTES - query.size(),
std::numeric_limits<char>::max()
);
list.objects(get_bucket_id(), params.prefix, query, 1, objects_after);
results.is_truncated = objects_after.size() > 0;
}
That is called when we have common prefixes involved, like in this case.
Problem is that logic for SQLiteList::objects
is based on sorting by name only...
This is the query:
auto rows = storage->select(
columns(
&DBObject::name, &DBVersionedObject::mtime, &DBVersionedObject::etag,
sum(&DBVersionedObject::size)
),
inner_join<DBVersionedObject>(
on(is_equal(&DBObject::uuid, &DBVersionedObject::object_id))
),
where(
is_equal(&DBVersionedObject::object_state, ObjectState::COMMITTED) and
is_equal(&DBObject::bucket_id, bucket_id) and
greater_than(&DBObject::name, start_after_object_name) and
prefix_to_like(&DBObject::name, prefix)
),
group_by(&DBVersionedObject::object_id)
.having(is_equal(
sqlite_orm::max(&DBVersionedObject::version_type),
VersionType::REGULAR
)),
order_by(&DBObject::name), limit(query_limit)
);
The first call adds the following items:
As we're using delimiter /
, b/
is considered a common prefix and the logic above is triggered.
When the code above is called parameters are:
prefix = ""
start_after_object_name = "b/"
max = 1
But the objects test
, test1
, and test2
have a greater name than b/
so the first of them is obtained from the database.
The last statement in the code above
results.is_truncated = objects_after.size() > 0;
results in results.is_truncated = true
.
This makes rgw
to trigger another call this time calling list versions with
start_after_object_name = "b/"
And, again... objects test
, test1
, and test2
have a greater name than b/
, so they're re-added.
The s3tests
test_bucket_listv2_delimiter_prefix
and test_bucket_list_delimiter_prefix
pass because there're both requesting for 1 single object every time.
I think we'd need some extra logic that is not only based on name order. What we need is to find out if we have something else after the common prefix, but that was not added before, to determine if the result is truncated or not.
When calling the AWS S3 API function list_object_versions the response of a really simple bucket with 4 objects is truncated. IMO this should be only one call because the max keys of 1000 is not reached. Because of this unexpected behaviour the UI REST API is doing a second call to fetch the rest of the object versions. This results in the not so nice behaviour that objects are displayed several times in the UI (because the UI REST API appends each object from the response to a list). This behaviour only appears when prefixes are used which effectively are "virtual folders".
To reproduce that issue create a versioned bucket and add the following objects:
Use this Python code to see what happens.
Run the example app.
The
aws
CLI command does not show this behaviour, but i do not know how this is running the query under the hood.