rgw/sfs: `list_object_versions` S3 API call returns truncated flag even if it is not necessary

When calling the AWS S3 API function list_object_versions the response of a really simple bucket with 4 objects is truncated. IMO this should be only one call because the max keys of 1000 is not reached. Because of this unexpected behaviour the UI REST API is doing a second call to fetch the rest of the object versions. This results in the not so nice behaviour that objects are displayed several times in the UI (because the UI REST API appends each object from the response to a list). This behaviour only appears when prefixes are used which effectively are "virtual folders".

To reproduce that issue create a versioned bucket and add the following objects:

test01 bucket
|- test
|- test1
|- test2
|- b
   |- test

Use this Python code to see what happens.

# pip install asyncio aiobotocore types-aiobotocore-s3 aioboto3 boto3
import asyncio
import aiobotocore
import boto3.utils

from aiobotocore.session import AioSession
from botocore.config import Config as S3Config
from types_aiobotocore_s3.client import S3Client
from types_aiobotocore_s3.type_defs import ListObjectVersionsOutputTypeDef

ACCESS_KEY='test'
SECRET_KEY='test'
ENDPOINT_URL='http://127.0.0.1:7480'

async def do_test():
    session = AioSession()
    session.register(
        "creating-client-class.s3",
        boto3.utils.lazy_call(
            "aioboto3.s3.inject.inject_s3_transfer_methods"
        ),
    )

    client: S3Client
    async with session.create_client(
        "s3",
        endpoint_url=ENDPOINT_URL,
        aws_access_key_id=ACCESS_KEY,
        aws_secret_access_key=SECRET_KEY,
        verify=False,
        config=S3Config(
            retries={
                "max_attempts": 1,
                "mode": "standard",
            },
            s3={
                "addressing_style": "auto",
            },
        ),
    ) as client:
        # First call.
        s3_res: ListObjectVersionsOutputTypeDef = (
            await client.list_object_versions(
                Bucket="test01",
                Prefix="",
                Delimiter="/",
                KeyMarker="",
            )
        )
        print(f'Is truncated: {s3_res["IsTruncated"]}')
        for version in s3_res["Versions"]:
            print(f'  Object key: {version["Key"]}')

        # Second call which shouldn't be necessary.
        s3_res: ListObjectVersionsOutputTypeDef = (
            await client.list_object_versions(
                Bucket="test01",
                Prefix="",
                Delimiter="/",
                KeyMarker=s3_res["NextKeyMarker"],
            )
        )
        print(f'Is truncated: {s3_res["IsTruncated"]}')
        for version in s3_res["Versions"]:
            print(f'  Object key: {version["Key"]}')

asyncio.run(do_test())

Run the example app.

$ python3 s3test.py
Is truncated: True
  Object key: test
  Object key: test1
  Object key: test2
Is truncated: False
  Object key: test
  Object key: test1
  Object key: test2

The aws CLI command does not show this behaviour, but i do not know how this is running the query under the hood.

$ aws s3api list-object-versions --endpoint=http://127.0.0.1:7480 --bucket "test01" --prefix ""
{
    "Versions": [
        {
            "ETag": "\"8d1c84a7fbe1dbc559c6b5c63fa184fc-1\"",
            "Size": 26,
            "StorageClass": "STANDARD",
            "Key": "b/test",
            "VersionId": ".KVM6T5aMGJjqKUAF-lGyEqtfXdyw9t",
            "IsLatest": true,
            "LastModified": "2023-11-24T09:57:09.971Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        },
        {
            "ETag": "\"8d1c84a7fbe1dbc559c6b5c63fa184fc-1\"",
            "Size": 26,
            "StorageClass": "STANDARD",
            "Key": "test",
            "VersionId": "yiHyTfim5CK3T8x-GjjnkVwuLVGCNVX",
            "IsLatest": true,
            "LastModified": "2023-11-24T09:56:30.338Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        },
        {
            "ETag": "\"13e90c08734c1fb6faacf5cc62968a9d-1\"",
            "Size": 205,
            "StorageClass": "STANDARD",
            "Key": "test1",
            "VersionId": "E3DByaX-2OC8b6EnV5.xML37IIDT3oX",
            "IsLatest": true,
            "LastModified": "2023-11-24T09:56:30.337Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        },
        {
            "ETag": "\"66f0fb190ead6b1e2d1e880b0ce71795-1\"",
            "Size": 654,
            "StorageClass": "STANDARD",
            "Key": "test2",
            "VersionId": "Z6vswRSqqBtjq7hfH.TSXpU0oz6WV87",
            "IsLatest": true,
            "LastModified": "2023-11-24T09:56:30.327Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        }
    ]
}

Peek 2023-11-24 12-51

I see the query string for the command is different when running the python script and the aws cli.

With the python script:

2023-11-24T14:46:11.359+0100 7efd419496c0 20 QUERY_STRING=versions&prefix=&delimiter=%2F&key-marker=&encoding-type=url

With aws cli

2023-11-24T14:49:28.599+0100 7efd3c13e6c0 20 QUERY_STRING=versions&prefix=&encoding-type=url

Difference is that the python tool is passing a delimiter, while the aws cli command is not.

Adding extra info: I can see 2 queries issued for the python script while 1 for the aws cli command.

Adding: --delimiter / to the aws cli command triggers 2 queries to the backend.

It does not make any difference if the bucket is versioned or not.

When running with --delimiter "/" all objects are duplicated as well.

$ aws s3api list-object-versions --endpoint=http://127.0.0.1:7480 --bucket "test01" --prefix "" --delimiter "/"
{
    "Versions": [
        {
            "ETag": "\"8d1c84a7fbe1dbc559c6b5c63fa184fc-1\"",
            "Size": 26,
            "StorageClass": "STANDARD",
            "Key": "test",
            "VersionId": "null",
            "IsLatest": true,
            "LastModified": "2023-11-24T14:23:35.129Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        },
        {
            "ETag": "\"13e90c08734c1fb6faacf5cc62968a9d-1\"",
            "Size": 205,
            "StorageClass": "STANDARD",
            "Key": "test1",
            "VersionId": "null",
            "IsLatest": true,
            "LastModified": "2023-11-24T14:23:35.125Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        },
        {
            "ETag": "\"66f0fb190ead6b1e2d1e880b0ce71795-1\"",
            "Size": 654,
            "StorageClass": "STANDARD",
            "Key": "test2",
            "VersionId": "null",
            "IsLatest": true,
            "LastModified": "2023-11-24T14:23:35.129Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        },
        {
            "ETag": "\"8d1c84a7fbe1dbc559c6b5c63fa184fc-1\"",
            "Size": 26,
            "StorageClass": "STANDARD",
            "Key": "test",
            "VersionId": "null",
            "IsLatest": true,
            "LastModified": "2023-11-24T14:23:35.129Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        },
        {
            "ETag": "\"13e90c08734c1fb6faacf5cc62968a9d-1\"",
            "Size": 205,
            "StorageClass": "STANDARD",
            "Key": "test1",
            "VersionId": "null",
            "IsLatest": true,
            "LastModified": "2023-11-24T14:23:35.125Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        },
        {
            "ETag": "\"66f0fb190ead6b1e2d1e880b0ce71795-1\"",
            "Size": 654,
            "StorageClass": "STANDARD",
            "Key": "test2",
            "VersionId": "null",
            "IsLatest": true,
            "LastModified": "2023-11-24T14:23:35.129Z",
            "Owner": {
                "DisplayName": "M. Tester",
                "ID": "testid"
            }
        }
    ],
    "CommonPrefixes": [
        {
            "Prefix": "b/"
        }
    ]
}

The issue is caused by the following logic:

    if (!results.common_prefixes.empty()) {
      std::vector<rgw_bucket_dir_entry> objects_after;
      std::string query = std::prev(results.common_prefixes.end())->first;
      query.append(
          sfs::S3_MAX_OBJECT_NAME_BYTES - query.size(),
          std::numeric_limits<char>::max()
      );
      list.objects(get_bucket_id(), params.prefix, query, 1, objects_after);
      results.is_truncated = objects_after.size() > 0;
    }

That is called when we have common prefixes involved, like in this case.

Problem is that logic for SQLiteList::objects is based on sorting by name only...

This is the query:

auto rows = storage->select(
      columns(
          &DBObject::name, &DBVersionedObject::mtime, &DBVersionedObject::etag,
          sum(&DBVersionedObject::size)
      ),
      inner_join<DBVersionedObject>(
          on(is_equal(&DBObject::uuid, &DBVersionedObject::object_id))
      ),
      where(
          is_equal(&DBVersionedObject::object_state, ObjectState::COMMITTED) and
          is_equal(&DBObject::bucket_id, bucket_id) and
          greater_than(&DBObject::name, start_after_object_name) and
          prefix_to_like(&DBObject::name, prefix)
      ),
      group_by(&DBVersionedObject::object_id)
          .having(is_equal(
              sqlite_orm::max(&DBVersionedObject::version_type),
              VersionType::REGULAR
          )),
      order_by(&DBObject::name), limit(query_limit)
  );

The first call adds the following items:

b/test
test
test1
test2

As we're using delimiter /, b/ is considered a common prefix and the logic above is triggered.

When the code above is called parameters are:

prefix = ""
start_after_object_name = "b/"
max = 1

But the objects test, test1, and test2 have a greater name than b/ so the first of them is obtained from the database.

The last statement in the code above

results.is_truncated = objects_after.size() > 0;

results in results.is_truncated = true.

This makes rgw to trigger another call this time calling list versions with

start_after_object_name = "b/"

And, again... objects test, test1, and test2 have a greater name than b/, so they're re-added.

The s3tests test_bucket_listv2_delimiter_prefix and test_bucket_list_delimiter_prefix pass because there're both requesting for 1 single object every time.

I think we'd need some extra logic that is not only based on name order. What we need is to find out if we have something else after the common prefix, but that was not added before, to determine if the result is truncated or not.

s3gw-tech / s3gw

rgw/sfs: `list_object_versions` S3 API call returns truncated flag even if it is not necessary #838