terricain / aioboto3

Wrapper to use boto3 resources with the aiobotocore async backend
Apache License 2.0
719 stars 74 forks

Slow parallel upload #343

Closed albertalexandrov closed 1 month ago

albertalexandrov commented 1 month ago

Hi there!

I need to upload multiple files to S3 in parallel. The script I use to upload:

import asyncio

from httpx import AsyncClient

async def upload():
    async with AsyncClient(base_url="http://localhost:8002", verify=False) as client:
        files = {"file": ("file.txt", "c29tZSBjb250ZW50Cg==")}
        await client.post("/api/v1/esm/test", files=files, timeout=60)

async def main():
    coros = [upload() for _ in range(10)]
    await asyncio.gather(*coros)

if __name__ == '__main__':
    asyncio.run(main())

And there is an attachment service I upload to. If I upload files like this:

import time
from uuid import uuid4

import aioboto3
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/test")
async def test_upload(file: UploadFile):
    start = time.time()
    # A new session (and client) is created on every request
    session = aioboto3.Session(
        aws_access_key_id="9HiK2uBRB34jI4Xn",
        aws_secret_access_key="nb3fv7qtqa3oxgBpFTycDlfxqpItas7e",
    )
    async with session.client(service_name="s3", endpoint_url="http://localhost:9000") as s3:
        await s3.upload_fileobj(
            Fileobj=file.file,
            Bucket="open-info",
            Key=str(uuid4()),
            ExtraArgs={"ContentType": file.content_type}
        )
    print("Upload duration", time.time() - start)

then I get the following results:

Upload duration 1.685148000717163
Upload duration 1.332367181777954
Upload duration 1.1115410327911377
Upload duration 0.9266300201416016
Upload duration 0.7629110813140869
Upload duration 0.5591249465942383
Upload duration 1.2253949642181396
Upload duration 0.2330770492553711
Upload duration 0.6720459461212158
Upload duration 0.34050798416137695

As you can see, it's rather slow for uploading 20 bytes.

But if I do not create the session on each request and instead create it once:

import time
from uuid import uuid4

import aioboto3
from fastapi import FastAPI, UploadFile

app = FastAPI()

BOTO_SESSION = None

@app.post("/test")
async def test_upload(file: UploadFile):
    start = time.time()
    global BOTO_SESSION
    # Lazily create the session on the first request, then reuse it
    if not BOTO_SESSION:
        BOTO_SESSION = aioboto3.Session(
            aws_access_key_id="9HiK2uBRB34jI4Xn",
            aws_secret_access_key="nb3fv7qtqa3oxgBpFTycDlfxqpItas7e",
        )
    async with BOTO_SESSION.client(service_name="s3", endpoint_url="http://localhost:9000") as s3:
        await s3.upload_fileobj(
            Fileobj=file.file,
            Bucket="open-info",
            Key=str(uuid4()),
            ExtraArgs={"ContentType": file.content_type}
        )
    print("Upload duration", time.time() - start)

It becomes much faster:

Upload duration 0.41297316551208496
Upload duration 0.22896623611450195
Upload duration 0.22276878356933594
Upload duration 0.21547317504882812
Upload duration 0.21387410163879395
Upload duration 0.23035597801208496
Upload duration 0.24096202850341797
Upload duration 0.2120680809020996
Upload duration 0.22047185897827148
Upload duration 0.23693513870239258

What's wrong with first case?

terricain commented 1 month ago

Overall, the S3 upload_fileobj implementations are not equal; they're similar but not identical, as boto3 calls out to the s3transfer module, and here we attempt to replicate it as best we can. That being said, I'd expect it to take less than a second, so that seems a bit odd.

Try caching and reusing the async client; there's no need to make a client on every request, and the same applies to the session. Look into setting up an AsyncExitStack (either that, or just call session.client().__aenter__(), though that's slightly less clean).
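
The AsyncExitStack idea can be sketched without any AWS dependencies. This is a minimal, self-contained illustration of the pattern only: `FakeClient` is a hypothetical stand-in for the `session.client("s3", ...)` async context manager, and `app_lifetime` stands in for your framework's startup/shutdown hooks; neither is aioboto3 or FastAPI API.

```python
import asyncio
from contextlib import AsyncExitStack

class FakeClient:
    """Stand-in for an aioboto3 client context manager (illustrative only)."""
    def __init__(self):
        self.open = False

    async def __aenter__(self):
        self.open = True
        return self

    async def __aexit__(self, *exc):
        self.open = False
        return False

async def app_lifetime():
    stack = AsyncExitStack()
    # Startup: enter the client context once and keep it for the app's lifetime.
    client = await stack.enter_async_context(FakeClient())
    states = [client.open]          # True while the app serves requests
    # ... every request handler reuses `client` instead of creating a new one ...
    # Shutdown: close everything the stack is holding.
    await stack.aclose()
    states.append(client.open)      # False after shutdown
    return states

states = asyncio.run(app_lifetime())
print(states)  # [True, False]
```

In a real app, the stack would live for the whole process: enter the client in a startup hook, and call `stack.aclose()` in the shutdown hook so the client is cleaned up properly.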

albertalexandrov commented 1 month ago

I now create a session once at application startup and reuse it. That helped.
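
The fix can be sketched in miniature: construct the session once and hand every request the cached instance. The `Session` class and `get_session` helper below are hypothetical stand-ins for `aioboto3.Session`, used only to show that the constructor runs a single time.

```python
import functools

class Session:
    """Stand-in for aioboto3.Session; counts how often it is constructed."""
    instances = 0

    def __init__(self):
        Session.instances += 1

@functools.lru_cache(maxsize=1)
def get_session():
    # First call constructs the session; later calls return the cached one.
    return Session()

# Ten "requests" all receive the same session object.
sessions = [get_session() for _ in range(10)]
assert all(s is sessions[0] for s in sessions)
print(Session.instances)  # 1
```

This avoids repeating the per-session setup cost (credential resolution, config loading) on every request, which is what the timing difference above reflects.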