robinrodricks / FluentStorage

A polycloud .NET cloud storage abstraction layer. Provides Blob storage (AWS S3, GCP, FTP, SFTP, Azure Blob/File/Event Hub/Data Lake) and Messaging (AWS SQS, Azure Queue/ServiceBus). Supports .NET 5+ and .NET Standard 2.0+. Pure C#.
MIT License

Allow directory recursion to happen on S3/MinIO (rather than locally) #68

Closed by NickHarmer 1 month ago

NickHarmer commented 2 months ago

Description

When recursing into multiple subdirectories on a locally-hosted MinIO server, ListFolderAsync() reports an inconsistent number of files from one call to the next. After some investigation, I determined that the issue lies in the local recursion code, although I have not been able to identify the exact cause. The problem seems to be here:

await Task.WhenAll(folders.Select(f => ListFolderAsync(container, f.FullPath, options, cancellationToken))).ConfigureAwait(false);

If I change this to:

foreach(var folder in folders)
{
    await ListFolderAsync(container, folder.FullPath, options, cancellationToken).ConfigureAwait(false);
}

or simply change the AsyncLimiter value from 10 down to 1, everything works reliably (but much more slowly!)
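For comparison, here is a minimal sketch of bounded-parallel recursion in which each recursive call returns its own list, so no collection is appended to from several tasks at once. It is not the FluentStorage implementation, and it does not claim to diagnose the problem above: RecursionSketch, ListOneLevel and the SemaphoreSlim limiter are hypothetical names that stand in for the real listing call and the AsyncLimiter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class RecursionSketch
{
    // Hypothetical stand-in for one non-recursive listing request:
    // returns the immediate subfolders and files of a single path.
    public delegate Task<(IReadOnlyList<string> Folders, IReadOnlyList<string> Files)> ListOneLevel(string path);

    public static async Task<List<string>> ListRecursiveAsync(
        ListOneLevel listOneLevel, string path, SemaphoreSlim limiter)
    {
        (IReadOnlyList<string> Folders, IReadOnlyList<string> Files) level;

        // Hold the limiter only while the listing request is in flight, so a
        // parent that is waiting for its children never starves them of slots.
        await limiter.WaitAsync().ConfigureAwait(false);
        try
        {
            level = await listOneLevel(path).ConfigureAwait(false);
        }
        finally
        {
            limiter.Release();
        }

        var result = new List<string>(level.Files);

        // Each child task builds and returns its own list; the lists are merged
        // here, on a single logical thread, after all children have completed.
        var children = await Task.WhenAll(
            level.Folders.Select(f => ListRecursiveAsync(listOneLevel, f, limiter)))
            .ConfigureAwait(false);

        foreach (var child in children)
        {
            result.AddRange(child);
        }

        return result;
    }
}
```

In this sketch, passing new SemaphoreSlim(1) plays the role of lowering the limiter from 10 to 1: requests are issued one at a time, and the merged result should be the same either way.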

But there is also an option to have the recursion done on the S3 server itself. In this case, it just returns all the file names to the client using the NextContinuationToken mechanism. In my tests, this also works reliably, and no more slowly than the local recursion method.
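To illustrate the continuation-token mechanism: with S3's ListObjectsV2, a request that sets a prefix but no delimiter returns every key under that prefix in pages, and each truncated response carries a NextContinuationToken to send back with the next request. The sketch below shows only the paging loop. PagingSketch and the ListPage delegate are hypothetical names standing in for the real SDK call, which is deliberately left out so the example runs without any AWS package.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PagingSketch
{
    // Hypothetical stand-in for one paged list request. It receives the token
    // from the previous page (null on the first call) and returns one page of
    // keys plus the token for the next page (null when there are no more pages).
    public delegate Task<(IReadOnlyList<string> Keys, string? NextToken)> ListPage(string? continuationToken);

    public static async Task<List<string>> ListAllAsync(ListPage listPage)
    {
        var keys = new List<string>();
        string? token = null;

        do
        {
            // Pages are fetched strictly one after another, because each
            // request needs the token returned by the previous response.
            var page = await listPage(token).ConfigureAwait(false);
            keys.AddRange(page.Keys);
            token = page.NextToken;
        } while (token != null);

        return keys;
    }
}
```

Because the server walks the whole prefix itself, the client makes one sequential chain of requests instead of one request per folder.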

So, this PR adds additional properties to the ListOptions class, to allow the user to specify local or remote recursion, and also to control the number of parallel tasks used for local recursion. The defaults remain as before.

--- UPDATE ---

Attached is a minimal reproduction solution. Usage:

minio-repro <bucket name> <api key> <secret key> <s3 key> (<serviceUrl> | <region>)

<s3 key> is the s3 key of a folder inside the bucket containing test data

The code will query the folder 10 times in succession using Storage.ListFilesAsync() with recursion enabled and report the number of files returned on each query. In my testing, with a bucket containing 320 files spread across multiple folders, I get this:

d:\projects\minio-repro>minio-repro <bucket name> <api key> <secret key> <s3 key> <minio server url>
Attempt #1 ... files: 317
Attempt #2 ... files: 319
Attempt #3 ... files: 316
Attempt #4 ... files: 319
Attempt #5 ... files: 319
Attempt #6 ... files: 318
Attempt #7 ... files: 319
Attempt #8 ... files: 319
Attempt #9 ... files: 318
Attempt #10 ... files: 314

minio-repro.zip

robinrodricks commented 1 month ago

This is a fantastic contribution.

I want to request a follow-up PR. Can we have Remote recursion set as the default for S3 and MinIO, and everywhere else it is natively supported? That would be great, thanks.

NickHarmer commented 1 month ago

Robin

OK - I'll do it shortly

Nick

robinrodricks commented 1 month ago

Hey Nick, could you let me know how I can accomplish this? I can put in some time.

Can we have the Remote recursion set as default for S3 and MinIO and all the places where it is natively supported?

NickHarmer commented 2 weeks ago

Robin

I've looked into this, and S3/MinIO appears to be the only provider for which this is configurable. Azure/GCP/FTP always recurse remotely, and SFTP/Disk/ZIP always recurse locally.

I'll submit a PR to change the default in ListOptions from Local to Remote.

robinrodricks commented 2 weeks ago

Wonderful!