GDrive: High CPU usage on downloading multiple files parallely

SangeethaJanakiraman commented 7 months ago

Hi,

I am trying to download files of 5 users from a single process. For each user, 20 threads are used to parallelize file download. (i.e. total of 100 threads (5 users * 20 thread) are downloading at a time) Download is done using MediaDownloader. I reduced the chunk size to 80KB to avoid LOH and GC time.

CPU - 4 core. Memory: 16GB

As seen in screenshot below, avg CPU usage by this process is 28% .

Memory consumption is ~100 - 200MB.

Quota is within the limit

Latency is also not high

Some requests are getting redirected. I am not sure if this matters

When I tried with PerfView analyzer, it only shows the callstack of SSlStream, GZipStream etc.

Can you please clarify if this high CPU usage is expected? The CPU is on avg 10 - 25% usage throughout the duration of download which is running for total of 1 hour. During scan phase, it looks normal but shoots up once the download starts for all the users.

Are there anyways I can reduce CPU usage (but not affecting throughput) Let me know if need any further information,

Thanks

amanda-tarafa commented 7 months ago

Please include a minimal but complete console application that reproduces the problem, that will help us be certain that we are testing exactly the same scenario as you.

A few general things to consider though:

How many cores does your CPU has? You are performing I/O operations, so in general you probably will benefit from more threads than cores, but that doesn't mean that the more threads, the better. How did you arrive to the 20 threads per user number? Have you tested with less threads? More threads?
In general, the more threads you have, the more CPU usage you should expect, you are precisely increasing the amount of threads so that CPU is not idle while performing I/O, right?
The libraries in this repo support async versions for all service operations. Have you considered rewriting your application so that it starts Tasks instead of Threads. The .NET runtime will then manage threads and schedulle tasks in a very efficient manner.

SangeethaJanakiraman commented 7 months ago

GDriveDownloadTest.zip Above is the sample code.

Please do the following, before executing the application.

Service account credentials are expected in C:\gcreds.json
Create a folder GDriveDownloadTest under C:\
Replace the user1 to user5 in Program.cs with actual smtp address. In my case, each was having 20k to 60k files.

My CPU has 4 cores. I tested with 10 threads, where CPU usage avg about 10%. (though occasional spikes to 25%) We are experimenting with different number of threads. Since with 20 threads, requests are not getting throttled, we decided to go with this number. For this sample program, I have used Parallel.ForEach, but my actual code uses .NET channel with calls to async versions of API. Even there, we see CPU spikes, so I wrote a sample program to narrow down.

amanda-tarafa commented 7 months ago

Since with 20 threads, requests are not getting throttled, we decided to go with this number.

What exactly do you mean by "not getting throttled"?

For this sample program, I have used Parallel.ForEach, but my actual code uses .NET channel with calls to async versions of API. Even there, we see CPU spikes, so I wrote a sample program to narrow down.

Honestly, those two different approaches to writing you code will give you, possbily, entirely different results. Basically, it's unlikely that you can narrow down whatever is happening in approach A, using approach B.

And what is the CPU usage threshold you'd consider "normal"? I'm not that surprise to see 10%-25% usage with 100 threads making HTTP requests and downloading content.

I will take a look later today to your code and library code and see if anything seems out of the ordinary. I'll report back with my findings. If you could answer the questions above that'd be helpful.

jskeet commented 7 months ago

One thing to consider: if gzip is enabled, I suspect all the responses will be compressed, which obviously takes CPU to decompress - and is pointless if these are all things like videos, images etc. It's possible this isn't used by media downloads, but would be worth checking.

SangeethaJanakiraman commented 7 months ago

One thing to consider: if gzip is enabled, I suspect all the responses will be compressed, which obviously takes CPU to decompress - and is pointless if these are all things like videos, images etc. It's possible this isn't used by media downloads, but would be worth checking.

I tried disabling gzip. But it did not affect CPU usage much. Avg usage reduced by 2% only.

What exactly do you mean by "not getting throttled"? I meant that I am not getting 403 ( rateLimitExceeded) error at any time. That implies that requests are within the Qutoa limit. isnt it?

Honestly, those two different approaches to writing you code will give you, possbily, entirely different results. Basically, it's unlikely that you can narrow down whatever is happening in approach A, using approach B.

Yes agree. But I just want to keep my sample program simple with only download logic just to check the usage. Since the sample program itself takes more CPU, my complicated production logic obviously will take more as I also get metadata for all the files and then parallelly download them.

And what is the CPU usage threshold you'd consider "normal"? I'm not that surprise to see 10%-25% usage with 100 threads making HTTP requests and downloading content.

In our production code, multiple such process may be running at any point of time. If 1 process takes 25%, then multiple processes on production can take 100% CPU.

This whole exercise is to fine tune the numbers (decide number of users per process, number of thread per user, number of parallel processes at any time etc ). So, do you have any suggestions based on your testing.

Thanks

amanda-tarafa commented 7 months ago

This whole exercise is to fine tune the numbers (decide number of users per process, number of thread per user, number of parallel processes at any time etc ). So, do you have any suggestions based on your testing.

I haven't had time to look yet, I'll report back here when I know more.

amanda-tarafa commented 7 months ago

OK, so I've looked some. First, the application you sent is not really a minimal reproduction. Threre's too much going on there, and definetely some of your code may be having an impact in performance. For instance:

You are using the sync versions of the methods, which means that threads block waiting for IO. Parallelizing on top of that has no effect.
You are filtering the list client side, where you could probably use the Files.List instead of Changes.List, as it doesn't seem you are doing anything specifically with the change information itself, just using that to download the file? Files.List has a Q field and one of the fields you can include there is the modifiedTime.
But what's more, you call to ToList on a result of the query, and then iterate over the whole thing again.
You are calling Task.Result on a couple of places. That blocks threads that could be doing something else.

I didn't run your code, it wouldn't have been useful for me to determine if there was an issue with the Google.Apis.Drive.v3 library.

What I did was I ran the following code with batchsize set to 1, 20, 50, 100. I used Visual Studio's Performance Profiler and here are the results:

Downloading files in parallel batches of 1.
Downloaded 1000 out of 1000 in 19.403887711666666 minutes.
Peak of 8% avg of 3%

Downloading files in parallel batches of 20.
Downloaded 1000 out of 1000 in 2.0180067916666666 minutes.
Peak of 11% avg of 5%

Downloading files in parallel batches of 50.
Downloaded 1000 out of 1000 in 1.9617455133333332 minutes.
Peak of 12% avg of 6%

Downloading files in parallel batches of 100.
Downloaded 1000 out of 1000 in 1.6262795116666666 minutes.
Peak of 14% avg of 6%

None of this is formal benchmarking, but the results seem very reasonable to me.

Some notes about my code:

I didn't try with 4 different users as that won't make a difference to CPU performance. I uploaded 1000 files of random size between 1MB and 3MB, to my own drive. The commented out code at the end does that.
I'm not using Files.Export, instead I'm using Files.Get, just because it was more convenient and it wouldn't affect performance. But do note that you don't need to directly use the export URL (that's meant to be used directly by a browser). The code for exporting a file it's simpler, and very similar to the Files.Get code (full example here):
```
var exportRequest = service.Files.Export(file.Id, "<content-type>");
exportRequest.MediaDownloader.ChunkSize = 80 * 1024;
var progress = await exportRequest.DownloadAsync(fileStream);
```

This is my code:

using Google.Apis.Auth.OAuth2;
using Google.Apis.Download;
using Google.Apis.Drive.v3;
using Google.Apis.Services;
using System.Diagnostics;
using DriveFile = Google.Apis.Drive.v3.Data.File;

var clientSecretsPath = Environment.GetEnvironmentVariable("TEST_CLIENT_SECRET_FILENAME");
var clientSecrets = await GoogleClientSecrets.FromFileAsync(clientSecretsPath);
var folderId = "<the-folder-id-in-drive-to-store-files>";
string contentType = "application/octet-stream";

UserCredential credential = await GoogleWebAuthorizationBroker.AuthorizeAsync(
    clientSecrets.Secrets,
    new[] { DriveService.ScopeConstants.Drive },
    "user-drive-download",
    CancellationToken.None
);

var service = new DriveService(new BaseClientService.Initializer()
{
    HttpClientInitializer = credential,
});

string? nextPageToken = null;
var listRequest = service.Files.List();
listRequest.IncludeItemsFromAllDrives = false;
listRequest.Q = $"trashed = false and '{folderId}' in parents";
listRequest.Fields = "nextPageToken, files(id, name)";
listRequest.PageSize = 100;

int batchSize = 100;
int downloaded = 0;

Console.WriteLine($"Downloading files in parallel batches of {batchSize}.");

Stopwatch stopwatch = Stopwatch.StartNew();

do
{

    listRequest.PageToken = nextPageToken;
    var listResponse = await listRequest.ExecuteAsync();
    nextPageToken = listResponse.NextPageToken;

    var fileBatches = listResponse.Files.Chunk(batchSize);
    foreach (var batch in fileBatches)
    {
        var downloaders = batch.Select(async file =>
        {
            using var fileStream = File.Create(@$"downloaded\{file.Name}");

            var getRequest = service.Files.Get(file.Id);
            getRequest.MediaDownloader.ChunkSize = 80 * 1024;

            var progress = await getRequest.DownloadAsync(fileStream);
            if (progress.Status == DownloadStatus.Failed)
            {
                Console.WriteLine($"Failed downloading {file.Id} with message {progress.Exception?.Message}.");
            }
            else
            {
                downloaded++;
            }
        });

        await Task.WhenAll(downloaders);
    }
}
while (nextPageToken is not null);

stopwatch.Stop();

Console.WriteLine($"Downloaded {downloaded} out of 1000 in {stopwatch.Elapsed.TotalMinutes} minutes.");

//var random = new Random();
//var oneMbInBytes = 1024 * 1024;
//var threeMbInBytes = 3 * oneMbInBytes + 1; // plus one because the top boundary of the range is not inclusive.
//var folderIds = new List<string> { folderId };
//Console.WriteLine("Uploading 1000 files of 1MB to 3MB");

//for (int i = 0; i < 100; i++)
//{

//    IEnumerable<Task> uploaders = Enumerable.Range(0, 10).Select(j =>
//    {
//        var stream = GenerateData();
//        var mediaUploader = service.Files.Create(
//            new DriveFile
//            {
//                Name = $"test_file_{10 * i + j}",
//                Parents = folderIds
//            },
//            stream,
//            contentType);
//        return mediaUploader.UploadAsync();
//    });
//    await Task.WhenAll(uploaders);
//}

//MemoryStream GenerateData()
//{
//    int size = random.Next(oneMbInBytes, threeMbInBytes);
//    byte[] data = new byte[size];
//    random.NextBytes(data);

//    return new MemoryStream(data);
//}

Bottom line, I don't think there's anything wrong with Google.Apis.Drive.v3, instead, there are a few aspects of your code that are possibly impacting performance.

My advice is that you bechmark your whole code, start by using the Performance Profiler to find hot paths, and remove those, etc. Then move on to more formal benchmarking so you can tweak parameters, in my code batchSize, to achieve the best balance between throughput and CPU usage, etc. I would strongly advice you to use the async versions of the library methods, instead of trying to control threads or anything else through parallelization, etc.. Then, as it is done in my code, you only need to make a decision on how many active tasks you want at any given time.

I'll leave this issue open for a few days, waiting for your aknowledgement, but unless/until you find hard evidence that there's a significant performance issue with the libraries, we won't be looking into this further.

SangeethaJanakiraman commented 7 months ago

Thanks, Amanda, for your inputs.

Though my sample code uses Task.Result which blocks the thread, my actual production code uses async throughout since it is using .NET Channel. One difference I do note in my approach vs yours is, you start 100 parallel downloads, wait for all of them to complete before you get the next page and start the next set of parallel downloads. Whereas my production works on already scanned items, so there is no wait time on these parallel downloads.

Also, what is the configuration of your machine (CPU and memory) on which you profiled the test code? Anyways, since your test confirms there is no performance issue with Drive API, I will take it up and check more on my side.

Again, thank you so much for confirming this.

amanda-tarafa commented 7 months ago

One difference I do note in my approach vs yours is, you start 100 parallel downloads, wait for all of them to complete before you get the next page and start the next set of parallel downloads. Whereas my production works on already scanned items, so there is no wait time on these parallel downloads.

This means that you end up with more than 100 parallel downloads almost certainly, right? So, your pages being of size 1000 (at least on the code you shared) means that you potentially have 1000 parallel downloads?

The machine I tried this on had, same as yours, 4 core CPU and 16GB of memory. It was idle at the time of running these test.

SangeethaJanakiraman commented 7 months ago

Actually, I have max of 20 threads only. They way it works is, Single producer thread -> does the scan, keeps adding the item to bounded channel (size = 100) 20 consumer thread -> reads the item from channel, download and continue the loop. (the moment download thread takes the item, scan will add more, so the queue will be full till the end)

So, at any time, I have only max of 20 threads doing the download. But with the above design, there is no pause in downloads between scan pages. It keeps on downloading till the end.

Also, I have another question regarding your suggestion to use Export method. It has the size limitation of 10MB, isn't it ? I think that is the reason ExportLinks was used to download. Do you see any issue with that?

amanda-tarafa commented 7 months ago

the moment download thread takes the item, scan will add more, so the queue will be full till the end

Yes, I see what you mean, whereas I have at most 20 parallel downloads you have always 20 threads downloading. I still wouldn't think that's a reason for the difference in performance though. I would still look first at some of the aspects I mentioned in my previous comment. In particular I think that sifting from manging your own threads (with sync or async versions of the methods) to relying on the scheduller to execute tasks will make a difference.

Also, I have another question regarding your suggestion to use Export method. It has the size limitation of 10MB, isn't it ? I think that is the reason ExportLinks was used to download. Do you see any issue with that?

This is a question better suited for the Drive API team through their support channels. I don't know if there's a problem with using the export link URL directly. What I can say is that the export link URL is different from the URL that calling Files.Export(...).Download(...) would use. See for instance, for the same document.

The Export operation uses

https://www.googleapis.com/drive/v3/files/<readacted_file_id>/export?mimeType=application%2Fvnd.openxmlformats-officedocument.wordprocessingml.document

where the export link for that mime type is:

https://docs.google.com/feeds/download/documents/export/Export?id=<readacted_file_id>&resourcekey=<redacted_resource_key>&exportFormat=docx

googleapis / google-api-dotnet-client