Closed · SangeethaJanakiraman closed this issue 7 months ago
Please include a minimal but complete console application that reproduces the problem, that will help us be certain that we are testing exactly the same scenario as you.
A few general things to consider though:
GDriveDownloadTest.zip — the sample code that reproduces the problem is attached above.
Please do the following before executing the application.
My CPU has 4 cores. I tested with 10 threads, where CPU usage averaged about 10% (though with occasional spikes to 25%). We are experimenting with different numbers of threads. Since with 20 threads requests are not getting throttled, we decided to go with this number. For this sample program I have used Parallel.ForEach, but my actual code uses a .NET channel with calls to the async versions of the API. Even there we see CPU spikes, so I wrote a sample program to narrow things down.
> Since with 20 threads, requests are not getting throttled, we decided to go with this number.
What exactly do you mean by "not getting throttled"?
> For this sample program, I have used Parallel.ForEach, but my actual code uses .NET channel with calls to async versions of API. Even there, we see CPU spikes, so I wrote a sample program to narrow down.
Honestly, those two different approaches to writing your code will possibly give you entirely different results. Basically, it's unlikely that you can narrow down whatever is happening in approach A using approach B.
And what is the CPU usage threshold you'd consider "normal"? I'm not that surprised to see 10%-25% usage with 100 threads making HTTP requests and downloading content.
I will take a look later today to your code and library code and see if anything seems out of the ordinary. I'll report back with my findings. If you could answer the questions above that'd be helpful.
One thing to consider: if gzip is enabled, I suspect all the responses will be compressed, which obviously takes CPU to decompress - and is pointless if these are all things like videos, images etc. It's possible this isn't used by media downloads, but would be worth checking.
> One thing to consider: if gzip is enabled, I suspect all the responses will be compressed, which obviously takes CPU to decompress - and is pointless if these are all things like videos, images etc. It's possible this isn't used by media downloads, but would be worth checking.
I tried disabling gzip. But it did not affect CPU usage much. Avg usage reduced by 2% only.
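For context, gzip can be toggled when the service is constructed; `GZipEnabled` on `BaseClientService.Initializer` controls whether responses are requested (and then decompressed) as gzip. A minimal sketch, with credential setup elided:

```csharp
using Google.Apis.Drive.v3;
using Google.Apis.Services;

// GZipEnabled defaults to true; setting it to false skips the CPU cost of
// decompressing responses, which is mostly wasted on already-compressed media.
var service = new DriveService(new BaseClientService.Initializer
{
    // HttpClientInitializer = credential, // credential setup elided
    GZipEnabled = false,
});
```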
> What exactly do you mean by "not getting throttled"?

I meant that I am not getting a 403 (rateLimitExceeded) error at any time. That implies that the requests are within the quota limit, doesn't it?
> Honestly, those two different approaches to writing your code will possibly give you entirely different results. Basically, it's unlikely that you can narrow down whatever is happening in approach A using approach B.
Yes, agreed. But I just wanted to keep my sample program simple, with only the download logic, just to check the usage. Since the sample program itself takes this much CPU, my more complicated production logic will obviously take more, as I also get metadata for all the files and then download them in parallel.
> And what is the CPU usage threshold you'd consider "normal"? I'm not that surprised to see 10%-25% usage with 100 threads making HTTP requests and downloading content.
In our production code, multiple such processes may be running at any point in time. If one process takes 25%, then multiple processes in production can take 100% of the CPU.
This whole exercise is to fine-tune the numbers (decide the number of users per process, the number of threads per user, the number of parallel processes at any time, etc.). So, do you have any suggestions based on your testing?
Thanks
> This whole exercise is to fine-tune the numbers (decide the number of users per process, the number of threads per user, the number of parallel processes at any time, etc.). So, do you have any suggestions based on your testing?
I haven't had time to look yet, I'll report back here when I know more.
OK, so I've looked some. First, the application you sent is not really a minimal reproduction. There's too much going on there, and some of your code may definitely be having an impact on performance. For instance:
`modifiedTime`
I didn't run your code; it wouldn't have been useful for determining whether there was an issue with the Google.Apis.Drive.v3 library.
What I did was run the following code with `batchSize` set to 1, 20, 50, and 100. I used Visual Studio's Performance Profiler, and here are the results:
All four runs downloaded 1000 out of 1000 files.

| `batchSize` | Time (minutes) | Peak CPU | Avg CPU |
| --- | --- | --- | --- |
| 1 | 19.40 | 8% | 3% |
| 20 | 2.02 | 11% | 5% |
| 50 | 1.96 | 12% | 6% |
| 100 | 1.63 | 14% | 6% |
None of this is formal benchmarking, but the results seem very reasonable to me.
Some notes about my code:
I'm not using Files.Export; instead I'm using Files.Get, just because it was more convenient, and it wouldn't affect performance. But do note that you don't need to use the export URL directly (that's meant to be used directly by a browser). The code for exporting a file is simpler, and very similar to the Files.Get code (full example here):
```csharp
var exportRequest = service.Files.Export(file.Id, "<content-type>");
exportRequest.MediaDownloader.ChunkSize = 80 * 1024;
var progress = await exportRequest.DownloadAsync(fileStream);
```
This is my code:

```csharp
using Google.Apis.Auth.OAuth2;
using Google.Apis.Download;
using Google.Apis.Drive.v3;
using Google.Apis.Services;
using System.Diagnostics;
using DriveFile = Google.Apis.Drive.v3.Data.File;

var clientSecretsPath = Environment.GetEnvironmentVariable("TEST_CLIENT_SECRET_FILENAME");
var clientSecrets = await GoogleClientSecrets.FromFileAsync(clientSecretsPath);
var folderId = "<the-folder-id-in-drive-to-store-files>";
string contentType = "application/octet-stream";

UserCredential credential = await GoogleWebAuthorizationBroker.AuthorizeAsync(
    clientSecrets.Secrets,
    new[] { DriveService.ScopeConstants.Drive },
    "user-drive-download",
    CancellationToken.None);

var service = new DriveService(new BaseClientService.Initializer()
{
    HttpClientInitializer = credential,
});

string? nextPageToken = null;
var listRequest = service.Files.List();
listRequest.IncludeItemsFromAllDrives = false;
listRequest.Q = $"trashed = false and '{folderId}' in parents";
listRequest.Fields = "nextPageToken, files(id, name)";
listRequest.PageSize = 100;

int batchSize = 100;
int downloaded = 0;
Console.WriteLine($"Downloading files in parallel batches of {batchSize}.");
Stopwatch stopwatch = Stopwatch.StartNew();
do
{
    listRequest.PageToken = nextPageToken;
    var listResponse = await listRequest.ExecuteAsync();
    nextPageToken = listResponse.NextPageToken;
    var fileBatches = listResponse.Files.Chunk(batchSize);
    foreach (var batch in fileBatches)
    {
        var downloaders = batch.Select(async file =>
        {
            using var fileStream = File.Create(@$"downloaded\{file.Name}");
            var getRequest = service.Files.Get(file.Id);
            getRequest.MediaDownloader.ChunkSize = 80 * 1024;
            var progress = await getRequest.DownloadAsync(fileStream);
            if (progress.Status == DownloadStatus.Failed)
            {
                Console.WriteLine($"Failed downloading {file.Id} with message {progress.Exception?.Message}.");
            }
            else
            {
                // Interlocked avoids a data race on the shared counter
                // across concurrent download tasks.
                Interlocked.Increment(ref downloaded);
            }
        });
        await Task.WhenAll(downloaders);
    }
}
while (nextPageToken is not null);
stopwatch.Stop();
Console.WriteLine($"Downloaded {downloaded} out of 1000 in {stopwatch.Elapsed.TotalMinutes} minutes.");

// The upload code used to create the 1000 test files:
//var random = new Random();
//var oneMbInBytes = 1024 * 1024;
//var threeMbInBytes = 3 * oneMbInBytes + 1; // plus one because the top boundary of the range is not inclusive.
//var folderIds = new List<string> { folderId };
//Console.WriteLine("Uploading 1000 files of 1MB to 3MB");
//for (int i = 0; i < 100; i++)
//{
//    IEnumerable<Task> uploaders = Enumerable.Range(0, 10).Select(j =>
//    {
//        var stream = GenerateData();
//        var mediaUploader = service.Files.Create(
//            new DriveFile
//            {
//                Name = $"test_file_{10 * i + j}",
//                Parents = folderIds
//            },
//            stream,
//            contentType);
//        return mediaUploader.UploadAsync();
//    });
//    await Task.WhenAll(uploaders);
//}
//MemoryStream GenerateData()
//{
//    int size = random.Next(oneMbInBytes, threeMbInBytes);
//    byte[] data = new byte[size];
//    random.NextBytes(data);
//    return new MemoryStream(data);
//}
```
Bottom line: I don't think there's anything wrong with Google.Apis.Drive.v3; instead, there are a few aspects of your code that are possibly impacting performance.
My advice is that you benchmark your whole code: start by using the Performance Profiler to find hot paths, remove those, etc. Then move on to more formal benchmarking so you can tweak parameters (in my code, `batchSize`) to achieve the best balance between throughput and CPU usage. I would strongly advise you to use the async versions of the library methods, instead of trying to control threads or anything else through parallelization. Then, as is done in my code, you only need to decide how many active tasks you want at any given time.
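The "decide how many active tasks" advice can also be sketched with a `SemaphoreSlim` gate rather than batching. This is a minimal, self-contained illustration, not the library's API: the `Task.Delay` stands in for a real download call, and `fileIds` is invented for the example.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int completed = 0;
var fileIds = Enumerable.Range(0, 100).Select(i => $"file_{i}").ToList();

// The gate admits at most 20 downloads in flight at any time.
var gate = new SemaphoreSlim(20);

var tasks = fileIds.Select(async id =>
{
    await gate.WaitAsync();
    try
    {
        await Task.Delay(1); // stand-in for a real async download call
        Interlocked.Increment(ref completed);
    }
    finally
    {
        gate.Release(); // let the next queued task start
    }
}).ToList();

await Task.WhenAll(tasks);
Console.WriteLine($"Downloaded {completed} of {fileIds.Count} files.");
```

Unlike fixed-size batches, the gate starts a new task the moment any slot frees up, so throughput doesn't stall on the slowest item in a batch.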
I'll leave this issue open for a few days, waiting for your acknowledgement, but unless/until you find hard evidence that there's a significant performance issue with the libraries, we won't be looking into this further.
Thanks, Amanda, for your inputs.
Though my sample code uses Task.Result, which blocks the thread, my actual production code uses async throughout since it is built on a .NET Channel. One difference I do note between my approach and yours: you start 100 parallel downloads and wait for all of them to complete before getting the next page and starting the next set of parallel downloads, whereas my production code works on already-scanned items, so there is no wait time between these parallel downloads.
Also, what is the configuration of the machine (CPU and memory) on which you profiled the test code? Anyway, since your test confirms there is no performance issue with the Drive API, I will take it up and check more on my side.
Again, thank you so much for confirming this.
> One difference I do note in my approach vs yours is, you start 100 parallel downloads, wait for all of them to complete before you get the next page and start the next set of parallel downloads. Whereas my production works on already scanned items, so there is no wait time on these parallel downloads.
This almost certainly means that you end up with more than 100 parallel downloads, right? So your pages being of size 1000 (at least in the code you shared) means that you potentially have 1000 parallel downloads?
The machine I tried this on had, same as yours, a 4-core CPU and 16GB of memory. It was idle at the time of running these tests.
Actually, I have a max of 20 threads only. The way it works is:

- A single producer thread does the scan and keeps adding items to a bounded channel (size = 100).
- 20 consumer threads read items from the channel, download them, and continue the loop. (The moment a download thread takes an item, the scan adds more, so the queue stays full until the end.)

So, at any time, I have at most 20 threads doing downloads. But with the above design, there is no pause in downloads between scan pages; it keeps downloading until the end.
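The producer/consumer design described above can be sketched with `System.Threading.Channels`. This is a self-contained illustration, not the production code: `Task.Delay` stands in for the real download, and the 500-item scan is invented for the example.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

int downloaded = 0;

// Bounded channel of capacity 100: the producer blocks when it is full,
// so the queue stays topped up while consumers drain it.
var channel = Channel.CreateBounded<string>(100);

// Single producer: the "scan" that feeds items into the channel.
var producer = Task.Run(async () =>
{
    for (int i = 0; i < 500; i++)
        await channel.Writer.WriteAsync($"file_{i}"); // waits when the channel is full
    channel.Writer.Complete(); // signals consumers there is nothing more to read
});

// 20 consumers: each reads an item, "downloads" it, and loops until the
// channel is completed and drained.
var consumers = Enumerable.Range(0, 20).Select(_ => Task.Run(async () =>
{
    await foreach (var fileId in channel.Reader.ReadAllAsync())
    {
        await Task.Delay(1); // stand-in for the real download
        Interlocked.Increment(ref downloaded);
    }
})).ToArray();

await producer;
await Task.WhenAll(consumers);
Console.WriteLine($"Downloaded {downloaded} items.");
```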
Also, I have another question regarding your suggestion to use the Export method. It has a size limitation of 10MB, doesn't it? I think that is the reason ExportLinks was used for the download. Do you see any issue with that?
> the moment download thread takes the item, scan will add more, so the queue will be full till the end
Yes, I see what you mean: whereas I have at most 20 parallel downloads, you always have 20 threads downloading. I still wouldn't think that's the reason for the difference in performance, though. I would still look first at some of the aspects I mentioned in my previous comment. In particular, I think that shifting from managing your own threads (with sync or async versions of the methods) to relying on the scheduler to execute tasks will make a difference.
> Also, I have another question regarding your suggestion to use Export method. It has the size limitation of 10MB, isn't it? I think that is the reason ExportLinks was used to download. Do you see any issue with that?
This is a question better suited for the Drive API team through their support channels. I don't know if there's a problem with using the export link URL directly. What I can say is that the export link URL is different from the URL that calling Files.Export(...).Download(...) would use. See, for instance, the two URLs below for the same document.
The Export operation uses:

`https://www.googleapis.com/drive/v3/files/<redacted_file_id>/export?mimeType=application%2Fvnd.openxmlformats-officedocument.wordprocessingml.document`

whereas the export link for that MIME type is:

`https://docs.google.com/feeds/download/documents/export/Export?id=<redacted_file_id>&resourcekey=<redacted_resource_key>&exportFormat=docx`
Hi,
I am trying to download the files of 5 users from a single process. For each user, 20 threads are used to parallelize the file downloads (i.e. a total of 100 threads (5 users * 20 threads) are downloading at a time). The download is done using MediaDownloader. I reduced the chunk size to 80KB to avoid LOH allocations and GC time.
CPU - 4 core. Memory: 16GB
As seen in the screenshot below, the avg CPU usage of this process is 28%.

- Memory consumption is ~100-200MB.
- Quota is within the limit.
- Latency is also not high.
- Some requests are getting redirected; I am not sure if this matters.
When I tried the PerfView analyzer, it only shows the call stacks of SslStream, GZipStream, etc.
Can you please clarify whether this high CPU usage is expected? CPU usage averages 10-25% throughout the download, which runs for a total of 1 hour. During the scan phase it looks normal, but it shoots up once the download starts for all the users.
Are there any ways I can reduce CPU usage without affecting throughput? Let me know if you need any further information.
Thanks