Request to decrease the number of Pool worker processes when downloading files

kan-fu commented 5 months ago

Hi, I am from the ONC Python client library dev team. Recently a user was blocked (and later unblocked) for downloading archivefiles multiple times per second for several days, which caused some trouble for our backend server. We reached out to him and he explained he was using your pipeline tool for the download tasks. I added a note in the README after the issue.

We appreciate people using our data for their research, and also your effort to develop tools that make access easier. But could you reduce the number of threads/subprocesses used when downloading files? A quick scan of the code led me to this line, which uses the multiprocessing library to download files (not sure if there are other places). Is it possible to make this number a command-line argument, with a default value of something like 3?

psmskelton commented 5 months ago

@kan-fu - Thanks for getting in touch. Nice to see the ONC 3.0 API is being actively developed again as it was stagnant for a few years. The multi-threaded download comes from a time and geographic location (Australia) where more threads meant more throughput, and that throughput was still quite bad. I can imagine someone closer to Canada would give your servers a problem.

@lucascesarfd - Are you able to make the following changes:

Add a user input through a parser argument such as:

    parser.add_argument(
        "--download_threads",
        required=False,
        type=int,
        choices=range(1, 5),
        metavar="[1-4]",
        help="Maximum number [1-4] of concurrent ONC download threads. Default is 2.",
        default=2
    )

Pass the new variable through download_files() and eventually download_file_list() to replace the hard-coded 20 threads.
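As a rough sketch of how that could be wired together (not the repository's actual code): the signatures of download_files() and download_file_list() below are assumptions, fetch_one() is a hypothetical stand-in for the real per-file download call, and ThreadPool is used here only to keep the example self-contained. multiprocessing.Pool accepts the same `processes` argument.

```python
import argparse
from multiprocessing.pool import ThreadPool


def build_parser():
    """Build a parser exposing the proposed --download_threads option."""
    parser = argparse.ArgumentParser(description="ONC download sketch")
    parser.add_argument(
        "--download_threads",
        required=False,
        type=int,
        choices=range(1, 5),  # accepts 1, 2, 3 or 4
        metavar="[1-4]",
        help="Maximum number [1-4] of concurrent ONC download threads. Default is 2.",
        default=2,
    )
    return parser


def fetch_one(filename):
    # Hypothetical placeholder for the real per-file download call.
    return f"downloaded {filename}"


def download_file_list(filenames, download_threads=2):
    # Assumed signature: the pool size comes from the caller
    # instead of a hard-coded 20.
    with ThreadPool(processes=download_threads) as pool:
        return pool.map(fetch_one, filenames)


def download_files(filenames, download_threads=2):
    # Assumed signature: simply forwards the user-supplied limit.
    return download_file_list(filenames, download_threads=download_threads)


def main(argv=None):
    args = build_parser().parse_args(argv)
    return download_files(["a.wav", "b.wav"], download_threads=args.download_threads)
```

Calling main(["--download_threads", "3"]) would run the placeholder downloads with at most three workers; a value outside 1-4 is rejected by argparse before any download starts.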

I likely won't have time to get things set up and tested for a few weeks.

lucascesarfd commented 5 months ago

Hi everyone.

Sure! I can make these changes to the code and update the GitHub repository.

I'll update this as soon as possible and add a command-line parameter for it.

Thanks,

Lucas Domingos



lucascesarfd / onc_dataset

Request to decrease the number of Pool worker processes when downloading files #4