philippta / flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
https://flyscrape.com
Mozilla Public License 2.0
1.02k stars · 29 forks

Can this be used for downloading files in parallel? #5

Closed: TACIXAT closed this issue 9 months ago

TACIXAT commented 10 months ago

Can this be used for downloading files in parallel?

For example, if I wanted to download 400 GB of image embeddings from https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/img_emb/

philippta commented 10 months ago

Unfortunately this feature is not yet supported.

Tracked under:

brianlow commented 9 months ago

axel (https://github.com/axel-download-accelerator/axel) can download a list of URLs. It can download a single file in parallel, so I suspect it will also download multiple URLs in parallel.

Some other options, courtesy of ChatGPT:

GNU Parallel: This is a powerful tool for running jobs in parallel. You can use it to run multiple wget commands at once. Here's a basic example: cat urls.txt | parallel -j 10 wget. This reads URLs from the file urls.txt and uses GNU Parallel to run up to 10 wget jobs simultaneously.

xargs: Another option is to use xargs with the -P flag for parallel execution. For example: cat urls.txt | xargs -n 1 -P 10 wget. This runs wget once per URL in urls.txt, with up to 10 downloads running in parallel.

TACIXAT commented 9 months ago

Thanks for these. I am a Windows user and a bit of a command prompt purist these days (to force myself to learn).

I usually just implement scraping and downloaders in Python. Been meaning to throw together a parallel downloader in Go. Would be a cool addition here, I'll give this a try next time I am scraping.


philippta commented 9 months ago

File downloads have been added.

Example: https://github.com/philippta/flyscrape/blob/master/examples/download.js
API reference: https://github.com/philippta/flyscrape#file-downloads