snap-research / Panda-70M

[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
https://snap-research.github.io/Panda-70M/

How to avoid IP blocking? #37

Open vedantroy opened 3 months ago

vedantroy commented 3 months ago

I wrote a downloader using yt-dlp, but many of the IPs get blocked after roughly 10K downloads. I'm surprised people are successfully downloading the dataset on a single machine with the provided download script, as I would strongly expect YouTube to block an IP after a few gigabytes of data are downloaded.

Are there any proxies, tools, or tricks for downloading the entire dataset while avoiding YouTube blocking?

tsaishien-chen commented 3 months ago

Hi @vedantroy, thanks for your interest in this dataset! Unfortunately, this is quite a common issue. You can check some discussions like this one. The best solution is to use a VPN and switch to a different IP once you detect that your current IP is banned. If you don't have a VPN, you can try to slow down the download by reducing processes_count and thread_count in the config file, and by setting a sleep counter that pauses after every few download steps. Hope this information is helpful!
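
For the "detect that your IP is banned" step, here is a minimal sketch of one possible heuristic. It is not part of the Panda-70M download script. The `BlockDetector` class, its `threshold` parameter, and the idea of treating a run of consecutive failures as a suspected block are all assumptions made for illustration. Individual failures can also come from removed or private videos, so the threshold would need tuning.

```python
class BlockDetector:
    """Hypothetical helper: flag a suspected IP block after a run of failures."""

    def __init__(self, threshold: int = 5) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> bool:
        """Record one download result.

        Returns True when the last `threshold` attempts (or more) all failed,
        which this sketch treats as a sign that the IP may be blocked.
        """
        if success:
            # Any success resets the run of failures.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

When `record` returns True, the downloader could pause, switch the VPN endpoint, and resume from the last failed item.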

peiliu0408 commented 1 month ago

@tsaishien-chen I have been troubled by this IP block issue for quite some time. Is there a template available for implementing a 'sleep counter' after a few download steps?
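
One way a "sleep counter" could look is sketched below. This is only an illustration, not the repository's own code: `download_with_sleep_counter`, `download_one`, `batch_size`, and `pause_seconds` are hypothetical names, and `download_one` stands in for whatever function actually fetches a single video and reports success.

```python
import time
from typing import Callable, Iterable, List


def download_with_sleep_counter(
    urls: Iterable[str],
    download_one: Callable[[str], bool],  # hypothetical: returns True on success
    batch_size: int = 50,                 # assumed value; tune for your setup
    pause_seconds: float = 60.0,          # assumed value; tune for your setup
    sleep_fn: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Download URLs one by one, pausing after every `batch_size` attempts.

    Returns the URLs for which `download_one` reported failure.
    """
    failed: List[str] = []
    for count, url in enumerate(urls, start=1):
        if not download_one(url):
            failed.append(url)
        # The sleep counter: pause once every `batch_size` attempts.
        if count % batch_size == 0:
            sleep_fn(pause_seconds)
    return failed
```

The `sleep_fn` parameter exists only so the pause can be swapped out in tests. The same counting idea should apply inside each worker if several processes or threads are used.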