m-bain / webvid

Large-scale text-video dataset. 10 million captioned short videos.

Script to download the videos #4

Closed bryant1410 closed 2 years ago

bryant1410 commented 2 years ago

Just saw you released a download script. FWIW, this is what I used to download the 2M version, just wanted to share it. I think it's simpler (it uses csvkit and parallel) but maybe it has fewer features:

# wget -O does not create missing directories, so make the target first.
mkdir -p videos/val
csvcut -c videoid,contentUrl results_2M_val.csv \
  | sed 1d \
  | parallel --resume -v --joblog val.log --bar -j 8 'video_id=$(echo {} | cut -d , -f 1) && url=$(echo {} | cut -d , -f 2) && extension=${url##*.} && wget --no-clobber -qO "videos/val/$video_id.$extension" "$url"'

It downloads with 8 jobs in parallel (the flag -j 8).
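
For anyone reading the one-liner, the per-row logic can be pulled out into a small function. This is only a sketch of the same idea, not part of the original command: `target_path` is a hypothetical helper name, the example URL is made up, and it uses shell parameter expansion instead of `cut`, so it assumes each row has exactly two comma-separated fields (`videoid,contentUrl`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the output path for one "videoid,contentUrl" row.
# Assumes exactly two fields per row, as produced by the csvcut step.
target_path() {
  local row="$1" out_dir="$2"
  local video_id="${row%%,*}"    # text before the first comma
  local url="${row#*,}"          # text after the first comma
  local extension="${url##*.}"   # text after the last dot in the URL
  printf '%s/%s.%s\n' "$out_dir" "$video_id" "$extension"
}

# Made-up example row; prints videos/val/12345.mp4
target_path '12345,https://example.com/clips/12345.mp4' videos/val
```

The `${url##*.}` expansion is the same one the one-liner uses to keep each file's original extension.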

bryant1410 commented 2 years ago

I'm closing this cause it's not an actual issue but something I just wanted to share.

bryant1410 commented 2 years ago

Btw, happy to see a script was shared! Simplifies users' lives.

m-bain commented 2 years ago

Yeah, I think my way is defo not the simplest or the fastest :') -- yours is a neat one-liner. I'm looking into img2dataset atm too.

bryant1410 commented 2 years ago

Yeah, I saw you shared img2dataset. Sounds interesting for starting without even pre-downloading the dataset!

RyanMarten commented 1 year ago

@bryant1410 @m-bain I used your script and scaled it to run simultaneously on 1,000 VMs in a GCP batch job: https://github.com/RyanMarten/distributed_gcp_youtube_download

It downloads WebVid10m in 10 minutes.

jiaxiangc commented 10 months ago

> Yeah, I saw you shared img2dataset. Sounds interesting for starting without even pre-downloading the dataset!

Thanks for your contribution; it makes downloading the videos quick.