m-bain / webvid

Large-scale text-video dataset. 10 million captioned short videos.

Question about the page_dir. #8

Closed by HenryHZY 1 year ago

HenryHZY commented 2 years ago

Hi @m-bain, I have downloaded webvid10M_val using your 'results_10M_val.csv' and 'download.py'. I would like to ask some questions about page_dir.

  1. How do you produce the page_dir for each videoid? Is it something like videoid % 10?

  2. What is the purpose of page_dir? Is it to avoid the limit on the maximum number of files per directory?

  3. What file order should I use for pre-training? Should I follow page_dir from smallest to largest, or follow results_10M_train.csv from the first row to the last?

Thank you:)

m-bain commented 1 year ago

Hi, sorry I missed this.

  1. page_dir indicates the page the video was scraped from (typically 200 videos per page, I think).
  2. Yes, it is to avoid too many files per directory; many file systems discourage more than 10K files per directory.
  3. Pretraining order should be random.
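
For anyone reading later, here is a minimal sketch of points 2 and 3: grouping files by page_dir and reading the rows in random order. It is not the code from download.py. The `<root>/<page_dir>/<videoid>.mp4` layout, the column names, the helper names `video_path` and `shuffled_rows`, and the sample rows are all assumptions made for illustration, so check them against your own CSV and download folder.

```python
import csv
import io
import random
from pathlib import Path


def video_path(root, row):
    # Assumed layout (not confirmed in this thread): <root>/<page_dir>/<videoid>.mp4
    return Path(root) / row["page_dir"] / f"{row['videoid']}.mp4"


def shuffled_rows(csv_file, seed):
    # Read every row, then shuffle with a seeded generator so the
    # random order can be reproduced between runs.
    rows = list(csv.DictReader(csv_file))
    random.Random(seed).shuffle(rows)
    return rows


# Made-up rows for illustration only; the real CSV has more columns and rows.
SAMPLE_CSV = """videoid,page_dir,name
101,page_a,first clip
102,page_a,second clip
203,page_b,third clip
"""

rows = shuffled_rows(io.StringIO(SAMPLE_CSV), seed=0)
paths = [video_path("videos", row) for row in rows]
```

With a real file you would pass `open("results_10M_train.csv", newline="")` instead of the `io.StringIO` sample, and change the seed per epoch if you want a different order each time.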