rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of caption and document pairs: image/text, audio/text, video/text, ...
MIT License

Add diagram of the advised processing pipeline #37

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago
  1. cc2dataset
  2. stage 2 filtering (safety, clip, ...)
  3. any2dataset
  4. pair training (pair contrastive, pair generative)
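
As a starting point for the diagram, the four steps above could be rendered as a plain-text flow. This is only a minimal sketch: the stage labels are copied from the list, while `STAGES` and `render_pipeline` are hypothetical names made up for illustration and are not part of cc2dataset.

```python
# Stage labels are taken verbatim from the list in this issue.
# STAGES and render_pipeline are illustrative names, not a cc2dataset API.
STAGES = [
    "cc2dataset",
    "stage 2 filtering (safety, clip, ...)",
    "any2dataset",
    "pair training (pair contrastive, pair generative)",
]


def render_pipeline(stages):
    """Return a top-to-bottom text diagram with one boxed line per stage."""
    width = max(len(stage) for stage in stages)
    border = "+" + "-" * (width + 2) + "+"
    arrow_indent = " " * (width // 2 + 2)
    lines = []
    for index, stage in enumerate(stages):
        if index > 0:
            # Connect the previous box to this one with a downward arrow.
            lines.append(arrow_indent + "|")
            lines.append(arrow_indent + "v")
        lines.append(border)
        lines.append("| " + stage.ljust(width) + " |")
        lines.append(border)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_pipeline(STAGES))
```

Keeping the stages in a plain list means the tweakable options for each step could later be attached as extra fields without changing the rendering logic.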

Explain what can be tweaked at each step and how much each setting matters.

Explain the value of the approach (LAION-5B and similar datasets).