rom1504 / cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
MIT License

add some options to make it possible to get other stuff than images #22

Closed by rom1504 1 year ago

rom1504 commented 1 year ago

mp3, pdf, mp4, platform links, ...

rom1504 commented 1 year ago

options:

  1. let the user provide their own function
  2. provide N filters that are reasonable (eg audio, video, document, image)
  3. let the user provide a list of extensions

I'm leaning towards 2. Then there will be a modality option to pick the one the user likes.
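
A minimal sketch of how options 2 and 3 could fit together, assuming links are filtered by URL file extension. The names `MODALITY_EXTENSIONS`, `url_extension` and `make_link_filter`, as well as the extension lists themselves, are hypothetical and only for illustration; they are not cc2dataset's actual code.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical per-modality extension lists (illustrative, not exhaustive).
MODALITY_EXTENSIONS = {
    "image": {"jpg", "jpeg", "png", "gif", "webp"},
    "audio": {"mp3", "wav", "flac", "ogg"},
    "video": {"mp4", "webm", "mkv", "avi"},
    "document": {"pdf", "epub", "djvu"},
}


def url_extension(url):
    """Return the lowercased file extension of a URL path, without the dot."""
    path = urlparse(url).path  # drops the query string and fragment
    return posixpath.splitext(path)[1].lstrip(".").lower()


def make_link_filter(modality=None, extensions=None):
    """Build a predicate that keeps links matching a modality (option 2)
    or a user-provided list of extensions (option 3)."""
    if extensions is None:
        if modality not in MODALITY_EXTENSIONS:
            raise ValueError(f"unknown modality: {modality!r}")
        extensions = MODALITY_EXTENSIONS[modality]
    # Normalize so that ".PDF", "pdf" and ".pdf" are all accepted.
    allowed = {ext.lstrip(".").lower() for ext in extensions}
    return lambda url: url_extension(url) in allowed


if __name__ == "__main__":
    keep_audio = make_link_filter(modality="audio")
    links = [
        "https://example.com/talks/episode1.mp3?dl=1",
        "https://example.com/img/cover.png",
    ]
    print([link for link in links if keep_audio(link)])
```

Option 1 (a user-provided function) would fit the same shape: any callable that takes a URL and returns a boolean could replace the predicate returned here.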

rom1504 commented 1 year ago

make an analysis notebook/example for people to easily understand filters and what kind of content they can get

rom1504 commented 1 year ago

document_type is now a param

rom1504 commented 1 year ago

done for audio