rom1504 / cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
MIT License

investigate if computing count instead of drop duplicates would be fast #47

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago

Having the number of occurrences per sample would be useful.

It seems drop duplicates is executed as a hash aggregate, so a group by with an aggregate (avg, or a count) may cost about the same.
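To illustrate the reasoning, here is a minimal plain-Python sketch (not Spark code, and not taken from cc2dataset). It assumes deduplication is done with a hash table keyed on the whole row, which is what a hash aggregate amounts to. Under that assumption, counting occurrences is the same single pass over the same keys, with a counter stored as the value. The function names and sample rows are hypothetical. In PySpark the two operations being compared would be `df.dropDuplicates()` and `df.groupBy(...).count()`; whether they really cost the same there still needs to be measured.

```python
from collections import Counter


def drop_duplicates(rows):
    """Hash-based dedup: one pass, one hash-table entry per distinct row."""
    seen = {}
    for row in rows:
        seen.setdefault(row, None)  # only the first occurrence creates an entry
    return list(seen)


def count_occurrences(rows):
    """Same pass and same keys, but each entry holds a running count."""
    counts = Counter()
    for row in rows:
        counts[row] += 1
    return dict(counts)


# Hypothetical (url, caption) samples, not real Common Crawl data.
samples = [
    ("http://a/img.jpg", "a cat"),
    ("http://b/img.jpg", "a dog"),
    ("http://a/img.jpg", "a cat"),
]

unique = drop_duplicates(samples)
counts = count_occurrences(samples)
```

The keys of `counts` are exactly the rows in `unique`, so the count version yields the deduplicated dataset plus one extra integer per sample.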