rom1504/cc2dataset
Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
MIT License
307 stars, 23 forks
Use yield to combine main parsing and dedup.
#9
Closed
rom1504 closed 1 year ago
rom1504 commented 1 year ago
Saves space and is faster (2x).
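A minimal sketch of the idea in the title, not the actual cc2dataset code: the parser is written as a generator, and deduplication consumes it lazily, so no intermediate list of all parsed items is built. The function names (`parse_links`, `dedup`, `parse_and_dedup`) and the `(url, caption)` record shape are hypothetical, chosen only for illustration.

```python
from typing import Hashable, Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # hypothetical (url, caption) record


def parse_links(documents: Iterable[Iterable[Pair]]) -> Iterator[Pair]:
    """Yield parsed pairs one at a time instead of returning a full list."""
    for document in documents:
        for url, caption in document:
            yield url, caption


def dedup(items: Iterable[Hashable]) -> Iterator[Hashable]:
    """Yield each item the first time it is seen, preserving order."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


def parse_and_dedup(documents: Iterable[Iterable[Pair]]) -> Iterator[Pair]:
    """Chain parsing and dedup so both happen in a single streaming pass."""
    return dedup(parse_links(documents))


if __name__ == "__main__":
    docs = [
        [("http://a/img.png", "a cat"), ("http://b/img.png", "a dog")],
        [("http://a/img.png", "a cat"), ("http://c/img.png", "a bird")],
    ]
    for pair in parse_and_dedup(docs):
        print(pair)
```

The `seen` set still grows with the number of unique items, so the space saving in this sketch comes from never materializing the full, pre-dedup list of parsed pairs.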