rom1504/cc2dataset
Easily convert Common Crawl to a dataset of captions and documents. Image/text, audio/text, video/text, ...
MIT License
307 stars, 23 forks
Revamp cc2dataset warc text extraction #38
Open
harry-stark opened 1 year ago
harry-stark commented 1 year ago
Added language detection
Added a license detector
Added perplexity measures