rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
MIT License
309 stars 23 forks

check structured CC extraction #31

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago

http://webdatacommons.org/structureddata/#results-2021-1

rom1504 commented 1 year ago

http://webdatacommons.org/structureddata/2021-12/stats/schema_org_subsets.html

rom1504 commented 1 year ago

Mostly references to entities, with no way to actually get the entities themselves from there. Still quite interesting for metadata collection, however.
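
A rough way to check this on a downloaded subset: as far as I know the Web Data Commons extractions are published as N-Quads, so one can count how many objects are literals (actual values such as a caption) versus references (IRIs or blank nodes). This is only a sketch. The regex is a simplified N-Quads matcher, not a full parser, and `classify_objects` and the `SAMPLE` lines are hypothetical, made up for illustration rather than taken from the real dumps.

```python
import re
from collections import Counter

# Simplified N-Quads pattern: subject, predicate, object, graph (page URL).
# Assumption: one quad per line; this is not a spec-complete parser.
QUAD_RE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"\S*)\s+(<[^>]*>)\s*\.\s*$'
)


def classify_objects(lines):
    """Count quads whose object is a literal vs. a reference (IRI or blank node)."""
    counts = Counter()
    for line in lines:
        match = QUAD_RE.match(line.strip())
        if not match:
            counts["unparsed"] += 1
            continue
        obj = match.group(3)
        counts["literal" if obj.startswith('"') else "reference"] += 1
    return counts


# Hypothetical sample quads, invented for illustration only.
SAMPLE = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ImageObject> <http://example.com/page> .',
    '_:b0 <http://schema.org/contentUrl> <http://example.com/cat.jpg> <http://example.com/page> .',
    '_:b0 <http://schema.org/caption> "A cat on a sofa"@en <http://example.com/page> .',
    '_:b1 <http://schema.org/image> _:b0 <http://example.com/page> .',
]

if __name__ == "__main__":
    print(classify_objects(SAMPLE))
```

On a real subset the same function could be fed the lines of a decompressed file. A high share of references relative to literals would match the observation above, while the literals are the part that is useful for metadata collection.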