Closed craigschmidt closed 1 year ago
Hi Craig, this is a great point! I saw the discussion on Discord. We are working on a feature that will store more metadata for the Common Crawl slice (see issue #39).
To answer your question here for completeness: currently, we cannot map an individual record back to the original URL, perplexity score, etc., because this data is not contained in the current version of the dataset.
The Common Crawl data entries have a source field like this:
"source":"cc/2023-06/en_head_0000.json.gz/line401859"
What's the right way to map that back to the metadata describing where the entry came from? In particular, I'd like the original URL and the timestamp when it was downloaded. Is that possible? Most of the metadata seems to be expressed in terms of the WARC format, not the WET format that I believe cc_net used to process the data.
Thanks, Craig
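For anyone who wants to work with the source field in the meantime, here is a minimal Python sketch that splits the string into its parts. The field names (snapshot, shard, line) and the helper parse_source are hypothetical, and the interpretation is an assumption based only on the example above; in particular, whether the line number is 0-based or 1-based is not confirmed here. This does not recover the URL or timestamp, which are not in the current version of the dataset.

```python
import re
from typing import NamedTuple


class SourceRef(NamedTuple):
    """Parts of a source string (field names are illustrative assumptions)."""
    snapshot: str  # e.g. "2023-06"
    shard: str     # e.g. "en_head_0000.json.gz"
    line: int      # number after "line"; indexing base is unknown


# Pattern inferred from the single example in this thread, not from a spec.
_SOURCE_RE = re.compile(
    r"^cc/(?P<snapshot>\d{4}-\d{2})/(?P<shard>[^/]+\.json\.gz)/line(?P<line>\d+)$"
)


def parse_source(source: str) -> SourceRef:
    """Split a source string into snapshot, shard file name, and line number."""
    match = _SOURCE_RE.match(source)
    if match is None:
        raise ValueError(f"unrecognized source format: {source!r}")
    return SourceRef(match["snapshot"], match["shard"], int(match["line"]))


if __name__ == "__main__":
    print(parse_source("cc/2023-06/en_head_0000.json.gz/line401859"))
```

The parsed parts identify a snapshot, a shard file, and a position within that file, which is as far as the source string alone can take you.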