How can you map the common crawl source back to metadata?

togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Apache License 2.0

How can you map the common crawl source back to metadata? #31

Closed craigschmidt closed 1 year ago

craigschmidt commented 1 year ago

The common crawl data entries have a source like this:

"source":"cc/2023-06/en_head_0000.json.gz/line401859"

What's the right way to map that back to metadata about where the entry came from? In particular, I'd like the original URL and the timestamp at which it was downloaded. Is that possible? Most of the metadata seems to be in terms of the WARC format, not the WET format, which I believe cc_net used to process the data.

Thanks, Craig

mauriceweber commented 1 year ago

Hi Craig, this is a great point! I saw the discussion on Discord. We are working on a feature that allows storing more metadata for the Common Crawl slice (see issue #39).

mauriceweber commented 1 year ago

To answer your question here for completeness: currently, we cannot map an individual record back to the original URL, perplexity score, etc., because this data is not contained in the current version of the dataset.
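
The URL and timestamp cannot be recovered, but the `source` string itself can still be split into its parts, which is handy for grouping records by snapshot or shard. The following is a minimal sketch, not code from the repository. It assumes the layout `cc/<snapshot>/<language>_<bucket>_<shard>.json.gz/line<N>`, inferred only from the single example in the question; `parse_cc_source`, `CCSource`, and the field names are hypothetical, and whether the line number is 0- or 1-based is not stated in this thread.

```python
import re
from typing import NamedTuple


class CCSource(NamedTuple):
    """Parts of a Common Crawl `source` string (field names are hypothetical)."""
    snapshot: str   # e.g. "2023-06"
    language: str   # e.g. "en"
    bucket: str     # e.g. "head"
    shard: int      # e.g. 0, parsed from "0000"
    file_name: str  # e.g. "en_head_0000.json.gz"
    line: int       # e.g. 401859 (0- or 1-based: unknown)


# Assumed layout: cc/<snapshot>/<language>_<bucket>_<shard>.json.gz/line<N>
_SOURCE_RE = re.compile(
    r"^cc/(?P<snapshot>\d{4}-\d{2})/"
    r"(?P<file_name>(?P<language>[a-z]+)_(?P<bucket>[a-z]+)_(?P<shard>\d+)\.json\.gz)/"
    r"line(?P<line>\d+)$"
)


def parse_cc_source(source: str) -> CCSource:
    """Split a `source` value into its components, or raise ValueError."""
    match = _SOURCE_RE.match(source)
    if match is None:
        raise ValueError(f"unrecognized source format: {source!r}")
    return CCSource(
        snapshot=match["snapshot"],
        language=match["language"],
        bucket=match["bucket"],
        shard=int(match["shard"]),
        file_name=match["file_name"],
        line=int(match["line"]),
    )


if __name__ == "__main__":
    print(parse_cc_source("cc/2023-06/en_head_0000.json.gz/line401859"))
```

This only decomposes the identifier. It does not link a record to a WARC or WET entry, which, as noted above, the current version of the dataset does not support.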