huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0

Metadata process #4

Closed ellenxtan closed 9 months ago

ellenxtan commented 9 months ago

Hey @HugoLaurencon and other OBELICS authors,

Great work, and thanks so much for open sourcing OBELICS!

Since s3://m4-datasets/webdocs does not seem to be publicly available, I was wondering how the metadata_dataset in 01_download_warc.py is created, specifically what its format and features are. Is it possible to share the code for that?

For example, starting from the Common Crawl webpage here, there are files of different types such as warc, wat, etc., and I was wondering how you process the wat files in order to read the warc files (s3://m4-datasets/webdocs/pointers_cc_dataset/).

aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/
                           PRE crawldiagnostics/
                           PRE robotstxt/
                           PRE warc/
                           PRE wat/
                           PRE wet/

Many thanks!

HugoLaurencon commented 9 months ago

Thanks @ellenxtan for your comment.

The metadata_dataset contains, for each row, the warc_filename, the warc_record_offset and the warc_record_length.

These 3 items define a pointer to a specific document in a Common Crawl dump. With these, we are able to locate a document and download its WARC file with warc_downloader.py, from which we can extract the HTML code of the web page with html_extractor.py.

These 3 items are also present in the final version of the dataset, in the column general_metadata (see an example).
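To illustrate, here is a minimal sketch (not the repo's warc_downloader.py / html_extractor.py, and assuming the public data.commoncrawl.org endpoint and the warcio library, which are not necessarily what we used) of how such a pointer can be turned into the page's HTML with a single HTTP range request:

```python
# Sketch: fetch one WARC record from Common Crawl using a
# (warc_filename, warc_record_offset, warc_record_length) pointer.
import io
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

CC_PREFIX = "https://data.commoncrawl.org/"  # assumed public endpoint

def fetch_html(warc_filename, warc_record_offset, warc_record_length):
    """Download a single WARC record by byte range and return the page HTML bytes."""
    start = warc_record_offset
    end = warc_record_offset + warc_record_length - 1  # Range header is inclusive
    resp = requests.get(
        CC_PREFIX + warc_filename,
        headers={"Range": f"bytes={start}-{end}"},
        timeout=60,
    )
    resp.raise_for_status()
    # The returned byte range is itself a complete gzipped WARC record.
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read()  # raw HTML of the page
    return None
```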

We already knew in advance which documents we wanted to target, because they were the result of a preliminary deduplication and a selection of English content only, but this code is not present in the current repo.

Let me know if this doesn't answer your question!

ellenxtan commented 9 months ago

Thanks @HugoLaurencon for the prompt response and clarification!

May I ask which metadata warc_record_offset and warc_record_length correspond to in the original warc or wat files in Common Crawl? How are they calculated? Specifically, how is the byte range obtained? It seems warc_record_length does not equal the Content-Length in Common Crawl's metadata.

Also, I was wondering if there is any plan to open-source the code for the preliminary deduplication and the selection of English content?

Thank you!

HugoLaurencon commented 9 months ago

Sure. A Common Crawl dump consists of many WARC files, which are a compressed format, and each WARC file contains the data of many web pages.

You can retrieve the data of one specific web page without downloading the whole WARC file (which is big), as long as you also know where that data is located in the file: the name of the WARC file the document is included in (warc_filename), the byte at which to start downloading (warc_record_offset), and how many bytes to read (warc_record_length). Note that warc_record_length is the size of the whole compressed WARC record, headers included, which is why it generally does not match the HTTP Content-Length of the page.

We initially only had a list of URLs we wanted to keep, probably with the ID of the Common Crawl dump too. Then, we retrieved warc_filename, warc_record_offset and warc_record_length for each URL with SQL queries against the Common Crawl index using AWS Athena.
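As a rough sketch of what such a query can look like (this is not our exact query; the table and column names follow Common Crawl's public cc-index-table documentation, and the crawl ID, URL list, region and results bucket are placeholders):

```python
# Sketch: look up WARC pointers for a list of URLs via AWS Athena.
import boto3

QUERY = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM ccindex.ccindex
WHERE crawl = 'CC-MAIN-2018-17'
  AND subset = 'warc'
  AND url IN ('https://example.com/page1', 'https://example.com/page2')
"""

athena = boto3.client("athena", region_name="us-east-1")
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
print(execution["QueryExecutionId"])  # poll this ID, then read the CSV results from S3
```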

For the second part of your comment, we will not release the code for these steps because it is messy. But the selection of English content is done with a fastText classifier (it is also present in the filtering part of the repo if you want to see the function, how to load it and how to use it). The preliminary deduplication is a MinHash deduplication, and there are open-source implementations available; see the sketch below.
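Purely for illustration, here is a minimal sketch of both steps. It assumes fastText's public lid.176.bin language-ID model, a 0.9 confidence threshold, and the datasketch library for MinHash LSH; none of these choices are necessarily the ones we used.

```python
# Sketch: English-only selection with a fastText language-ID model,
# followed by MinHash-based near-duplicate removal with datasketch.
import fasttext                              # pip install fasttext
from datasketch import MinHash, MinHashLSH   # pip install datasketch

# --- Language identification -------------------------------------------------
lang_model = fasttext.load_model("lid.176.bin")  # model file assumed to be downloaded

def is_english(text, threshold=0.9):
    """Keep a document only if the classifier predicts English confidently."""
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

# --- MinHash deduplication ----------------------------------------------------
def minhash(text, num_perm=128):
    """Build a MinHash signature from whitespace tokens (shingling choices vary)."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard threshold is a placeholder
kept = []
for doc_id, text in [("a", "the cat sat on the mat"), ("b", "the cat sat on a mat")]:
    if not is_english(text):
        continue
    sig = minhash(text)
    if lsh.query(sig):          # a near-duplicate is already indexed
        continue
    lsh.insert(doc_id, sig)
    kept.append(doc_id)
```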

Don't hesitate if you have more questions!

ellenxtan commented 9 months ago

Thanks so much, Hugo!